Functional vs. object-oriented string handling

fortran-lang / stdlib

Fortran Standard Library

https://stdlib.fortran-lang.org

MIT License


Functional vs. object-oriented string handling #334

Open awvwgk opened 3 years ago

awvwgk commented 3 years ago

To explore the capabilities of string handling, I have implemented two kinds of strings so far:

non-extendible, functional string_type (see #320), similar to iso_varying_string
abstract base class string_class (see #330)

Currently, both implementations provide only the bare minimum functionality of the intrinsic deferred-length character variables.

There are two kinds of questions to answer here:

should the experimental namespace of stdlib provide overlapping functionality?
do we prefer either of the string implementations, or do we want to look for something else?

This is related to #330, which implements an abstract base class for string objects. For prior discussions of strings, see #69.

ivan-pi commented 3 years ago

What would go into the category "something else"? (That is, apart from a new standardized intrinsic string type.)

Personally, I am happy with the functional (non-extendible) string type. Looking at the list of auxiliary methods in StringiFor, I think all of them can be implemented just as easily as functions (and not type-bound methods).

That said, one benefit of the object-oriented string class is that it suffices to import only the type in order to use the type-bound methods.
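To make the trade-off concrete, here is a minimal sketch of the two styles. The names `demo_strings`, `fstring`, `ostring`, `slen`, and `length` are hypothetical and chosen for illustration only; this is not the interface proposed in #320 or #330.

```fortran
! Sketch only: all names here are hypothetical, not the stdlib API.
module demo_strings
  implicit none
  private
  public :: fstring, ostring, slen

  ! Functional flavour: a plain type, operated on by free functions.
  type :: fstring
    character(len=:), allocatable :: raw
  end type fstring

  ! Object-oriented flavour: the same data plus a type-bound method.
  type :: ostring
    character(len=:), allocatable :: raw
  contains
    procedure :: length => ostring_length
  end type ostring

contains

  ! Free function: must be made public and imported by the caller.
  pure integer function slen(s)
    type(fstring), intent(in) :: s
    slen = len(s%raw)
  end function slen

  ! Type-bound procedure: reachable through any object of type ostring.
  pure integer function ostring_length(self)
    class(ostring), intent(in) :: self
    ostring_length = len(self%raw)
  end function ostring_length

end module demo_strings

program demo
  ! The free function has to be imported by name alongside the type,
  use demo_strings, only: fstring, slen
  ! whereas importing the type alone brings its bound methods along.
  use demo_strings, only: ostring
  implicit none
  type(fstring) :: f
  type(ostring) :: o

  f = fstring("stdlib")
  o = ostring("stdlib")
  print *, slen(f)      ! functional call
  print *, o%length()   ! type-bound call, same result
end program demo
```

Both calls return the same length; the difference is only in what the caller has to list in the `use` statement and in the call syntax.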

I would add that in C++ it is generally frowned upon to inherit from std::string (e.g. see "Why should one not derive from c++ std string class?"). Since we currently don't have any plans to introduce a string inheritance hierarchy with polymorphic behavior, I'm not convinced about the benefits of a string class.

wclodius2 commented 3 years ago

FWIW, for a "string type" to supplant the intrinsic character, I would make the internal representation an integer array so that it is straightforward to extend it to represent UCS/Unicode. The integer type could be INT8 if a UTF-8 representation is desired, INT16 for a UTF-16 representation, or INT32 for UTF-32. I would expect the UTF-32 representation to be the most straightforward to implement and best for East Asian ideographs, UTF-8 to be the most efficient for most European and Semitic languages, and UTF-16 to be the most efficient for most of the rest of the world.
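As a rough way to compare the storage cost of the three representations, the sketch below counts the bytes needed per code point under the standard UTF-8, UTF-16, and UTF-32 encoding rules. The helper names `utf8_bytes` and `utf16_bytes` are hypothetical, and the sample code points are an arbitrary choice for illustration.

```fortran
program code_unit_sizes
  ! Sketch only: utf8_bytes and utf16_bytes are hypothetical helpers.
  implicit none

  ! Sample code points (decimal): U+0041 Latin capital A,
  ! U+00E9 Latin small e with acute, U+65E5 a CJK ideograph,
  ! U+1F600 an emoji outside the Basic Multilingual Plane.
  integer, parameter :: samples(4) = [65, 233, 26085, 128512]
  integer :: i

  ! Columns: code point, bytes in UTF-8, bytes in UTF-16, bytes in UTF-32.
  do i = 1, size(samples)
    print *, samples(i), utf8_bytes(samples(i)), utf16_bytes(samples(i)), 4
  end do

contains

  ! Number of 1-byte (INT8) code units needed in UTF-8.
  pure integer function utf8_bytes(cp)
    integer, intent(in) :: cp
    if (cp < 128) then            ! below U+0080
      utf8_bytes = 1
    else if (cp < 2048) then      ! below U+0800
      utf8_bytes = 2
    else if (cp < 65536) then     ! below U+10000
      utf8_bytes = 3
    else
      utf8_bytes = 4
    end if
  end function utf8_bytes

  ! Bytes needed in UTF-16: one 2-byte (INT16) unit inside the BMP,
  ! a surrogate pair (two units) above it.
  pure integer function utf16_bytes(cp)
    integer, intent(in) :: cp
    if (cp < 65536) then
      utf16_bytes = 2
    else
      utf16_bytes = 4
    end if
  end function utf16_bytes

end program code_unit_sizes
```

With these samples, the CJK ideograph takes 3 bytes in UTF-8, 2 in UTF-16, and 4 in UTF-32, while UTF-32 is the only one of the three where every code point occupies exactly one array element.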