String handling routines

ivan-pi commented 4 years ago

Let's start a discussion on routines for string handling and manipulation. The thread over at j3-fortran already collected some ideas:

split - given a separator, splits the string into some form of array
upper/lower - convert a character string to all upper/lower case

The discussion also mentioned the proposed iso_varying_string module, which was supposed to include some string routines. I found three distinct implementations of this module:

ISO_VARYING_STRING Module by Rich Townsend
iso_varying_string implementation by Brad Richardson (@everythingfunctional)
ISO_VARYING_STRING due to J.L.Schonfelder (author of the iso_varying_string proposal; the module dates back to 1998)

I also found the following Fortran libraries targeting string handling:

Strings For Fortran by Brad Richardson @everythingfunctional
Fortran Character String Utilities by George Benthien
M_STRINGS from Urban Jost @urbanjost (part of General-Purpose Fortran tools)
String_Functions by David Frank
StringiFor by Stefano Szaghi @szaghi
fortranString from @bceverly
fortran-string-utility-module by @tomedunn
fortran-string from @dongli
strings by @jchristopherson
fortran_libstring from @koiking213
fortran-strings by @eengl
flibs by @arjenmarkus contains several modules for handling strings
functional-fortran by @milancurcic implements several functions on strings
ZstdFortranLib by @zbeekman has some conversion to/from other intrinsic kinds, sub, gsub, split, join, and conversion on concatenation. WIP though
LibString, F77 string library by Giulio Vistoli & Alex Pedretti

It is likely that several of the tools in the list of popular Fortran projects also contain some tools for working with strings. Given the numerous implementations it seems like this is one of the things where the absence of the standard "... led to everybody re-inventing the wheel and to an unnecessary diversity in the most fundamental classes" to borrow the quote of B. Stroustrup in a retrospective of the C++ language.

For comparison here are some links to descriptions of string handling functions in other programming languages:

Python: String Methods, String constants, custom string formatting, template strings
Ruby: String class & methods
D: std.string, std.utf, std.path, std.regex, std.ascii (see related issue #11), std.encoding, std.windows.charset, std.conv
C: C string handling on wikipedia
C++: std::string class, \<string> header, C++ string handling on Wikipedia, Boost libraries for string and text processing
Julia: Strings, Common operations
MATLAB: Characters and Strings
Rust: Strings - Rust By Example, Primitive Type str, Struct std::string::String

Obviously, for now we should not aim to cover the full set of features available in other languages. Since the scope is quite big, it might be useful to break this issue into smaller issues for distinct operations (numeric converions, comparisons, finding the occurence of string in a larger string, joining and splitting, regular expressions).

My suggestion would be to start with some of the easy functions like capitalize, count, endswith, startswith, upper, lower, and the conversion routines from numeric types to strings and vice-versa.

ivan-pi commented 4 years ago

I was hoping to also have some document that compares different languages, like @ivan-pi did above, so that we can stay compatible with the API, whenever it makes sense.

I realized there is a page for this on Wikipedia: https://en.wikipedia.org/wiki/Comparison_of_programming_languages_(string_functions)

As you might notice many of the string functions do not have a Fortran equivalent. Exactly those are the ones we should try and implement here.

ivan-pi commented 4 years ago

I don't know if the best way is to open a few draft PRs with the document (I assume there will be a few competing versions at first), or if there is another way to discuss a full document.

By document are you implying simply a markdown file? (ideally these string function specifications would later become part of the documentation)

certik commented 4 years ago

@ivan-pi the wikipedia page you posted is perfect, that's what I was hoping to create. Now when this is done, I think all we need is just some markdown document where we start listing the Fortran functions and their names and arguments, and we will consult the wikipedia page to see how other languages do that. Maybe even Wiki would be good enough to start. Do you want to start a wiki page at https://github.com/fortran-lang/stdlib/wiki ?

urbanjost commented 4 years ago

\ Missed this. It looks bright with promise. A lot to catch up on. A scanned the discussion and thought it was worth noting that conversion from a numeric value to a string is almost always to build a message. Overloading the concatenation operator is a great way to do that, but one of my favorite approaches is demonstrated in the MSG() function in the M_STRINGS module mentioned in the top post of the discussion. It allows constructs like

message=msg('The value is',A,'which squared is ',A**2,'and it is either ',.true.,'or not')

I find I use that type of function a lot.

More basic is that I saw mention of not using internal I/O to do the conversion to a string. That is really important if you want your function to be safely called from an I/O function such as a WRITE, and to be able to use it from within a PURE function. The algorithms to do a floating point value as well as an INTEGER value can be found back as far as the 1968 version of "Software Tools" in Ratfor!

But in order of frustration that they are not there in Fortran is case conversion, then string splitting, then conversion to and from strings, then Regular Expressions.

And I like the name "center" for centering a string, but I would vote for the name ADJUSTC() for a function that works on CHARACTER variables to complete the set (ADJUSTL(), ADJUSTR(), ...).

A lot to catch up on here but a great step in the right direction, I think. Is the main objective to create a specification for the Fortran standard or to create a working library in the Public Domain (or both?). It wasn't quite clear to me. Assuming a working library emerges what will the LICENSE be and how will it be hosted and documented? Maybe I missed that part of the discussion? I'm not sure I'm starting in the right place, actually.

As a footnote, the next most common issue being solved when converting a numeric value to a string is to present the number in a form the language does not support, like a non-standard base or perhaps in Roman numerals or in English text, for example.

ivan-pi commented 4 years ago

More basic is that I saw mention of not using internal I/O to do the conversion to a string. That is really important if you want your function to be safely called from an I/O function such as a WRITE, and to be able to use it from within a PURE function. The algorithms to do a floating point value as well as an INTEGER value can be found back as far as the 1968 version of "Software Tools" in Ratfor!

Did you avoid internal I/O in your library M_STRING? I never really thought about is internal I/O pure or not. But if that is really the case I would prefer to do our own string conversions routines.

And I like the name "center" for centering a string, but I would vote for the name ADJUSTC() for a function that works on CHARACTER variables to complete the set (ADJUSTL(), ADJUSTR(), ...).

Would you allow adjustc to have an optional width parameter or not? I like this name idea.

Is the main objective to create a specification for the Fortran standard or to create a working library in the Public Domain (or both?)

I think the initial purpose is to create a working library in the public domain (MIT or BSD license) which hopefully becomes standard in the feature. I guess a similar model is how some of the C++ Boost function libraries later become integrated into the language or the C++ standard library. You can read the related thread https://github.com/j3-fortran/fortran_proposals/issues/104 for more information.

urbanjost commented 4 years ago

After checking the 2018 standard it says specifically the unit cannot be an integer or asterisk, which implies internal I/O is OK; but my compiler gives an error if I include any WRITE statement, so I guess that internal I/O appears to be allowed.

I do have versions of a metamorphic class that takes any intrinsic type and does no I/O but it is not in the github versions of M_STRINGS. I was actually going to make a version this week, by coincidence. Now that I see it is my compiler and not the standard stopping me from declaring the procedure PURE and I am trying to read through everything here I think I will play catch-up here instead.

On January 8, 2020 at 3:03 AM Ivan notifications@github.com wrote:
    > > 
    More basic is that I saw mention of not using internal I/O to do the conversion to a string. That is really important if you want your function to be safely called from an I/O function such as a WRITE, and to be able to use it from within a PURE function. The algorithms to do a floating point value as well as an INTEGER value can be found back as far as the 1968 version of "Software Tools" in Ratfor!

> 
Did you avoid internal I/O in your library M_STRING? I never really thought about is internal I/O pure or not. But if that is really the case I would prefer to do our own string conversions routines.

And I like the name "center" for centering a string, but I would vote for the name ADJUSTC() for a function that works on CHARACTER variables to complete the set (ADJUSTL(), ADJUSTR(), ...).

Would you allow adjustc to have an optional width parameter or not? I like this name idea.,
I find the length useful particularly with centering, but find overloading adjustl and adjustr to have a length parameter useful too. With centering in particular you are often doing something like TITLE=adjustc("my title",len(title)) and it takes quite a bit of fiddling using another routine around the fixed text otherwise.
    > > 
    Is the main objective to create a specification for the Fortran standard or to create a working library in the Public Domain (or both?)

> 
I think the initial purpose is to create a working library in the public domain (MIT or BSD license) which hopefully becomes standard in the feature. I guess a similar model is how some of the C++ Boost function libraries later become integrated into the language or the C++ standard library. You can read the related thread [j3-fortran/fortran_proposals/#104](j3-fortran/fortran_proposals#104 https://github.com/j3-fortran/fortran_proposals/issues/104 for more information.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/69?email_source=notifications&email_token=AHDWN3J53MZOILMO7WSPY7DQ4WCDXA5CNFSM4KCFW352YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEILQSII#issuecomment-571935009 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AHDWN3KRISWEN277RKGU3VLQ4WCDXANCNFSM4KCFW35Q .

arjenmarkus commented 4 years ago

Re regexp: a couple of years ago I found a series of articles by Russ Cox about regular expression engines. The accompanying source code is not too difficult and I made a start rewriting it in Fortran. However, there is a flaw in my code and I never got around to correct it (the code is not too difficult but it is manipulating lists). I picked it up again. The result will not be a full-fledged RE engine à la PCRE, but it will be Fortran only and it should be useable for not entirely trivial tasks.

everythingfunctional commented 4 years ago

I never really thought about is internal I/O pure or not. But if that is really the case I would prefer to do our own string conversions routines.

Based on the fact that I was able to write pure toString functions here, internal I/O is pure.

certik commented 4 years ago

@urbanjost thanks for the post. As @ivan-pi replied, the goal of this stdlib effort is to provide a Fortran Standard Library, i.e., both a library and a specification. See my answer to a similar question. The license is MIT (https://github.com/fortran-lang/stdlib/blob/006bedafc0d40ff381da2bd4455f61b5e11fc2ee/LICENSE), and we will only depend on 3rd party code that is MIT or BSD style licensed. The way we plan to achieve our goal it to have a large community designing the API and a rigorous (high bar) process to get new features in as documented in our WORKFLOW document. And we have been coordinating with the Fortran Standards Committee (this effort started at the J3 committee repository at https://github.com/j3-fortran/fortran_proposals/issues/104), and also we are planning to getting them involved in the step 5 in the workflow (at least informally). Our goal is to get a wide community agreement and acceptance to adopt stdlib as the Fortran Standard Library. We will continue working closely with the Fortran Standard Committee and coordinate with them. I can imagine many arrangements in the future, up to even the Fortran Standard itself specifying a Standard Library; but that is far in the future. Right now our job is to get the community to agree on the APIs and to provide specifications and implementations and to build a community around it.

(Update: we added the motivation into README: https://github.com/fortran-lang/stdlib#goals-and-motivation)

milancurcic commented 4 years ago

datetime-fortran used internal I/O in a pure function for a long time (since the beginning I think) and this built fine with gfortran and ifort.

https://github.com/wavebitscientific/datetime-fortran/blob/d4683303e6319b6380bbf7717164f7d8f18e0f0d/src/lib/mod_datetime.f90#L1288

awvwgk commented 3 years ago

After reading through this thread I found subtle issue with the proposed low-level API for character(len=*) variables.

My top three name picks are:

uppercase/lowercase/titlecase

to_upper/to_lower/to_title

upper/lower/capitalize

Edit: for consistency with the character conversions functions to_lower/to_upper in the module stdlib_experimental_ascii it is maybe better to go for option 2.

Taking just the basic functionality mentioned by @ivan-pi here, I implemented a stdlib_character module as demonstration in #310, the issue becomes quickly apparent once you try to use both stdlib_character and stdlib_ascii in the same scope.

awvwgk commented 3 years ago

I created an exploratory implementation of a functional string handling at awvwgk/stdlib_string as fpm project. A non-fancy string type is implemented there, which basically provides the same functionality as a deferred length character but can be used in an elemental rather than a pure way. The idea is to have a scaffold for the string type in stdlib which can be extended later but already provides everything we are used to have from the deferred length character without the rough edges.

The overall implementation comes close to iso_varying_string, but it is not an iso_varying_string implementation. The main difference to iso_varying_string are

there is no assignment from string to character
- reason: there can be no assignment defined which covers both fixed length characters and deferred length characters as LHS
all procedures return a fixed length character rather than a string instance
- reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment
- drawback: assigning the return value to a string might create a temporary variable on the stack
no support for get and put
- reason: derived type IO is used instead

certik commented 3 years ago

@awvwgk this would be the high level API that operates on the string_type type.

How would a low level API look? Let's look at some examples, say the read_formatted function. It doesn't need the string_type, it could operate on character(len=:), allocatable directly, correct?

The maybe function can also operate on character(len=:), allocatable it seems. So it seems the low level API code would be considerably simpler, given that most of that file is a wrapper of character(len=:), allocatable into string_type, correct?

awvwgk commented 3 years ago

Let's look at some examples, say the read_formatted function. It doesn't need the string_type, it could operate on character(len=:), allocatable directly, correct?

Bad example, the read_formatted procedure defines a user defined derived type input (see #312), which cannot be defined for character(len=:), allocatable since there is already an intrinsic formatted read transfer for character(len=*) types defined.

this would be the high level API that operates on the string_type type.

The idea so far was to provide the intrinsic low level API for a string type, on which later the high level API can be defined.

So it seems the low level API code would be considerably simpler, given that most of that file is a wrapper of character(len=:), allocatable into string_type, correct?

Exactly, I wanted to explore a common basis of agreed on functions for a future high level string object. The minimal agreed on basis should be easily all the intrinsic procedures defined for character(len=*).

How would a low level API look?

I decided to pick the part of the high-level API that will have no overlap with a potential low-level API. This way the low level API can be explored separately, like in #310

The maybe function can also operate on character(len=:), allocatable it seems.

This one was chosen deliberately to be an internal implementation detail, i.e. it is not part of the public API.

certik commented 3 years ago

Here is what I mean: https://github.com/awvwgk/stdlib_string/pull/1

In that PR, I implemented a low level version of read_formatted called read_formatted0 that operates on character(len=:), allocatable. It seems to work.

ivan-pi commented 3 years ago

@awvwgk this would be the high level API that operates on the string_type type.

@certik, the procedures in Sebastian's module are in fact equivalents of the intrinsic character procedures already available in Fortran:

    public :: len, len_trim, trim, index, scan, verify, repeat, adjustr, adjustl
    public :: lgt, lge, llt, lle, char, ichar, iachar
    public :: assignment(=)
    public :: operator(.gt.), operator(.ge.), operator(.lt.), operator(.le.)
    public :: operator(.eq.), operator(.ne.), operator(//)
    public :: write(formatted), write(unformatted)
    public :: read(formatted), read(unformatted)

The pull request #310 is the first to propose new procedures (reverse, to_title, to_upper, to_lower) which can operate both on the intrinsic character(len=*) variables, and in a later pull request also on a high-level string type (to be decided).

String processing in Fortran is not that bad, considering the number of procedures already there. If we could add casing, numeric to string convertors (and vice-versa), join and split, and perhaps a few more procedures, I think most usage cases would be covered.

certik commented 3 years ago

@ivan-pi actually there is genuine new functionality, that I just extracted here: https://github.com/awvwgk/stdlib_string/pull/1#issuecomment-776927915.

awvwgk commented 3 years ago

@certik I see, you are right the there is new functionality, the new_string_from_chars function extends beyond intrinsic functionality. I would suggest that I will submit it as a separate patch to the existing stdlib_ascii module and remove it from the stdlib_string_type until we have agreed on the low-level API. (new issue at #315)

Regarding the len functionality I find the argument is bit stretched, but I'm willing to follow it for the sake of the discussion. Eventually, the argument boils down to whether the utility function maybe should be exposed as public API. I think it is an implementation detail, because the specs as proposed do not define the internal representation of the character sequences, the string_type could use character(len=1), allocatable :: raw(:) instead of character(len=:), allocatable :: raw like in most iso_varying_string implementations.

The maybe, ok and err functionality from Rust come to mind here and would make a great addition for stdlib as well if we can successfully emulate this kind of behaviour. I don't feel confident that I got a good spin on this kind of features to make it stable enough for an actual addition to stdlib, therefore I don't want to force a maybe implementation for a character(len=:), allocatable yet.

However, I disagree on the low level API for user defined derived type input output, it is strictly a feature that can only be defined for a derived type but not an intrinsic and we won't be able to make use of it to safely read into a character(len=:), allocatable :: dlc with read(unit, *) dlc due to the construction of the Fortran standard.

The gist is, I don't want to introduce new functionality beyond the existing character(len=:), allocatable with the string_type on the first pass. Just one step at a time to allow better focus.

certik commented 3 years ago

@awvwgk I just saw your comment, my comment here I think replies to yours: https://github.com/awvwgk/stdlib_string/pull/1#issuecomment-777027885.

awvwgk commented 3 years ago

There is now also a branch at my stdlib fork. There is one really unfortunate thing here, GCC 7 and 8 do not support evaluation of user defined pure procedures in variable declarations. Adopting this string_type will inevitably drop support for GCC 7 and 8.

The solution is to adopt the iso_varying_string strategy to return a string_type instead of a fixed length character, which comes with its own problem that results from stdlib_string_type procedures now must be explicitly cast back to character form.

awvwgk commented 3 years ago

As promised in https://github.com/fortran-lang/stdlib/pull/320#issuecomment-779351129 I tried to devise an abstract base class (ABC) for an extendible string class. This one turned out much more difficult to design than a non-extendible functional string type, you can check the base class definition here:

https://github.com/awvwgk/stdlib_string/blob/string-class/src/stdlib_string_class.f90

The class is a bit more bloated than it has to be because I made it compatible with the intrinsic character type and the functional string type as well to ease testing.

One thing that turns out to be very difficult to account for are overloaded intrinsic procedures, you can find two implementation for each intrinsic procedure (except for the lexical comparison where I took a shortcut), one for the overloaded generic interface (len(string)) and a type bound implementation (string%get_len()), with the former invoking the latter. This was necessary to allow using the overloaded intrinsic procedure names while still relying on the runtime resolution of the type bound procedures from the object.

Another problem was returning a class polymorphic object from a procedure (operator(//) or trim), returning class(string_class), allocatable would force users to declare their string objects always as class polymorphic even if they want to use a specific implementation. Therefore, I decided to return a functional string_type instance instead and provide an assignment from string_type to a polymorphic string_class object to hide this fact effectively.

Since we have a whole lot of intrinsic character procedures implementing a string class based on this ABC can become tedious, therefore I designed the ABC to provide mock implementations based on the setter (assignment(=)) and getter (char(self)) functionality which can optionally be overwritten. Only the assignment from a character variable and the three char functionalities are actually deferred and must be provided in a minimal implementation.

While this is not a final specification yet, I wanted to share it as aid for discussion functional vs. object oriented implementation of a string in stdlib. From the above notes you might gather that a truly extendible string class could result in significant performance penalties for the user. Still there might be some value in having a string object available.

ivan-pi commented 3 years ago

The overall implementation comes close to iso_varying_string, but it is not an iso_varying_string implementation. The main difference to iso_varying_string are

there is no assignment from string to character

reason: there can be no assignment defined which covers both fixed length characters and deferred length characters as LHS

If I understand things correctly, the assignment to character should be handled explicitly through the char function? I.e.

type(varying_string) :: varying   ! from iso_varying_string
type(string_type) :: nonfancy     ! from PR #320 
character(len=20) :: flc 
character(len=:), allocatable :: dlc

flc = varying ! works
dlc = varying ! fails, dlc needs to be allocated first

allocate(character(len=len(varying)) :: dlc)
dlc = varying ! works

flc = char(nonfancy) ! works
dlc = char(nonfancy) ! works

all procedures return a fixed length character rather than a string instance

reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment

drawback: assigning the return value to a string might create a temporary variable on the stack

Which procedures does this hold for?

no support for get and put

reason: derived type IO is used instead

:+1: This is better and more Fortranic IMO. put and get where borrowed from C.

awvwgk commented 3 years ago

all procedures return a fixed length character rather than a string instance

reason: returning a derived type makes the handling of string types more involved, instead the fixed length character is converted back to a string type by assignment

drawback: assigning the return value to a string might create a temporary variable on the stack

Which procedures does this hold for?

None, because I had to reconsider this design choice due to missing compiler support.

ivan-pi commented 3 years ago

Does the initial design choice (the one which breaks GCC 7 and 8 support) survive in any of the earlier commits on your private fork? I wonder if you could still pull it off, by moving the functions out of a module...

I still don't fully grasp how the implementation differed. Would for example the repeat(string, ncopies) accept a type(string_type), and use an overloaded pure len function to return a fixed-size result of size len(string)*ncopies?

In any case your pull request is a big step to make string-handling easier.

awvwgk commented 3 years ago

@ivan-pi See https://github.com/awvwgk/stdlib_string/tree/a2833b6dd3b21abc42f8854a7fc3049eaf9b39ff for a version based entirely on returned character values. I think this version could run into problems when used in an elemental way.

gronki commented 3 years ago

I have recently learned that overloading an assignment operator is a mistake in most cases. For example, appending one element to an allocatable array using the notation:

string_array = [string_array, string("new_string")]

will not work.

With this design flaw of a language, I'd argue that overloading assignments should be avoided at all cost.

Dominik

pt., 5 mar 2021 o 13:48 Sebastian Ehlert notifications@github.com napisał(a):

@ivan-pi https://github.com/ivan-pi See https://github.com/awvwgk/stdlib_string/tree/a2833b6dd3b21abc42f8854a7fc3049eaf9b39ff for a version based entirely on returned character values. I think this version could run into problems when used in an elemental way.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/69#issuecomment-791398357, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4NA3N5NJ3COJ2CJF3KNWLTCDHJRANCNFSM4KCFW35Q .

awvwgk commented 3 years ago

I updated my stdlib_string project with an abstract base class for a more object-oriented string implementation. As a demonstration of such a string_class I added @robertrueger's ftlString and @szaghi's StringiFor projects as examples to the repository, but based them on string_class rather than having them implement the intrinsic functions themselves. This could allow to make existing string libraries easily compatible with stdlib by allowing them to inherit from string_class (and they would also become compatible with each other).

ivan-pi commented 3 years ago

Not sure if it was linked before, Clive Page wrote a nice summary about character types in Fortran: https://fortran.bcs.org/2015/suggestion_string_handling.pdf

There was also a thread over at the Fortran-FOSS programmers: https://github.com/Fortran-FOSS-Programmers/Fortran-202X-Proposals/issues/4

A link was provided to a WG5 document, which talks about a print() function (page 9) similar to what we have now as to_string(): https://wg5-fortran.org/N1951-N2000/N1972.pdf

fortran-lang / stdlib

String handling routines #69