Include a `split` function (202X feature)

certik commented 3 years ago

Fortran 202X will include a new intrinsic split function:

https://github.com/j3-fortran/fortran_proposals/issues/187

It was approved, and then the API was changed after approving it. We should implement the latest approved version in stdlib, and play with it and ensure that the API looks good. And if we discover some improvements, we should propose them at the February 2021 Fortran Standards meeting.

Then once split becomes part of the next Fortran standard, we can have a section in stdlib called "backwards compatibility", where we can have a "reference implementation" of such new features, so that people can use them right away even if some compiler might not support them yet.

milancurcic commented 3 years ago

I put an implementation here: https://github.com/milancurcic/fortran202x_split.

This is a "naive" implementation--I went for what seemed to me as the simplest solution. The split_first_last specific subroutine, which is the second form listed in 20-139, does much of the grunt work. I wrote it in a mostly functional style, so it does some unnecessary copies that we may want to refactor in an imperative style.

For tests so far I only used the three examples from 16.9.194 in 20-007. They seem okay. More tests will be needed at the time of PR for stdlib.

At this time, I'd like to get feedback on this before I prepare a PR for stdlib.

jvdp1 commented 3 years ago

Thank you for the implementation. I played a bit with it and it looks good to me. The API seemed a bit strange at the start. I wonder if the version with pos is not a bit overlapping with the intrinsic function index (note: I understand that this is a implementation of the proposed standard)

certik commented 3 years ago

@jvdp1 thanks for the feedback. That is precisely why I suggested we do this, to get more experience about the API and perhaps propose some changes to it before it gets standardized. @jvdp1 how would you change the API to be more natural?

milancurcic commented 3 years ago

Here's my impression of the API.

Input dummy argument set is a scalar character string that contains all separators to be used for delimiting tokens. For example, if you pass ";, " as a value of set, then any of these characters will be used for delimiting. In this current definition, I don't think it's possible to delimit using a multiple-character string, although I can't think of a use case for this. It seems to me that a more intuitive API would be for set to be an array of characters, so:

character, intent(in) :: set(:)

instead of

character(*), intent(in) :: set

Then you'd pass it as [";", ",", " "] instead of ";, ". But this is I think more a matter of style than anything else.

I found the API to be idiomatic Fortran--subroutines with intent(in out) arguments for output--which is not in itself a bad thing and helps minimize unnecessary copies. I only wish there was also a convenience function for when a user only wants the tokens as a result and nothing else, and doesn't mind an extra copy. The interface would look like this:

pure function split(string, set) result(tokens)
  character(*), intent(in) :: string
  character(*), intent(in) :: set
  character(:), allocatable :: tokens(:)

and then you call it like this:

tokens = split(string, set)

which would allow a more functional style by passing split(string, set) as an argument to other functions.

So, to that end, I'd propose that in the stdlib we also include this 4th specific procedure, even if it ends up being non-standard, or an extension. So we'd have a total of 4 forms of split:

subroutine split(string, set, tokens, separator)
subroutine split(string, set, first, last)
subroutine split(string, set, pos, back)
function split(string, set)

ivan-pi commented 3 years ago

Unfortunately in Fortran you can not force functions and subroutines under the same interface. So the last split has to be named differently.

milancurcic commented 3 years ago

Yes, I remembered this rule later. If people desire this as a function, maybe string_tokens, or strtok to mirror the C analog.

jvdp1 commented 3 years ago

@jvdp1 thanks for the feedback. That is precisely why I suggested we do this, to get more experience about the API and perhaps propose some changes to it before it gets standardized. @jvdp1 how would you change the API to be more natural?

First I expected a function (similar to scan that has a similar interface (scan(string, set [,back [,kind]]). I was thus a bit surprised that it was a subroutine. However it does make sense with the different ouputs.

Second, I don't see the value of the return value pos, if you can get separator, or first. (Maybe efficiency?) I must miss something there (note that I didn't read all the discussions about this proposition). Furthermore this version of split is quite similar to the intrinsic function index.

Finally, I was wondering why last was not optional. Isn't it that last(1:ntokens-1)=fist(2:ntokens)-1 (or somehting like that)? Do I miss something?

Anyway, this version would be already a great addition in stdlib and later in the Standard.

milancurcic commented 3 years ago

Yes, pos is useful for efficiency if you're searching for a delimiter at a specific position in the string. Both other forms parse the whole string.

You can't predict the values of last based on values of first because if token(n) is an empty string, last(n) is first(n)-1.

certik commented 3 years ago

We should also do comparisons with Python and other languages, as we usually do for stdlib.

On Sun, Nov 8, 2020, at 11:06 PM, Milan Curcic wrote:

Yes, pos is useful for efficiency if you're searching for a delimiter at a specific position in the string. Both other forms parse the whole string.

You can't predict the values of last based on values of first because if token(n) is an empty string, last(n) is first(n)-1.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/241#issuecomment-723780868, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAFAWHVKFSQ2PSCCE7OD43SO6BHTANCNFSM4TBIUC4Q.

milancurcic commented 3 years ago

Similar capability in other languages:

milancurcic commented 3 years ago

The key differences seem to be:

Behavior of the separator. Fortran split works like that of C and C++ strtok (multiple characters in a string are possible separators), whereas in Python, Julia, MATLAB, and Go, the whole delimiter string is used as a single delimiter. Rust's delimiter allows specifying a pattern. For a few of these implementations, delimiter defaults to an empty space if omitted;
Result value. Fortran split updates the in out argument in-place. strtok returns a pointer to the beginning of the tokens sequence. Others simply return a list or array of strings as function result.
Unlike others, Fortran split allows returning first and last indices instead of tokens. If this functionality exists in other languages, it's defined in some other function. So, Fortran here packs multiple functionalities under one name, a design that I'm not fond of, but can get used to it.

certik commented 3 years ago

Thanks @milancurcic, very helpful, exactly what I was looking for. The first example:

https://github.com/milancurcic/fortran202x_split/blob/a49ccf5b0775732cc4d1087c71fd513d4e921a6a/test/main.f90#L13

I think is natural. That corresponds to the function split_tokens:

https://github.com/milancurcic/fortran202x_split/blob/a49ccf5b0775732cc4d1087c71fd513d4e921a6a/src/fortran202x_split.f90#L14

So I think I like the API of split_tokens.

I am not a fond of bundling the other functionality into the same split overload. In terms of usability, I think the split_tokens will be the most useful and helpful. The other split_first_last and split_pos might not be as useful, it's hard to tell. I would feel much better if they could be used in stdlib in the experimental section first and see.

jvdp1 commented 3 years ago

I am also not fond of the current API, but even like that, it would be a nice addition. I would suggest to add the current version in the experimental namespace of stdlib, such that people can test it.

milancurcic commented 3 years ago

I added the string_tokens function which is a thin wrapper around split_tokens.

This allows the user to do:

tokens = string_tokens(string, set)

There is also a simple benchmark program in app/main.f90 to compare the run-time between string_tokens function and the equivalent split_tokens subroutine, given a decent size input string (data included in the repo). On my laptop:

$ fpm run
 split subroutine, elapsed  0.482727677     seconds
 string_tokens function, elapsed  0.565418363     seconds

jvdp1 commented 3 years ago

Thank you @milancurcic for the function and the test. Can the difference be explained by calling the subroutine inside the function? Or was the subroutine inlined in the function?

milancurcic commented 3 years ago

The subroutine is not inlined:

  pure function string_tokens(string, set) result(tokens)
    !! Splits a string into tokens using characters in set as token delimiters.
    character(*), intent(in) :: string
    character(*), intent(in) :: set
    character(:), allocatable :: tokens(:)
    call split_tokens(string, set, tokens)
  end function string_tokens

but it allocates the function result before returning it to the caller. It's this extra allocation that makes the difference.

There is probably some minimal overhead with calling the subroutine, but it should be negligible.

jvdp1 commented 3 years ago

A long discussion on the split function was held during the Fortran Monthly call (November 2020) (from5:39).

milancurcic commented 3 years ago

Comment from Youtube user ES on https://youtube.com/watch?v=HI-Yhn7Q8Ko:

Here's how I might choose to implement split in C++ (but you should be able to do it similarly in Fortran (which I'm still learning)):

The function or subroutine would have a string input, a set of delimiters input, and it would return/modify a "split_string" object whose structure would look like:

members: the string itself, an array of index locations of the delimiters in the string

member functions: A "get_nth_tocken" function to return the nth token of the string. Or even overload the operator[].

This way I keep allocations small and to a minimum, and scan through the string only once to construct the index array.

In my opinion, the best solution would have to involve a derived type because of the ragged array nature of the collection of tokens. Otherwise, the only solution that makes sense to me in the discussion is the split_first_last and the "find" version.

certik commented 3 years ago

@esterjo thank you and welcome! I personally agree and would like new additions to Fortran to first go into a library, get the API ironed out, get some usage, and only then propose it for the Fortran standard itself if there is interest. But other people at the committee have a different opinion on this and voted to have this in the language itself right away. So the second best we can do is to implement this into stdlib and identify potential issues with the API, and submit proposals for the committee to fix some of the issues that we found, and that is what @milancurcic plans to do. :)

esterjo commented 3 years ago

@certik Got it. Thank you for clarifying the situation for me, I wasn't aware.

esterjo commented 3 years ago

Apologies if this isn't the right place, but as a broad comment, if I could make one wish to the ISO gods it would be to make it easier for a user to build libraries. This would make it easier for the language to grow with the community

certik commented 3 years ago

@esterjo, you can propose that at https://github.com/j3-fortran/fortran_proposals, just open a new issue and try to describe the features you have in mind in more detail that would make it easier to build libraries. We can discuss it there.

ivan-pi commented 3 years ago

Comment from Youtube user ES on https://youtube.com/watch?v=HI-Yhn7Q8Ko:

Here's how I might choose to implement split in C++ (but you should be able to do it similarly in Fortran (which I'm still learning)): The function or subroutine would have a string input, a set of delimiters input, and it would return/modify a "split_string" object whose structure would look like:

members: the string itself, an array of index locations of the delimiters in the string

member functions: A "get_nth_tocken" function to return the nth token of the string. Or even overload the operator[].

This way I keep allocations small and to a minimum, and scan through the string only once to construct the index array. In my opinion, the best solution would have to involve a derived type because of the ragged array nature of the collection of tokens. Otherwise, the only solution that makes sense to me in the discussion is the split_first_last and the "find" version.

Hi @esterjo, I think your idea with a "split string" type is good. First we would need to agree upon some encapsulated string type (see #69). There are several disjoint community efforts in this direction already. With the new deferred-length character strings, it is quite simple to use something like:

type :: string
  character(len=:), allocatable :: s
end type

A generic interface can be used to overload the split function to return an array of type(string), allocatable :: string_tokens(:) which are of the correct length each. Internally, this function could call split_first_last or split_pos to minimize internal memory shuffling. For Fortran users too lazy to bother with derived types, it will not clash with the version that returns characters with trailing whitespace.

With the great strides fpm has been making, it is already becoming easier to use other community packages. In fact you can already test the version @milancurcic prepared at: https://github.com/milancurcic/fortran202x_split

esterjo commented 3 years ago

How about this:

There should be a string_array type. This type:

Would hold all the "string" elements of the array by storing them side by side in one contiguous character string, called "data".
It would also store an index array (possibly empty) whose nth element marks the end of the nth string in the "data".
A string type would be an instance of this where this index array is empty
Making use of this index array would allow for a function to return the nth string in the "data"

Inheriting form this string_array would be a split_string type:

It would be no different, but it's index array would simply be the locations of delimiters in the contiguous character string
By making use of this index array this child-type would have a function to return the nth token (perhaps overloading the one used by the parent type), which would just be the nth string in the "data" without the first character
Perhaps it holds an array of delimiters

Some things I like about this kind of implementation:

it reduces memory fragmentation because you do not allocate multiple chunks of heap for each string in the array, while also avoiding padding short strings like a simple character array of fixed size elements.
splitting a string type does not reallocate it's character vector, but simply modifies the index array
extra capacity can be added to the end of the "data" array to allow for fast addition of small string_arrays to the end.
if the index array has 2 columns, then the string elements can be accessed as if they are laid out in a matrix

fortran-lang / stdlib

Include a `split` function (202X feature) #241