fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License
1.05k stars 164 forks source link

String handling routines #69

Open ivan-pi opened 4 years ago

ivan-pi commented 4 years ago

Let's start a discussion on routines for string handling and manipulation. The thread over at j3-fortran already collected some ideas:

The discussion also mentioned the proposed iso_varying_string module, which was supposed to include some string routines. I found three distinct implementations of this module:

I also found the following Fortran libraries targeting string handling:

It is likely that several of the tools in the list of popular Fortran projects also contain some tools for working with strings. Given the numerous implementations it seems like this is one of the things where the absence of the standard "... led to everybody re-inventing the wheel and to an unnecessary diversity in the most fundamental classes" to borrow the quote of B. Stroustrup in a retrospective of the C++ language.

For comparison here are some links to descriptions of string handling functions in other programming languages:

Obviously, for now we should not aim to cover the full set of features available in other languages. Since the scope is quite big, it might be useful to break this issue into smaller issues for distinct operations (numeric converions, comparisons, finding the occurence of string in a larger string, joining and splitting, regular expressions).

My suggestion would be to start with some of the easy functions like capitalize, count, endswith, startswith, upper, lower, and the conversion routines from numeric types to strings and vice-versa.

jacobwilliams commented 4 years ago

This is a great summary! I feel like most of what we need has already been done in these projects (and others), so mainly we need to just gather it all together. Some important things to decide:

everythingfunctional commented 4 years ago

Should we base it on the ISO_VARYING_STRING module? If so, the class is VARYING_STRING and the procedures are functional (lower(s)).

Should we utilize intrinsic function names? like real(string, [kind, [status]]) and have it stop if the conversion fails and no status variable is provided?

milancurcic commented 4 years ago

functional-fortran implements several functions on strings:

Further, these functions (and their corresponding operators) are compatible with character strings: complement, empty, head, init, intersection, insert, last, reverse, set, sort, split, tail, and union.

(Caution: split in functional-fortran is not quite what's been discussed at j3-fortran repo. It merely splits the string in two and returns the first or second part)

certik commented 4 years ago

Thanks for this initiative and listing the current landscape. I think we definitely want stdlib to have good string support.

(For conversion from real/integer numbers to strings, I implemented a function str to be used like this: https://github.com/certik/fortran-utils/blob/b43bd24cd421509a5bc6d3b9c3eeae8ce856ed88/tests/strings/test_str.f90, implemented here and here, so one can do things like "Number i = " // str(i) // ".".)

@ivan-pi do you want to go ahead and create a table of the basic subroutines and let's brainstorm how they should be named, to be consistent with other languages and/or the above various string implementations if possible. And also if they should be functions or subroutines and what arguments to accept.

certik commented 4 years ago

@jacobwilliams is right about raising the question how to represent the string. We should start with that.

I would recommend (as usual) to have a lowest level API that operates on the standard Fortran (allocatable where appropriate) character. Then, have a higher level API that operates on a string type, and simply calls the lower level API. Regarding a name, see #26, it seems most people agree that the convention to name derived type is to append _t, so it would be string_t.

That way people can use these low level API routines right away. For example in my codes I do not need to modify any data structures and can start using it. The higher level string_t API can then be used by codes that choose to refactor them, or in new codes. If the syntax is not as nice, some people might opt for the lower level API anyway.

everythingfunctional commented 4 years ago

I would vote that the low level API be based on functions, pure and elemental where possible and appropriate. I would stick with the Fortran convention of optional status parameters where there is the possibility of things going wrong, and if one is not provided and something goes wrong it crashes. I have tended to use that convention in any routines that go from a string to some intrinsic like:

if (present(status)) then
    read(string, *) result
else
    read(string, *, iostat=status) result
end if

Honestly, I thought the ISO_VARYING_STRING standard did a great job of covering all of the intrinsic functions available for character(len=*) variables, and extending IO to work with that type (put, get, put_line). Aside from the strange interface for split I think it's a great starting point.

ivan-pi commented 4 years ago

My plan was to go through the libraries above and create a table of the most commonly available routines in the next days.

I agree we should consider both low-level routines which work directly on strings of type character(len=*) and a high-level string_t type.

The book Fortran Tools for VAX/VMS and MS-DOS by Jones & Crabtree contains a description of a Fortran string-handling library. Interestingly, they decided to use null-terminated strings like in C, meaning they needed to build a separate set of functions from the intrinsic ones (concatenation operator // and length function). They later used these tools to develop a compiler for a subset of the Fortran language itself! Their conclusion about strings was:

Fortran is often maligned for its lack of facilities for character-oriented processing. ... The apparent deficiency of Fortran for string manipulation is primarily because of the methods traditionally used rather than because of a shortcoming of the language itself. The main shortcoming of Fortran for string handling is the lack of a standard library of routines for often-needed functions. As Fortran programmers we are faced with a choice: we either invest the up-front effort required to create our own standard library or we live with the continuing effort of hacking together a solution each time we are presented with similar problems.

certik commented 4 years ago

Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html.

ivan-pi commented 4 years ago

Where is the latest ISO_VARYING_STRING implementation? Most links are dead by now. The only version I was able to find so far is this one: http://fortrangis.sourceforge.net/doc/iso__varying__string_8F90_source.html.

I have linked three distinct implementations in the top post. The links from the gfortran compiler pages are dead as well as the link in Modern Fortran Explained by MCR.

Edit: An informal description of the iso_varying_string module for Varying Length Character Strings in Fortran can be found at: http://numat.net/fortran/is1539-2-99.html

certik commented 4 years ago

@ivan-pi thanks. I like your plan. It looks like the iso_varying_string is in the "high-level" API category, as it operates on a VARYING_STRING derived type. Our low-level API would be similar, but operating directly on character(len=*).

jacobwilliams commented 4 years ago

Building the low-level API on character(len=*) variables will be problematic for some operations, since they can't be resized. The high-level API will need to call routines that operate on character(len=:),allocatable variables. So you may end up with two slightly different routines in some cases. So are there really three APIs?

That seems complicated to me... but it would cover all the bases...

milancurcic commented 4 years ago

I think there are two possible APIs here: intrinsic and derived-type one.

For the intrinsic API, character(len=*) works well for input strings. If the function will return a string of known size, you return a charecter(len=something) string. If unknown, you return an allocated character(len=:), allocatable string. User doesn't need to know which one it is.

I also see the intrinsic one as the starting point. Higher-level (derived type) implementation is likely to use the intrinsic API internally.

everythingfunctional commented 4 years ago

My understanding, and somebody correct me if I don't have this quite right, is that the ISO_VARYING_STRING standard was created before character(len=:), allocatable (around 2001 I think?), but then when character(len=:), allocatable was added to the standard, it was supposed to function like variable length strings, and so the former was mostly abandoned. However, I have found most compilers to be buggy with their implementation. Memory leaks when used as the return from a function, failure to properly reallocate on assignment, false-positive warnings about accessing uninitialized memory, etc.

If allocatable character actually worked we wouldn't a new derived type for strings. You would just use the intrinsic type and move on. But I think as written in the standard, it probably will never truly work properly in all cases (especially as in read statements, since other allocatable arrays don't and aren't supposed to).

If there is a new type for strings, I don't think a lower level library or API should be exposed, and it should probably not be based on allocatable characters.

certik commented 4 years ago

As I mentioned above, you can use this trick to return character(len=N) strings from functions. The downside is that the string operation gets executed twice --- once to compute the length in the pure procedure, and second time to actually return it. So we probably don't want to do it that way. What I was thinking is to do what @milancurcic suggested: use character(len=*) as well as character(len=N) where we can, and use character(len=:),allocatable to avoid doing the operation twice as I described above. And that's the low level API. Below I provide two examples: https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570399250 and https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570403926, to show one one would decide whether to expose character(len=*) or character(len=:), allocatable.

As @everythingfunctional mentioned, for example GFortran used to have huge problems with allocatable strings and leaked memory. The latest version has improved a lot. Given that this is standard Fortran, and stdlib is a standard library, I think it is ok if we depend on the standard, and if there are compiler bugs, we'll try to workaround them and ensure they are reported. Regarding read statements, see #14 that would handle that. I think we should at least try to create a consistent low level API, not give up without even trying. If it truly cannot be done, only then we'll have to do what you propose, and only expose the string_t type and report the bugs to compilers (and keep the list somewhere) and propose improvements to the language itself, so that it can be done in the future.

everythingfunctional commented 4 years ago

I thought you could only use intrinsic procedures in variable declaration statements. Learned something new. That's a neat trick, but like you said, not particularly efficient.

certik commented 4 years ago

Let's discuss a simple example: upcase.

character(*)

Here is an implementation:

function upcase(s) result(t)
! Returns string 's' in uppercase  
character(*), intent(in) :: s
character(len(s)) :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
    if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
        ! if lowercase, make uppercase
        t(i:i) = char(ichar(t(i:i)) + diff)
    end if
end do
end function

When the user wants to use it, he could do this:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
allocate(character(len(s)) :: a)
a = upcase(s)
print *, a

which prints:

 Some string
 SOME STRING

The main disadvantage of this approach is that the user needs to know the size ahead of time. In this case he knows --- it's the same size as the original string. Although modern gfortran has reallocatable LHS turned on, so then just this works:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a

So I think that would work for upcase.

character(:), allocatable

Here is the implementation using character(:), allocatable

function upcase(s) result(t)
! Returns string 's' in uppercase
character(*), intent(in) :: s
character(:), allocatable :: t
integer :: i, diff
t = s; diff = ichar('A')-ichar('a')
do i = 1, len(t)
    if (ichar(t(i:i)) >= ichar('a') .and. ichar(t(i:i)) <= ichar('z')) then
        ! if lowercase, make uppercase
        t(i:i) = char(ichar(t(i:i)) + diff)
    end if
end do
end function

It's still used like this:

character(*), parameter :: s = "Some string"
character(:), allocatable :: a
print *, s
a = upcase(s)
print *, a

But since this as an extra allocation inside upcase, I would think that in this case, the character(*) version is better.

certik commented 4 years ago

Now let's discuss integer to string conversion, the two implementations:

character(*)

pure integer function str_int_len(i) result(sz)
! Returns the length of the string representation of 'i'
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: s
! If 's' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(s, '(i0)') i
sz = len_trim(s)
end function

pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
character(len=str_int_len(i)) :: s
write(s, '(i0)') i
end function

And usage:

character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)

which prints:

 12345           5

character(:), allocatable

pure function str_int(i) result(s)
! Converts integer "i" to string
integer, intent(in) :: i
integer, parameter :: MAX_STR = 100
character(MAX_STR) :: tmp
character(:), allocatable :: s
! If 'tmp' is too short (MAX_STR too small), Fortran will abort with:
! "Fortran runtime error: End of record"
write(tmp, '(i0)') i
s = trim(tmp)
end function

And usage:

character(:), allocatable :: a
a = str_int(12345)
print *, a, len(a)

which prints:

 12345           5

Discussion

Unlike in the upcase (see previous comment), here the character(*) version is converting twice, so it is inefficient. The character(:), allocatable version just converts once, and so that would be the preferable API.

(Note: if we implement our own integer to string conversion algorithm, then we avoid the ugly MAX_STR thing and the need to call trim. The above implementation was reused from my codes, where I just use the Fortran intrinsic conversion as part of write so that I save code.)

certik commented 4 years ago

Here is my proposal for the low level API:

  1. Use character(*) where possible and efficient (see the previous two comments for examples how to decide)
  2. Use character(:), allocatable otherwise
  3. For compilers that cannot compile the code (or leak memory): create a workaround subroutine with a different (less nice or less efficient) API and use that instead for those compilers only, and report the compiler bug and reference it in the code. As a community we have contacts to compiler vendors, and we can communicate this and help get this fixed. The long term goal would be to eventually have no workarounds in 3.

Unfortunately some compilers might leak memory or segfault when such strings are used in derived types. Ultimately, long term, the compilers must be fixed. That's why I think the above proposal is a good one for the long term. In the short term, if we want to provide strings to users that actually work in all today's compilers, it might be that the only way is to create a string_t type not based on allocatable strings, in which case one could still use the low level API with the workarounds 1., 2., and 3., but make a copy of the result into the derived type string_t that is internally represented differently, so that today's compilers do not leak memory. That would be less efficient than providing a separate string implementation, but it's only a short term issue anyway, until compilers catch up. (Alternatively we can have an efficient duplicate implementation based on the internal string_t representation directly if we want better performance until compilers catch up.)

ivan-pi commented 4 years ago

Specifically for the case of integer to string conversion, you could also dynamically allocate a buffer for each integer kind and then trim the result into an allocatable character string:

    function integer_to_string2(i) result(res)
      character(len=:),allocatable :: res
      integer, intent(in) :: i
      character(len=range(i)+2) :: tmp
      write(tmp,'(i0)') i
      res = trim(tmp)
    end function

If we want to avoid internal I/O this function becomes something like

    function integer_to_string1(ival) result(str)
        integer, intent(in) :: ival
        character(len=:), allocatable :: str
        integer, parameter :: ibuffer_len = range(ival)+2
        character(len=ibuffer_len) :: buffer
        integer :: i, sign, n

        if (ival == 0) then
            str = '0'
            return
        end if

        sign = 1
        if (ival < 0) sign = -1

        n = abs(ival)
        buffer = ""

        i = ibuffer_len
        do while (n > 0)
            buffer(i:i) = char(mod(n,10) + ichar('0'))
            n = n/10
            i = i - 1
        end do
        if (sign == -1) then
            buffer(i:i) = '-'
            i = i - 1
        end if

        str = buffer(i+1:ibuffer_len)
    end function

For processing floating point values the functions are much more difficult to develop compared to those using internal read and write statements.

ivan-pi commented 4 years ago

I did some keyword searchs in the list of popular Fortran projects. It seems that most projects use their own set of character conversion and string handling routines for stuff like reading input values from files, parsing command line options, defining settings, etc..

Here are the results of my search of some of the top projects:

Project # of "string" # of "character" # of Fortran files
ElmerFEM 248 1319 2076
WRF 306 966 1668
fds 16 28 41
quantum-Espresso 66 472 1516
fluidity 38 279 747
json-fortran 26 47 49
fortranlib 11 18 38
Nek5000 54 204 336
cp2k 439 1043 1132
nastran-95 85 551 1838
specfem3d 186 404 765
nwchem 323 2768 17214
gtk-fortran 59 77 92
cfl3d 14 216 397
shtools 2 20 113
arpack-ng 1 259 332

The second and third column measure the number of Fortran files that contain the keywords string or character, respectively. This includes both command statements and comments so it may be a bit misleading.

In one of the codebases I even found this comment:

    ! String parsing in Fortran
    ! is such a pain
    ! it's unreal
ivan-pi commented 4 years ago

Casing

The purpose of these functions is to return of copy of a character string ( either character(len=*) or a derived string type) with the case converted . The common variants are uppercase, lowercase, and titlecase.

The libraries cited in the first post contain the following function prototypes:

! functional
function str_upper(str)
function str_lower(str)
function str_swapcase(str)
pure function ucase(input)
pure function lcase(input)
function str_lowercase(str)
function str_uppercase(str)
subroutine str_convert_to_lowercase(str)
subroutine str_convert_to_uppercase(str)
pure elemental function lowercase_string(str)
function uppercase(str)
function lowercase(str)

! object-oriented
procedure, pass(self) :: camelcase
procedure, pass(self) :: capitalize
procedure, pass(self) :: lower
procedure, pass(self) :: snakecase
procedure, pass(self) :: startcase
procedure, pass(self) :: upper
function vstring_tolower(this[,first,last])
function vstring_toupper(this[,first,last])
function vstring_totitle(this[,first,last])

Some versions will return a new string, while some work in place. In at least one of the functions, it did not convert the case of characters enclosed between quotation marks.

These are the similar functions available in other programming languages:

My top three name picks are:

  1. uppercase/lowercase/titlecase
  2. to_upper/to_lower/to_title
  3. upper/lower/capitalize

Edit: for consistency with the character conversions functions to_lower/to_upper in the module stdlib_experimental_ascii it is maybe better to go for option 2.

milancurcic commented 4 years ago

I'd like to add to the list of facilities here the overloaded operator * between integers and strings, so that you can do, like in Python:

print *, 3 * 'hello' ! prints 'hellohellohello'
print *, 'world' * 2 ! prints 'worldworld' 

It's easy to make and use. The only downside I can think of is a somewhat weird API when importing it:

use stdlib_experimental_strings, only: operator(*)
ivan-pi commented 4 years ago

Yes, I have seen this kind of usage in one of the above mentioned libraries. I am not sure whether it is not perhaps better to promote the usage of the intrinsic repeat function. As the Zen of Python states: There should be one-- and preferably only one --obvious way to do it.

A benefit of repeat is precisely that you avoid the import statement.

milancurcic commented 4 years ago

Oops, I didn't know about repeat. Indeed it's the way to go so I withdraw my proposal. I need to brush up on my canonical Fortran. :)

zbeekman commented 4 years ago

It is really hard to have a day job and keep up with all these threads, so my apologies if I've missed something because I'm just skimming here. A few opinionated notes:

I need to look at the varying string and character array proposals in more detail.

FWIW, I personally prefer the Ruby Python OO approaches with methods because it will make import statements much simpler: Pull in the string class and you get all the methods along with the type/class declaration. Now some operators may need to be pulled in as well if you want to be able to concatenate a real (lhs) with a string (rhs, can't have a TBP operator to the left of the object IIRC).

I was thinking of starting a PR marrying my work on ZstdFortranLib with a UDT/Class approach rather than operating on raw character scalars and arrays which is awkward for things like split(). But now I need to catch up on the myriad of proposals and prior art, so don't hold your breath.

zbeekman commented 4 years ago

While there is an intrinsic implementation, repeat(), I still like the more concise syntax which has pretty clear meaning for anyone who has ever worked with languages like Python and Ruby. It would be nice if some of these syntactic sugar items were added to the standard rather than a standard library. But until then I would be happy with * overloaded for characters.

certik commented 4 years ago

@zbeekman I am struggling with all the threads also, but that is good news. It means there is lots of momentum. If you can help us design a good low and high level API for strings (https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570409245), that would be great.

zbeekman commented 4 years ago

@certik k

Now let's discuss integer to string conversion, the two implementations:

I like your first one the most. With integers you can use some math to count up how many digits there are, and if you need a sign on the front, which completely removes the need to declare the max string length AND to do the IO twice. Instead you use integer and floating point math which (hopefully) will be reasonably quick. IIRC, I implemented something to do this in JSON-Fortran but I'll have to look for it.

Also, I don't mean to whine about not being able to keep up, and I agree that it's good, but it's hard to keep track of all the balls in the air.

zbeekman commented 4 years ago

After a brief search there are at least 3 ways to do this without performing the conversion to a string then counting digits:

  1. Iteration:
    len = 1
    if ( n < 0 ) len = 2
    do
     n = n / 10
     if (n == 0) exit
     len = len + 1
    end do
  2. Tail recursion (same algorithm as above: It will be optimized to code above by compiler or it will be slower if the compiler uses recursion)
  3. "One shot" method using log10
    len = floor( log10( real( abs( n ) ) ) + 1 )
    if ( n < 0 ) len = len + 1

I would guess that 1 is the fastest way to do this, but it may depend on the compiler and hardware. 3 has conversion to a real, then log10 is probably computed iteratively, and it is converted back to an int, so 1. may be faster despite the loop.

certik commented 4 years ago

@zbeekman great idea. I suspect even faster would be 1. together with putting the digits into the string right away. And thus using the "allocatable" approach. In your approach, you compute the length of the string in 1., but then you have to do a similar loop again when you do the actual string conversion. But your approach is definitely a huge improvement, so we could us my first approach above also with this.

zbeekman commented 4 years ago

Why not keep going a little here:

elemental function int_str_len(n) result(res)
    integer, value :: n
    integer :: res

    res = merge(1, 2, i >= 0)
    do
        n = n / 10
        if ( n == 0 ) return
        res = res + 1
    end do
end function int_str_len

I suspect even faster would be 1. together with putting the digits into the string right away. And thus using the "allocatable" approach.

I prefer not working with allocatable character function results due to compiler bugs, but whether or not the compiler is implicitly generating a loop for write(s, '(i0)') i may be interesting from a performance perspective...

Doh! Ah, the dangers of drinking from the fire-hose of fortran-lang/stdlib: It appears @ivan-pi has essentially already figured all this out and implemented a similar version above.

certik commented 4 years ago

See also https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570409245 how I propose to handle the compiler bugs.

ivan-pi commented 4 years ago

Ruby is my favorite language for string processing, and IMO is the best at it. If no one objects (especially @ivan-pi) I'll put links in the first post on this issue

Feel free to edit the first post. I only listed the languages that came off the top of my head.

Doh! Ah, the dangers of drinking from the fire-hose of fortran-lang/stdlib: It appears @ivan-pi has essentially already figured all this out and implemented a similar version above.

I adapted that version from a 1988 book on Fortran tools but took the "one-shot" approach to declare a sufficiently sized buffer. The range intrinsic essentially returns the value of floor(log10(huge(x)) converting the integer to a real first.

zbeekman commented 4 years ago

The range intrinsic essentially returns the value of floor(log10(huge(x)) converting the integer to a real first.

Yes I saw that. And then you cleverly work backwards from the temporary array local variable that holds the largest possible integer you could create. Pretty neat!

I think at the end of the day the allocating the right length string issue is where we'll want some benchmarking with different popular compilers. I'm not convinced that internal IO would be slower than this, but I haven't bench marked it and would happily be wrong!

ivan-pi commented 4 years ago

In the book where I found this approach they say:

The reason for not using internal files is mainly to avoid Fortran I/O if possible on microcomputers with limited memory, with the hope that the host compiler won't include Fortran I/O in the executable image; it is also kind of nice to be able to do it on your own. In our tests of the IBM-PC progams that do not use Fortran I/O in the primitives, replacing versions of itoa and atoi using internal files with the ones shown here reduced the size of executable program by 21k bytes.

I cannot say whether this is true also for modern Fortran compilers or, if it even matters given the large amount of memory available to us today.

zbeekman commented 4 years ago

To follow @gronki's original format of:

  • name of the utility
  • short description
  • does it exist in other languages
  • proposed example of usage

Strip

Chomp

Split

Join

Gsub

Sub

Center

Conversion from character to real, integer, complex, logical, etc.

Integer, real, logical, complex conversion to character strings (ideally for all available kinds)

Overloaded concatenation


Most Python and Ruby string methods with idiomatic, and straightforward-ish implementations in Fortran would be great to have. Cases where intrinsics are already present need not be re-written unless we create a string class vs a functional style. Some others have no obvious or straightforward implementation until more infrastructure (like regex handling) is in place. Regex handlers will almost certainly need to use an external library, unless someone knows how to write parsers and lexers using tools that emit Fortran.

zbeekman commented 4 years ago

I think and important issue for us is to decide:

  1. Functional only approach
  2. OO only approach
  3. Both

I don't think it would be that difficult to provide both a functional interface and an OO interface, and the OO interface could leverage the functional implementation in most places.

zbeekman commented 4 years ago

Also worth considering: Do we want basename, dirname, name_we, extname type of functions to be part of a string class or a file/os class/module?

certik commented 4 years ago

Do we want basename, dirname, name_we, extname type of functions to be part of a string class or a file/os class/module?

That should be the OS module I would think.

I think and important issue for us is to decide:

1. Functional only approach

2. OO only approach

3. Both

It looks like 3. it should be, as we can all agree on that one and move on to actually implement this (as opposed to keep discussing whether to do 1. or 2.).

everythingfunctional commented 4 years ago

Responding to some of @zbeekman proposals

Strip

Doesn't trim already do exactly this? Why have two functions to do exactly the same thing?

Chomp

I think you need more examples about why this is needed. Is the idea not to remove trailing whitespace if it's not a record separator? I.e. chomp("hello \n") => "hello "

Split

There is some subtleties about how split works that need to be pointed out. In Python (and other languages?, I'd have to double check) the separator argument is taken as a list of possible separators, not as a pattern that must be found. (i.e. split("Hello ,World", ", ") => ["Hello", "World"]).

Also, should empty strings be included in the resulting array? I think probably not, or you'd get something you probably didn't intend from a standard use case like split("A, list, of, words", ", ") => ["A", "", "list", "", "of", "", "words"] due to the multiple separators next to each other.

I've got an implementation of split (splitAt) here that behaves like Python's.

Gsub and sub

The replace function from the ISO_VARYING_STRING module combines these two uses into one interface (replace(string, target, substring, every, back)). I don't know if that's better or worse, just that there is already a precedent for it.

Conversion from character to real, integer, complex, logical, etc.

What should happen if the string can't be converted? Most other languages throw exceptions that can be caught, but Fortran can't do that. Providing an optional iostat argument would seem to fit in with the Fortran "style".

Integer, real, logical, complex conversion to character strings (ideally for all available kinds)

The question really is what format should real and complex numbers take. Use the f format specifier or g, or something else? To what precision? What about trailing zeros? or rounding for stuff like 1.00000000001? I've got an implementation here that picks the shorter string from regular vs scientific notation, removes trailing zeros, and give the full precision available for that kind if a number of significant digits isn't specified, but I'm not certain that's what everyone would prefer.

arjenmarkus commented 4 years ago

Like others, I have only skimmed the postings, so I may have missed it, but has the issue of meaningful trailing blanks been discussed? Normally trailing blanks are just an inconvenience and I guess with the allocatable-length strings we have now, the issue is less pressing, but some consideration does seem useful.

arjenmarkus commented 4 years ago

@zbeekman mentioned in https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570725303 parsers and lexers implemented in Fortran. Some years ago I adjusted the SQLite "lemon" parser generator so that it will emit Fortran code - https://sourceforge.net/p/flibs/svncode/HEAD/tree/trunk/src/lemon/. I merely adjusted the code generation parts of the original code and I admit not having used it much, but it could be a starting point. In the same Flibs project: Paul Fossati created an interface for the PCRE library for regexps.

certik commented 4 years ago

has the issue of meaningful trailing blanks been discussed?

Great question, I don't think we explicitly discussed this yet. I was hoping that we will assume and require that there are no trailing blanks (both in the low level as well as the OO API), unless the user wants them there (but in that case "x" and "x " will be treated as different strings). In particular, no need to call trim all over.

arjenmarkus commented 4 years ago

Quite possibly the absence of any magic character - \0 for C-like languages and trailing blanks for Fortran - may automatically resolve the issue, but we will have to take care that such features/quirks do not sneak in ;).

zbeekman commented 4 years ago

@everythingfunctional: I'm not in favor of unnecessary duplication either, and this is just a personal wishlist to shape discussion

Doesn't trim already do exactly this? Why have two functions to do exactly the same thing?

No, trim only handles trailing blanks which are spaces unless I'm misunderstanding something. Strip handles leading and trailing whitespace of all forms.

Is the idea not to remove trailing whitespace if it's not a record separator?

Yes and to provide flexibility on what that separator might be, e.g., commas after all elements except the last one in a list of items.

There is some subtleties about how split works that need to be pointed out.

Agreed. I was quoting the way Ruby does it straight from Ruby's API docs, and I think the examples I posted spell out the semantics fairly well.

In my very opinionated opinion, Ruby is the best language for string handling, so I like APIs that match, where it makes sense, Rubys. But Python has good string handling too, certainly better than Fortran. Just proposing a way to do it, not saying we have to do it that way.

The replace function from the ISO_VARYING_STRING module combines these two uses into one interface (replace(string, target, substring, every, back)). I don't know if that's better or worse, just that there is already a precedent for it.

I haven't had a chance to look at ISO_VARYING_STRING yet. If people like it and there's broad support (and, even better an implementation) then I would happily defer to replace and any other useful functions/utilities. But since we're making something new, I wanted to share what I like from Ruby.

What should happen if the string can't be converted? Most other languages throw exceptions that can be caught, but Fortran can't do that. Providing an optional iostat argument would seem to fit in with the Fortran "style".

Very good question. Where a function is doing the converting it should have an iostat argument. Typically you can convert from one character set into utf8/unicode, going in the other direction requires a check that the characters being converted from unicode exist in the other character set. For concatenation, you can convert asymmetric concatenation to ISO_10646 if it's present.

The question really is what format should real and complex numbers take. Use the f format specifier or g, or something else? To what precision? What about trailing zeros? or rounding for stuff like 1.00000000001?

Yes, I implemented similar logic in JSON-Fortran (with input and help from @jacobwilliams) I think it makes sense to provide some sane defaults with a function to let the user specify their own default output format for reals, complex, integers, etc.

zbeekman commented 4 years ago

Paul Fossati created an interface for the PCRE library for regexps.

I was thinking that using this might be wise. But it introduces a (perhaps critical) external dependency, that may be an unpopular opinion.

zbeekman commented 4 years ago

has the issue of meaningful trailing blanks been discussed?

One of the nice things about most of the ruby APIs is that they give you some flexibility to change semantics for, e.g., repeated blanks, or other patterns. But a lot of this extended capability relies on a regex syntax, which, right now, is probably a non-starter.

certik commented 4 years ago

In order to move this discussion forward --- it seems the implementations are (relatively) straightforward, the hard part is to agree on an API: a set of functions, their name, arguments for the low level API, and similarly for the high level API. If that's the case, @zbeekman, @ivan-pi do you want to start a document where we'll start discussing the function names and their arguments and functionality?

I don't know if the best way is to open a few draft PRs with the document (I assume there will be a few competing versions at first), or if there is another way to discuss a full document.

zbeekman commented 4 years ago

Sure. I need to make myself familiar with N1375 and N1379 And see how it sits with what I had anticipated doing. There's no point reinventing the wheel, so if ISO_VARYING_STRING appeals to me and seems congruent with the opinions I've heard in this thread so far it might be a good starting point. @everythingfunctional seems to already have an MIT licensed implementation, so that could even be a good starting point.

certik commented 4 years ago

ISO_VARYING_STRING operates on a derived type, so that is the high level API. The low level API would operate on character string directly.

I was hoping to also have some document that compares different languages, like @ivan-pi did above, so that we can stay compatible with the API, whenever it makes sense.