fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License
1.07k stars 164 forks source link

Replace substring in character/string_type #366

Open ivan-pi opened 3 years ago

ivan-pi commented 3 years ago

I also propose to add a few other functions that can be commonly used.

The community can have a discussion over the need for such functions and if needed also suggest some other implementation of Replace function.

Originally posted by @ChetanKarwa in https://github.com/fortran-lang/stdlib/issues/364#issuecomment-808800508

ivan-pi commented 3 years ago

This has also been requested a few times before.

@zbeekman wrote in https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570690516:

sub: from Ruby, replaces the first occurance of a substring with a substitution, optional argument to replace the last string gsub: Replaces all occurrences with a substring

@jacobwilliams suggested it also in https://github.com/j3-fortran/fortran_proposals/issues/96#issuecomment-557896465:

Back to original topic. Here's one I use all the time:

1. `string_replace`

2. Replace all occurrences of one string with another

3. Yes, Python has this:
>>> 'aaaa'.replace('a','bb')
'bbbbbbbb'
1. Something along these lines:
character(len=:),allocatable :: str
str = 'aaaa'
call string_replace(str,'a','bb')
write(*,*) str ! writes bbbbbbbb
awvwgk commented 3 years ago

There is also a proposal from the string thread:

Gsub

  • Global substitution. Same as @jacobwilliams' string_replace as far as I can tell. Eventually it would be nice to support regular expressions like Ruby, but, for now, we should start smaller and just make it behave like: sed 's/<pattern>/<replacement>/g'
  • Kinda Ruby, Python and sed. Behave like Python without the optional count argument
  • Fortran:
    gsub("hello", "l", "L") ! "heLLo"
    gsub("the quick brown fox jumped over the lazy dog", "the", "a")
     ! "a quick brown fox jumped over a lazy dog"
    !! Or OO approach
    s = "hello"
    s%sub("l","L")              ! "heLLo"

Sub

  • Replace first (or last) instance of one substring with another
  • Again, similarities with Python and Ruby, but not an exact match
  • Fortran:
    sub("hello", "l", "L") ! "heLlo"
    sub("hello", "l", "L", back=.true.) ! "helLo"
    sub("the quick brown fox jumped over the lazy dog", "the", "a")
     ! "a quick brown fox jumped over the lazy dog"
    sub("the quick brown fox jumped over the lazy dog", "the", "a", back=.true.)
     ! "the quick brown fox jumped over a lazy dog"
    !! Or OO approach
    s = "hello"
    s%sub("l","L")              ! "heLlo"
    s%sub("l","L", back=.true.) ! "helLo"

Originally proposed by @zbeekman in https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570725303

awvwgk commented 3 years ago

We have to be a bit careful with the string related patches, because there are only a few modules and many independent features. This will likely result in many conflicting patches and will require a lot of rebasing along the way. The module to put those functions in would be stdlib_strings, which is not yet created on the default branch.

aman-godara commented 3 years ago
  • Third implementation: string.replaceall(pos, old_string, new_string)

    • This meaning of variables is similar to the second implementation.
    • The difference is it will find and replace all the occurrences of old_string after pos and not just one.

What would we expect the function to return if the input is "qwqwqw" and we want to replace all "qwq" with let's say "abc"? Option 1: "abcwqw" Option 2: "abcabcw" (here length of the output is greater than the length of the input) [EDIT]: Option 3: abcw

Option 2 is inspired from the fact that the second q is part of two occurrences of qwq. I cannot think of a particular use case where a user might want option 2 to be the output but it occurred to me so I thought it would be better to put it on the table.

urbanjost commented 3 years ago

features I would like would be to

ivan-pi commented 3 years ago

I also support the options suggested by @urbanjost:

but I would prefer to split them into two functions.

ChetanKarwa commented 3 years ago

Function introduced by @urbanjost can work as an all in one pack for any kind of "replace" operation that a user wants to execute. I support such an all in one function.

But I have a doubt, in deciding the Nth occurrence. For eg: if the parent string is "abababababa" and the old string is "aba"

ivan-pi commented 3 years ago

What will be the 3rd occurrence of old in this parent

  • abab aba baba
  • abababab aba

Python uses a non-overlapping count:

>>> "abababababa".replace("aba","zzz")
'zzzbzzzbzzz'
>>> "abababababa".replace("aba","zzz",1)
'zzzbabababa'
>>> "abababababa".replace("aba","zzz",2)
'zzzbzzzbaba'

which corresponds to your second bullet point.

Edit: The StringiFor library also provides a replace method matching the Python one.

ChetanKarwa commented 3 years ago

If everyone is fine with the idea. I wish to start working on this issue. I can start off by implementing the function proposed by @urbanjost.

awvwgk commented 3 years ago

Feel free to start working on a patch, @ChetanKarwa . You can either start from the default branch and we will figure out rebasing the patch later, or you could base your feature branch on the patch in #343, which already creates the stdlib_strings module.

Let me know if you need any help with this.

ChetanKarwa commented 3 years ago

I wanted to ask these questions:

ivan-pi commented 3 years ago
  • Do we need the argument 'cmd' as given in the post by @urbanjost or just keeping old and new is fine enough.

I suggest you omit cmd for now.

How should the function be accessed.

  • replace(parent_str, old_str, new_str)
  • parent_str.replace(old_str,new_str)

The first one. (For the second version you would need first a string class (#333) which is not available yet. Moreover, the second version would likely just call the first version.) More importantly I think replace will be a function and not a subroutine.

If KMP is not too challenging I think that would be a great start. I guess in practice there will be some trade-offs (e.g. depending on the string sizes, cache size, etc.).

urbanjost commented 3 years ago

I meant to mention not to consider the "cmd" option. I consider the other functionality desirable (ignore case of the old string, replace right to left as well as left to right, some type of selection of which occurrence(s) to replace).

The replace() function mentioned has a lot of history, and was originally part of a line editor so "cmd" preceded having an old and new string.Although still used in my case I would not recommend it in a new implementation. I used it as an example of time-tested desirable features, not as an implementation model. For the cases it was intended for the volume of data being changed did not require much efficiency, but I can see use cases where efficiency would be desirable,

A KMP implementation would potentially be reusable for other functions ( I would be curious to time it against INDEX(3f) in different compilers too) as well, and seems worth implementing. I would be curious if anyone has a known use case for processing large amounts of data with replace(3f) or for searching for words efficiently in general in other proposed stdlib procedures.

I would implement a request to search from right to left with a logical called BACK to be consistent with intrinsics such as INDEX(3f), VERIFY(3f), and SCAN(3f) for consistency instead of using the sign of the occurrence.

arjenmarkus commented 3 years ago

An implementation of "replace all" should be careful with overlapping strings. Consider the following example with @Aman-Godara 's option 2: The string qwqwq and replacing qwq by abc to give abcabcw.

If the replacing string is qwqwq you would get an infinitely long string.

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:

set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I find it very convenient in my Tcl programs and I could probably use it in Fortran as well.

ChetanKarwa commented 3 years ago

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:

set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I didn't get this part, would anyone mind explaining with a proper example of the use of function replace

arjenmarkus commented 3 years ago

I did not mean to suggest that the current round should include a Fortran version of this functionality ;). Just wanted to add it to the wishlist.

Op wo 31 mrt. 2021 om 14:12 schreef ChetanKarwa @.***>:

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:

set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I didn't get this part, would anyone mind explaining with a proper example of the use of function replace

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/366#issuecomment-811020539, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRYVLQN5DJBRTB2WFLDTGMGSNANCNFSM4Z5JHY4A .

ivan-pi commented 3 years ago

In addition to the above examples from other languages, I would like to point out the command string map from Tcl: set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I didn't get this part, would anyone mind explaining with a proper example of the use of function replace

You can find a complete explanation in the Tcler wiki: https://wiki.tcl-lang.org/page/string+map

Translated to Fortran it might look like

a = "A B C A"
b = replace(a,[map("A","AA"),map("B","BB")])

It is a more powerful version of replace which allows to map multiple substrings in one pass. The rule to prevent overlap issues is

Replacement is done in an ordered manner, so the key appearing first in the list will be checked first, and so on. string is only iterated over once, so earlier key replacements will have no effect for later key matches.

But this should go to a future PR.

urbanjost commented 3 years ago

although TCL calls them keys and values think of the map list as just a set of old1,new1 old2,new2, old3,new3 so that would be equivalent to multiple calls to REPLACE(3f) assuming REPLACE(3f) is not elemental, except that it is all done in one pass with respect to the original string I think. Actually not sure which way TCL did that, and thought TCL would retain the spaces so it has been too long since I used TCL but either way it raises the question of allowing multiple pairs or old and new values in one call, and if you did would the replacements be cumulative or not? ,, So if you started with the string "AB" and said you wanted to replace "A" with "B" and "B" with "XXX" if you did it one pair at a time and used the output of previous changes as input to new changes (like you called REPLACE(3f) twice) you would get AB=>BB by the first replacement, and then BB=>XXXXXX; but if you did all replacements relative to the original input string you would get "BXXX". So if you allow multiple OLD,NEW pairs you have at least those two possible outcomes to consider. Again, I have not used TCL in a long time, but irregardless if you allow multiple pairs you have to decide if the result should be identical to multiple calls to REPLACE(3f) or if all the substitutions would be in reference to the original string, more like variable name substitution in the bash(1) shell, for example.

aman-godara commented 3 years ago

I wish to work on this issue. Please assign this to me.

Also, we have intrinsic index function in fortran and some of the above mentioned features like replace_all and getting the index of the 3rd occurrence of the substring in the input string when implemented efficiently require manipulating the index function internally. So we will have to implement index function from scratch. Also, @awvwgk pointed out that we cannot name this new function as index. I think may be because its features conflicts with the intrinsic index function. Please also note: We currently have index function in our stdlib_string_type module as well.

awvwgk commented 3 years ago

Maybe there is a better name like find_substring or just find instead of index? If the interface turns out similar enough to the intrinsic index function we can think about overloading the index function with this functionality and eventually propose it as addition to the Fortran standard.

aman-godara commented 3 years ago

Since we always put a low-level functionality forward first for a new functionality that we wish to add to stdlib. I decided to implement replace_all. I personally try to design these low-level functionality such that all possible high-level operations can still be done using a combination of many low-level functionalities. For eg: I decided to not add pos argument at this moment because the same operation can be done using a combination of slice (already added to stdlib) and find (under review). But I added replace_overlapping feature to replace_all because this cannot be achieved in any other way.

But tcl's replace_map functionality is still not added in this replace_all. I wonder what would the output be for Tcl's replace_map function if the input string is "wababw" and the input map is {"abab" -> "pqrs", "ba" -> "ds"} will it be Option 1). "wpqrsdsw" Option 2). "wpqrsw"?

Taking example from @urbanjost's comment. I think one possible hacky way of replacing "AB" using the map {"A" -> "B", "B" -> "XXX"} would be to replace "A" with a dummy character let's say "$" and "B" with let's say "#" and do the normal replace_all call twice i.e. once for each of the two replacements. This will give us "$#". Now replace "$" with "B" and then "#" with "XXX", again by calling replace_all twice. So at the end we will have "BXXX". In short, hacky solution first takes the input to an intermediate stage and from this intermediate stage it takes to the final solution.

arjenmarkus commented 3 years ago

I just tried your example and the answer is option 2.

Op di 15 jun. 2021 om 21:27 schreef Aman Godara @.***>:

Since we always put a low-level functionality forward first for a new functionality that we wish to add to stdlib. I decided to implement replace_all. I personally try to design these low-level functionality such that all possible high-level operations can still be done using a combination of many low-level functionalities. For eg: I decided to not add pos argument at this moment because the same operation can be done using a combination of slice (already added to stdlib) and find (under review). But I added replace_overlapping feature to replace_all because this cannot be achieved in any other way.

But tcl's replace_map functionality is still not added in this replace_all. I wonder what would the output be for Tcl's replace_map function if the input string is "wababw" and the input map is {"abab" -> "pqrs", "ba" -> "ds"} will it be Option 1). "wpqrsdsw" Option 2). "wpqrsw"?

Taking example from @urbanjost https://github.com/urbanjost's comment https://github.com/fortran-lang/stdlib/issues/366#issuecomment-811040307 . I think one possible hacky way of replacing "AB" using the map {"A" -> "B", "B" -> "XXX"} would be to replace "A" with a dummy character let's say "$" and "B" with let's say "#" and do the normal replace_all call twice i.e. once for each of the two replacements. This will give us "$#". Now replace "$" with "B" and then "#" with "XXX", again by calling replace_all twice. So at the end we will have "BXXX". In short, hacky solution first takes the input to an intermediate stage and from this intermediate stage it takes to the final solution.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/366#issuecomment-861772550, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR2KCZH6HYNDOPPTOKTTS6SR5ANCNFSM4Z5JHY4A .

aman-godara commented 3 years ago

I tried an online compiler for Tcl (by tutorials point) to understand more about the behaviour of the function.

Here is the code:

set a "wababw"
set a [string map {"abab" "ppp" "abw" "tt"} $a]
puts $a

output is: wpppw.

It confirmed that overlapping substitutions are avoided.

Also, one thing to note is when I tried

set a "a b c a"
set a [string map {"a" "aa" "b" "bb"} $a]
puts $a

Output retained the spaces: "aa bb c aa" for this online compiler (it says Tcl v8.6.6) and tried few other similar operations to conclude that the length of the output may not be equal to that of input.

Conclusion: The "hacky way" discussed in here will work everytime to deliver the behaviour of Tcl's replace_map functionality, only hurdle is to very carefully come up with dummy character(s).

But IMO there can be other ways to deliver replace_map functionality. Inspiration Aho_Corasick_Algorithm but not exactly this.