Replace substring in character/string_type

ivan-pi commented 3 years ago

I also propose to add a few other functions that can be commonly used.

Replace function, this can be implemented in several ways:
- First implementation: string.replace(pos , len, new_string)
  - pos = starting position of the sub-string to be replaced.
  - len = length of the substring
  - new_string = the string that will replace the substring
- Second implementation: string.replace(pos, old_string, new_string)
  - old_string = substring to be replaced.
  - pos = position to start the search of old_string
  - new_string = string that will replace old_substring
  - This function will replace only the first occurrence of the old_string.
- Third implementation: string.replaceall(pos, old_string, new_string)
  - This meaning of variables is similar to the second implementation.
  - The difference is it will find and replace all the occurrences of old_string after pos and not just one.
We already have an operator to Concat a string with another but I feel we should also implement a function for the same.

The community can have a discussion over the need for such functions and if needed also suggest some other implementation of Replace function.

Originally posted by @ChetanKarwa in https://github.com/fortran-lang/stdlib/issues/364#issuecomment-808800508

ivan-pi commented 3 years ago

This has also been requested a few times before.

@zbeekman wrote in https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570690516:

sub: from Ruby, replaces the first occurance of a substring with a substitution, optional argument to replace the last string gsub: Replaces all occurrences with a substring

@jacobwilliams suggested it also in https://github.com/j3-fortran/fortran_proposals/issues/96#issuecomment-557896465:

Back to original topic. Here's one I use all the time:

1. `string_replace`

2. Replace all occurrences of one string with another

3. Yes, Python has this:

>>> 'aaaa'.replace('a','bb')
'bbbbbbbb'

1. Something along these lines:

character(len=:),allocatable :: str
str = 'aaaa'
call string_replace(str,'a','bb')
write(*,*) str ! writes bbbbbbbb

awvwgk commented 3 years ago

There is also a proposal from the string thread:

Gsub
Global substitution. Same as @jacobwilliams' string_replace as far as I can tell. Eventually it would be nice to support regular expressions like Ruby, but, for now, we should start smaller and just make it behave like: sed 's/<pattern>/<replacement>/g'

Kinda Ruby, Python and sed. Behave like Python without the optional count argument
Fortran:
gsub("hello", "l", "L") ! "heLLo"
gsub("the quick brown fox jumped over the lazy dog", "the", "a")
 ! "a quick brown fox jumped over a lazy dog"
!! Or OO approach
s = "hello"
s%sub("l","L")              ! "heLLo"
Sub
Replace first (or last) instance of one substring with another

Again, similarities with Python and Ruby, but not an exact match
Fortran:
sub("hello", "l", "L") ! "heLlo"
sub("hello", "l", "L", back=.true.) ! "helLo"
sub("the quick brown fox jumped over the lazy dog", "the", "a")
 ! "a quick brown fox jumped over the lazy dog"
sub("the quick brown fox jumped over the lazy dog", "the", "a", back=.true.)
 ! "the quick brown fox jumped over a lazy dog"
!! Or OO approach
s = "hello"
s%sub("l","L")              ! "heLlo"
s%sub("l","L", back=.true.) ! "helLo"

Originally proposed by @zbeekman in https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570725303

awvwgk commented 3 years ago

We have to be a bit careful with the string related patches, because there are only a few modules and many independent features. This will likely result in many conflicting patches and will require a lot of rebasing along the way. The module to put those functions in would be stdlib_strings, which is not yet created on the default branch.

aman-godara commented 3 years ago

Third implementation: string.replaceall(pos, old_string, new_string)

This meaning of variables is similar to the second implementation.

The difference is it will find and replace all the occurrences of old_string after pos and not just one.

What would we expect the function to return if the input is "qwqwqw" and we want to replace all "qwq" with let's say "abc"? Option 1: "abcwqw" Option 2: "abcabcw" (here length of the output is greater than the length of the input) [EDIT]: Option 3: abcw

Option 2 is inspired from the fact that the second q is part of two occurrences of qwq. I cannot think of a particular use case where a user might want option 2 to be the output but it occurred to me so I thought it would be better to put it on the table.

urbanjost commented 3 years ago

features I would like would be to

specify the Nth occurrence starting from the left or right end of the string,
ignore case in the old string when doing a match,
and to be a subroutine for efficiency but a function for convenience. For example: replace. Starting at the other end should probably be implemented as with INDEX() and so on. That particular routine has had feature creep over the years and could be written to be more efficient but has the basic functionality I have found useful.

ivan-pi commented 3 years ago

Python
- replace()
  
  str.replace(old, new[, count]) Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
Javascript
- .replace()
  
  const newStr = str.replace(regexp|substr, newSubstr|function) The replace() method returns a new string with some or all matches of a pattern replaced by a replacement. The pattern can be a string or a RegExp, and the replacement can be a string or a function to be called for each match. If pattern is a string, only the first occurrence will be replaced.
  
  The original string is left unchanged.
- .replaceAll()
  
  const newStr = str.replaceAll(regexp|substr, newSubstr|function) The replaceAll() method returns a new string with all matches of a pattern replaced by a replacement. The pattern can be a string or a RegExp, and the replacement can be a string or a function to be called for each match.
  
  The original string is left unchanged.
Julia
- Base.replace — Method
  
  replace(s::AbstractString, pat=>r; [count::Integer]) Search for the given pattern pat in s, and replace each occurrence with r. If count is provided, replace at most count occurrences. pat may be a single character, a vector or a set of characters, a string, or a regular expression. If r is a function, each occurrence is replaced with r(s) where s is the matched substring (when pat is a AbstractPattern or AbstractString) or character (when pat is an AbstractChar or a collection of AbstractChar). If pat is a regular expression and r is a SubstitutionString, then capture group references in r are replaced with the corresponding matched text. To remove instances of pat from string, set r to the empty String ("").
Perl
- substr EXPR,OFFSET,LENGTH,REPLACEMENT
The Perl substring is interesting in that it can be used both on the right or left side of an assignment. So you can use it to return a new string (substr on the right side), or replace a substring in place (substr on the left side). In Fortran this would require a function and subroutine, as suggested by @urbanjost.

I also support the options suggested by @urbanjost:

count or repeat (replace only first count appearances)
ignore_case
occurence (if present, start changing at the Nth occurrence of the OLD string)

but I would prefer to split them into two functions.

ChetanKarwa commented 3 years ago

Function introduced by @urbanjost can work as an all in one pack for any kind of "replace" operation that a user wants to execute. I support such an all in one function.

But I have a doubt, in deciding the Nth occurrence. For eg: if the parent string is "abababababa" and the old string is "aba"

What will be the 3rd occurrence of old in this parent
- abab aba baba
- abababab aba

ivan-pi commented 3 years ago

What will be the 3rd occurrence of old in this parent

abab aba baba

abababab aba

Python uses a non-overlapping count:

>>> "abababababa".replace("aba","zzz")
'zzzbzzzbzzz'
>>> "abababababa".replace("aba","zzz",1)
'zzzbabababa'
>>> "abababababa".replace("aba","zzz",2)
'zzzbzzzbaba'

which corresponds to your second bullet point.

Edit: The StringiFor library also provides a replace method matching the Python one.

ChetanKarwa commented 3 years ago

If everyone is fine with the idea. I wish to start working on this issue. I can start off by implementing the function proposed by @urbanjost.

awvwgk commented 3 years ago

Feel free to start working on a patch, @ChetanKarwa . You can either start from the default branch and we will figure out rebasing the patch later, or you could base your feature branch on the patch in #343, which already creates the stdlib_strings module.

Let me know if you need any help with this.

ChetanKarwa commented 3 years ago

I wanted to ask these questions:

Shall I implement the KMP algorithm that searches in O(N) time and O(N) space complexity or a Naive algorithm that takes O(N^2) time complexity and no extra space.
Do we need the argument 'cmd' as given in the post by @urbanjost or just keeping old and new is fine enough.
How should the function be accessed.
- replace(parent_str, old_str, new_str)
- parent_str.replace(old_str,new_str)

ivan-pi commented 3 years ago

Do we need the argument 'cmd' as given in the post by @urbanjost or just keeping old and new is fine enough.

I suggest you omit cmd for now.

How should the function be accessed.

replace(parent_str, old_str, new_str)

parent_str.replace(old_str,new_str)

The first one. (For the second version you would need first a string class (#333) which is not available yet. Moreover, the second version would likely just call the first version.) More importantly I think replace will be a function and not a subroutine.

If KMP is not too challenging I think that would be a great start. I guess in practice there will be some trade-offs (e.g. depending on the string sizes, cache size, etc.).

urbanjost commented 3 years ago

I meant to mention not to consider the "cmd" option. I consider the other functionality desirable (ignore case of the old string, replace right to left as well as left to right, some type of selection of which occurrence(s) to replace).

The replace() function mentioned has a lot of history, and was originally part of a line editor so "cmd" preceded having an old and new string.Although still used in my case I would not recommend it in a new implementation. I used it as an example of time-tested desirable features, not as an implementation model. For the cases it was intended for the volume of data being changed did not require much efficiency, but I can see use cases where efficiency would be desirable,

A KMP implementation would potentially be reusable for other functions ( I would be curious to time it against INDEX(3f) in different compilers too) as well, and seems worth implementing. I would be curious if anyone has a known use case for processing large amounts of data with replace(3f) or for searching for words efficiently in general in other proposed stdlib procedures.

I would implement a request to search from right to left with a logical called BACK to be consistent with intrinsics such as INDEX(3f), VERIFY(3f), and SCAN(3f) for consistency instead of using the sign of the occurrence.

arjenmarkus commented 3 years ago

An implementation of "replace all" should be careful with overlapping strings. Consider the following example with @Aman-Godara 's option 2: The string qwqwq and replacing qwq by abc to give abcabcw.

If the replacing string is qwqwq you would get an infinitely long string.

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:

set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I find it very convenient in my Tcl programs and I could probably use it in Fortran as well.

ChetanKarwa commented 3 years ago

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:

set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I didn't get this part, would anyone mind explaining with a proper example of the use of function replace

arjenmarkus commented 3 years ago

I did not mean to suggest that the current round should include a Fortran version of this functionality ;). Just wanted to add it to the wishlist.

Op wo 31 mrt. 2021 om 14:12 schreef ChetanKarwa @.***>:

In addition to the above examples from other languages, I would like to point out the command string map from Tcl:

set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I didn't get this part, would anyone mind explaining with a proper example of the use of function replace

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/366#issuecomment-811020539, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRYVLQN5DJBRTB2WFLDTGMGSNANCNFSM4Z5JHY4A .

ivan-pi commented 3 years ago

In addition to the above examples from other languages, I would like to point out the command string map from Tcl: set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.

I didn't get this part, would anyone mind explaining with a proper example of the use of function replace

You can find a complete explanation in the Tcler wiki: https://wiki.tcl-lang.org/page/string+map

Translated to Fortran it might look like

a = "A B C A"
b = replace(a,[map("A","AA"),map("B","BB")])

It is a more powerful version of replace which allows to map multiple substrings in one pass. The rule to prevent overlap issues is

Replacement is done in an ordered manner, so the key appearing first in the list will be checked first, and so on. string is only iterated over once, so earlier key replacements will have no effect for later key matches.

But this should go to a future PR.

urbanjost commented 3 years ago

although TCL calls them keys and values think of the map list as just a set of old1,new1 old2,new2, old3,new3 so that would be equivalent to multiple calls to REPLACE(3f) assuming REPLACE(3f) is not elemental, except that it is all done in one pass with respect to the original string I think. Actually not sure which way TCL did that, and thought TCL would retain the spaces so it has been too long since I used TCL but either way it raises the question of allowing multiple pairs or old and new values in one call, and if you did would the replacements be cumulative or not? ,, So if you started with the string "AB" and said you wanted to replace "A" with "B" and "B" with "XXX" if you did it one pair at a time and used the output of previous changes as input to new changes (like you called REPLACE(3f) twice) you would get AB=>BB by the first replacement, and then BB=>XXXXXX; but if you did all replacements relative to the original input string you would get "BXXX". So if you allow multiple OLD,NEW pairs you have at least those two possible outcomes to consider. Again, I have not used TCL in a long time, but irregardless if you allow multiple pairs you have to decide if the result should be identical to multiple calls to REPLACE(3f) or if all the substitutions would be in reference to the original string, more like variable name substitution in the bash(1) shell, for example.

aman-godara commented 3 years ago

I wish to work on this issue. Please assign this to me.

Also, we have intrinsic index function in fortran and some of the above mentioned features like replace_all and getting the index of the 3rd occurrence of the substring in the input string when implemented efficiently require manipulating the index function internally. So we will have to implement index function from scratch. Also, @awvwgk pointed out that we cannot name this new function as index. I think may be because its features conflicts with the intrinsic index function. Please also note: We currently have index function in our stdlib_string_type module as well.

awvwgk commented 3 years ago

Maybe there is a better name like find_substring or just find instead of index? If the interface turns out similar enough to the intrinsic index function we can think about overloading the index function with this functionality and eventually propose it as addition to the Fortran standard.

aman-godara commented 3 years ago

Since we always put a low-level functionality forward first for a new functionality that we wish to add to stdlib. I decided to implement replace_all. I personally try to design these low-level functionality such that all possible high-level operations can still be done using a combination of many low-level functionalities. For eg: I decided to not add pos argument at this moment because the same operation can be done using a combination of slice (already added to stdlib) and find (under review). But I added replace_overlapping feature to replace_all because this cannot be achieved in any other way.

But tcl's replace_map functionality is still not added in this replace_all. I wonder what would the output be for Tcl's replace_map function if the input string is "wababw" and the input map is {"abab" -> "pqrs", "ba" -> "ds"} will it be Option 1). "wpqrsdsw" Option 2). "wpqrsw"?

Taking example from @urbanjost's comment. I think one possible hacky way of replacing "AB" using the map {"A" -> "B", "B" -> "XXX"} would be to replace "A" with a dummy character let's say "$" and "B" with let's say "#" and do the normal replace_all call twice i.e. once for each of the two replacements. This will give us "$#". Now replace "$" with "B" and then "#" with "XXX", again by calling replace_all twice. So at the end we will have "BXXX". In short, hacky solution first takes the input to an intermediate stage and from this intermediate stage it takes to the final solution.

arjenmarkus commented 3 years ago

I just tried your example and the answer is option 2.

Op di 15 jun. 2021 om 21:27 schreef Aman Godara @.***>:

Since we always put a low-level functionality forward first for a new functionality that we wish to add to stdlib. I decided to implement replace_all. I personally try to design these low-level functionality such that all possible high-level operations can still be done using a combination of many low-level functionalities. For eg: I decided to not add pos argument at this moment because the same operation can be done using a combination of slice (already added to stdlib) and find (under review). But I added replace_overlapping feature to replace_all because this cannot be achieved in any other way.

But tcl's replace_map functionality is still not added in this replace_all. I wonder what would the output be for Tcl's replace_map function if the input string is "wababw" and the input map is {"abab" -> "pqrs", "ba" -> "ds"} will it be Option 1). "wpqrsdsw" Option 2). "wpqrsw"?

Taking example from @urbanjost https://github.com/urbanjost's comment https://github.com/fortran-lang/stdlib/issues/366#issuecomment-811040307 . I think one possible hacky way of replacing "AB" using the map {"A" -> "B", "B" -> "XXX"} would be to replace "A" with a dummy character let's say "$" and "B" with let's say "#" and do the normal replace_all call twice i.e. once for each of the two replacements. This will give us "$#". Now replace "$" with "B" and then "#" with "XXX", again by calling replace_all twice. So at the end we will have "BXXX". In short, hacky solution first takes the input to an intermediate stage and from this intermediate stage it takes to the final solution.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/366#issuecomment-861772550, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR2KCZH6HYNDOPPTOKTTS6SR5ANCNFSM4Z5JHY4A .

aman-godara commented 3 years ago

I tried an online compiler for Tcl (by tutorials point) to understand more about the behaviour of the function.

Here is the code:

set a "wababw"
set a [string map {"abab" "ppp" "abw" "tt"} $a]
puts $a

output is: wpppw.

It confirmed that overlapping substitutions are avoided.

Also, one thing to note is when I tried

set a "a b c a"
set a [string map {"a" "aa" "b" "bb"} $a]
puts $a

Output retained the spaces: "aa bb c aa" for this online compiler (it says Tcl v8.6.6) and tried few other similar operations to conclude that the length of the output may not be equal to that of input.

Conclusion: The "hacky way" discussed in here will work everytime to deliver the behaviour of Tcl's replace_map functionality, only hurdle is to very carefully come up with dummy character(s).

But IMO there can be other ways to deliver replace_map functionality. Inspiration Aho_Corasick_Algorithm but not exactly this.

fortran-lang / stdlib

Replace substring in character/string_type #366

Gsub

Sub