Open ivan-pi opened 3 years ago
This has also been requested a few times before.
@zbeekman wrote in https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570690516:
sub: from Ruby, replaces the first occurance of a substring with a substitution, optional argument to replace the last string gsub: Replaces all occurrences with a substring
@jacobwilliams suggested it also in https://github.com/j3-fortran/fortran_proposals/issues/96#issuecomment-557896465:
Back to original topic. Here's one I use all the time:
1. `string_replace` 2. Replace all occurrences of one string with another 3. Yes, Python has this:
>>> 'aaaa'.replace('a','bb') 'bbbbbbbb'
1. Something along these lines:
character(len=:),allocatable :: str str = 'aaaa' call string_replace(str,'a','bb') write(*,*) str ! writes bbbbbbbb
There is also a proposal from the string thread:
Gsub
- Global substitution. Same as @jacobwilliams'
string_replace
as far as I can tell. Eventually it would be nice to support regular expressions like Ruby, but, for now, we should start smaller and just make it behave like:sed 's/<pattern>/<replacement>/g'
- Kinda Ruby, Python and
sed
. Behave like Python without the optionalcount
argument- Fortran:
gsub("hello", "l", "L") ! "heLLo" gsub("the quick brown fox jumped over the lazy dog", "the", "a") ! "a quick brown fox jumped over a lazy dog" !! Or OO approach s = "hello" s%sub("l","L") ! "heLLo"
Sub
- Replace first (or last) instance of one substring with another
- Again, similarities with Python and Ruby, but not an exact match
- Fortran:
sub("hello", "l", "L") ! "heLlo" sub("hello", "l", "L", back=.true.) ! "helLo" sub("the quick brown fox jumped over the lazy dog", "the", "a") ! "a quick brown fox jumped over the lazy dog" sub("the quick brown fox jumped over the lazy dog", "the", "a", back=.true.) ! "the quick brown fox jumped over a lazy dog" !! Or OO approach s = "hello" s%sub("l","L") ! "heLlo" s%sub("l","L", back=.true.) ! "helLo"
Originally proposed by @zbeekman in https://github.com/fortran-lang/stdlib/issues/69#issuecomment-570725303
We have to be a bit careful with the string related patches, because there are only a few modules and many independent features. This will likely result in many conflicting patches and will require a lot of rebasing along the way. The module to put those functions in would be stdlib_strings
, which is not yet created on the default branch.
Third implementation: string.replaceall(pos, old_string, new_string)
- This meaning of variables is similar to the second implementation.
- The difference is it will find and replace all the occurrences of old_string after pos and not just one.
What would we expect the function to return if the input is "qwqwqw"
and we want to replace all "qwq"
with let's say "abc"
?
Option 1: "abcwqw"
Option 2: "abcabcw"
(here length of the output is greater than the length of the input)
[EDIT]: Option 3: abcw
Option 2 is inspired from the fact that the second q
is part of two occurrences of qwq
. I cannot think of a particular use case where a user might want option 2 to be the output but it occurred to me so I thought it would be better to put it on the table.
features I would like would be to
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
Javascript
const newStr = str.replace(regexp|substr, newSubstr|function)
The replace() method returns a new string with some or all matches of a pattern replaced by a replacement. The pattern can be a string or a RegExp, and the replacement can be a string or a function to be called for each match. If pattern is a string, only the first occurrence will be replaced.The original string is left unchanged.
const newStr = str.replaceAll(regexp|substr, newSubstr|function)
The replaceAll() method returns a new string with all matches of a pattern replaced by a replacement. The pattern can be a string or a RegExp, and the replacement can be a string or a function to be called for each match.The original string is left unchanged.
replace(s::AbstractString, pat=>r; [count::Integer])
Search for the given pattern pat in s, and replace each occurrence with r. If count is provided, replace at most count occurrences. pat may be a single character, a vector or a set of characters, a string, or a regular expression. If r is a function, each occurrence is replaced with r(s) where s is the matched substring (when pat is a AbstractPattern or AbstractString) or character (when pat is an AbstractChar or a collection of AbstractChar). If pat is a regular expression and r is a SubstitutionString, then capture group references in r are replaced with the corresponding matched text. To remove instances of pat from string, set r to the empty String ("").
Perl
The Perl substring is interesting in that it can be used both on the right or left side of an assignment. So you can use it to return a new string (substr on the right side), or replace a substring in place (substr on the left side). In Fortran this would require a function and subroutine, as suggested by @urbanjost.
I also support the options suggested by @urbanjost:
count
or repeat
(replace only first count
appearances)ignore_case
occurence
(if present, start changing at the Nth occurrence of the OLD string)but I would prefer to split them into two functions.
Function introduced by @urbanjost can work as an all in one pack for any kind of "replace" operation that a user wants to execute. I support such an all in one function.
But I have a doubt, in deciding the Nth occurrence. For eg: if the parent string is "abababababa" and the old string is "aba"
What will be the 3rd occurrence of old in this parent
- abab aba baba
- abababab aba
Python uses a non-overlapping count:
>>> "abababababa".replace("aba","zzz")
'zzzbzzzbzzz'
>>> "abababababa".replace("aba","zzz",1)
'zzzbabababa'
>>> "abababababa".replace("aba","zzz",2)
'zzzbzzzbaba'
which corresponds to your second bullet point.
Edit: The StringiFor library also provides a replace
method matching the Python one.
If everyone is fine with the idea. I wish to start working on this issue. I can start off by implementing the function proposed by @urbanjost.
Feel free to start working on a patch, @ChetanKarwa . You can either start from the default branch and we will figure out rebasing the patch later, or you could base your feature branch on the patch in #343, which already creates the stdlib_strings
module.
Let me know if you need any help with this.
I wanted to ask these questions:
- Do we need the argument 'cmd' as given in the post by @urbanjost or just keeping old and new is fine enough.
I suggest you omit cmd
for now.
How should the function be accessed.
- replace(parent_str, old_str, new_str)
- parent_str.replace(old_str,new_str)
The first one. (For the second version you would need first a string class (#333) which is not available yet. Moreover, the second version would likely just call the first version.) More importantly I think replace
will be a function and not a subroutine.
If KMP is not too challenging I think that would be a great start. I guess in practice there will be some trade-offs (e.g. depending on the string sizes, cache size, etc.).
I meant to mention not to consider the "cmd" option. I consider the other functionality desirable (ignore case of the old string, replace right to left as well as left to right, some type of selection of which occurrence(s) to replace).
The replace() function mentioned has a lot of history, and was originally part of a line editor so "cmd" preceded having an old and new string.Although still used in my case I would not recommend it in a new implementation. I used it as an example of time-tested desirable features, not as an implementation model. For the cases it was intended for the volume of data being changed did not require much efficiency, but I can see use cases where efficiency would be desirable,
A KMP implementation would potentially be reusable for other functions ( I would be curious to time it against INDEX(3f) in different compilers too) as well, and seems worth implementing. I would be curious if anyone has a known use case for processing large amounts of data with replace(3f) or for searching for words efficiently in general in other proposed stdlib procedures.
I would implement a request to search from right to left with a logical called BACK to be consistent with intrinsics such as INDEX(3f), VERIFY(3f), and SCAN(3f) for consistency instead of using the sign of the occurrence.
An implementation of "replace all" should be careful with overlapping strings. Consider the following example with @Aman-Godara 's option 2:
The string qwqwq
and replacing qwq
by abc
to give abcabcw
.
If the replacing string is qwqwq
you would get an infinitely long string.
In addition to the above examples from other languages, I would like to point out the command string map
from Tcl:
set a "A B C A"
string map {"A" "AA" "B" "BB"} $a
results in AABBCAA
and care is taken to avoid overlapping substitutions.
I find it very convenient in my Tcl programs and I could probably use it in Fortran as well.
In addition to the above examples from other languages, I would like to point out the command string map from Tcl:
set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.
I didn't get this part, would anyone mind explaining with a proper example of the use of function replace
I did not mean to suggest that the current round should include a Fortran version of this functionality ;). Just wanted to add it to the wishlist.
Op wo 31 mrt. 2021 om 14:12 schreef ChetanKarwa @.***>:
In addition to the above examples from other languages, I would like to point out the command string map from Tcl:
set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.
I didn't get this part, would anyone mind explaining with a proper example of the use of function replace
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/366#issuecomment-811020539, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRYVLQN5DJBRTB2WFLDTGMGSNANCNFSM4Z5JHY4A .
In addition to the above examples from other languages, I would like to point out the command string map from Tcl: set a "A B C A" string map {"A" "AA" "B" "BB"} $a results in AABBCAA and care is taken to avoid overlapping substitutions.
I didn't get this part, would anyone mind explaining with a proper example of the use of function replace
You can find a complete explanation in the Tcler wiki: https://wiki.tcl-lang.org/page/string+map
Translated to Fortran it might look like
a = "A B C A"
b = replace(a,[map("A","AA"),map("B","BB")])
It is a more powerful version of replace which allows to map multiple substrings in one pass. The rule to prevent overlap issues is
Replacement is done in an ordered manner, so the key appearing first in the list will be checked first, and so on. string is only iterated over once, so earlier key replacements will have no effect for later key matches.
But this should go to a future PR.
although TCL calls them keys and values think of the map list as just a set of old1,new1 old2,new2, old3,new3 so that would be equivalent to multiple calls to REPLACE(3f) assuming REPLACE(3f) is not elemental, except that it is all done in one pass with respect to the original string I think. Actually not sure which way TCL did that, and thought TCL would retain the spaces so it has been too long since I used TCL but either way it raises the question of allowing multiple pairs or old and new values in one call, and if you did would the replacements be cumulative or not? ,, So if you started with the string "AB" and said you wanted to replace "A" with "B" and "B" with "XXX" if you did it one pair at a time and used the output of previous changes as input to new changes (like you called REPLACE(3f) twice) you would get AB=>BB by the first replacement, and then BB=>XXXXXX; but if you did all replacements relative to the original input string you would get "BXXX". So if you allow multiple OLD,NEW pairs you have at least those two possible outcomes to consider. Again, I have not used TCL in a long time, but irregardless if you allow multiple pairs you have to decide if the result should be identical to multiple calls to REPLACE(3f) or if all the substitutions would be in reference to the original string, more like variable name substitution in the bash(1) shell, for example.
I wish to work on this issue. Please assign this to me.
Also, we have intrinsic index
function in fortran and some of the above mentioned features like replace_all
and getting the index of the 3rd occurrence of the substring
in the input string
when implemented efficiently require manipulating the index
function internally. So we will have to implement index
function from scratch.
Also, @awvwgk pointed out that we cannot name this new function as index
. I think may be because its features conflicts with the intrinsic index
function.
Please also note: We currently have index
function in our stdlib_string_type
module as well.
Maybe there is a better name like find_substring
or just find
instead of index
? If the interface turns out similar enough to the intrinsic index
function we can think about overloading the index
function with this functionality and eventually propose it as addition to the Fortran standard.
Since we always put a low-level functionality forward first for a new functionality that we wish to add to stdlib
. I decided to implement replace_all
.
I personally try to design these low-level functionality such that all possible high-level operations can still be done using a combination of many low-level functionalities. For eg: I decided to not add pos
argument at this moment because the same operation can be done using a combination of slice
(already added to stdlib
) and find
(under review).
But I added replace_overlapping
feature to replace_all
because this cannot be achieved in any other way.
But tcl's replace_map
functionality is still not added in this replace_all
.
I wonder what would the output be for Tcl's replace_map
function if the input string is "wababw"
and the input map is {"abab" -> "pqrs", "ba" -> "ds"}
will it be
Option 1). "wpqrsdsw"
Option 2). "wpqrsw"
?
Taking example from @urbanjost's comment.
I think one possible hacky way of replacing "AB" using the map {"A" -> "B", "B" -> "XXX"} would be to replace "A" with a dummy character let's say "$" and "B" with let's say "#" and do the normal replace_all
call twice i.e. once for each of the two replacements. This will give us "$#". Now replace "$" with "B" and then "#" with "XXX", again by calling replace_all
twice. So at the end we will have "BXXX". In short, hacky solution first takes the input to an intermediate stage and from this intermediate stage it takes to the final solution.
I just tried your example and the answer is option 2.
Op di 15 jun. 2021 om 21:27 schreef Aman Godara @.***>:
Since we always put a low-level functionality forward first for a new functionality that we wish to add to stdlib. I decided to implement replace_all. I personally try to design these low-level functionality such that all possible high-level operations can still be done using a combination of many low-level functionalities. For eg: I decided to not add pos argument at this moment because the same operation can be done using a combination of slice (already added to stdlib) and find (under review). But I added replace_overlapping feature to replace_all because this cannot be achieved in any other way.
But tcl's replace_map functionality is still not added in this replace_all. I wonder what would the output be for Tcl's replace_map function if the input string is "wababw" and the input map is {"abab" -> "pqrs", "ba" -> "ds"} will it be Option 1). "wpqrsdsw" Option 2). "wpqrsw"?
Taking example from @urbanjost https://github.com/urbanjost's comment https://github.com/fortran-lang/stdlib/issues/366#issuecomment-811040307 . I think one possible hacky way of replacing "AB" using the map {"A" -> "B", "B" -> "XXX"} would be to replace "A" with a dummy character let's say "$" and "B" with let's say "#" and do the normal replace_all call twice i.e. once for each of the two replacements. This will give us "$#". Now replace "$" with "B" and then "#" with "XXX", again by calling replace_all twice. So at the end we will have "BXXX". In short, hacky solution first takes the input to an intermediate stage and from this intermediate stage it takes to the final solution.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/366#issuecomment-861772550, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR2KCZH6HYNDOPPTOKTTS6SR5ANCNFSM4Z5JHY4A .
I tried an online compiler for Tcl (by tutorials point) to understand more about the behaviour of the function.
Here is the code:
set a "wababw"
set a [string map {"abab" "ppp" "abw" "tt"} $a]
puts $a
output is: wpppw
.
It confirmed that overlapping substitutions are avoided.
Also, one thing to note is when I tried
set a "a b c a"
set a [string map {"a" "aa" "b" "bb"} $a]
puts $a
Output retained the spaces: "aa bb c aa"
for this online compiler (it says Tcl v8.6.6) and tried few other similar operations to conclude that the length of the output may not be equal to that of input.
Conclusion:
The "hacky way" discussed in here will work everytime to deliver the behaviour of Tcl's replace_map
functionality, only hurdle is to very carefully come up with dummy character(s).
But IMO there can be other ways to deliver replace_map
functionality. Inspiration Aho_Corasick_Algorithm but not exactly this.
I also propose to add a few other functions that can be commonly used.
Replace function, this can be implemented in several ways:
We already have an operator to Concat a string with another but I feel we should also implement a function for the same.
The community can have a discussion over the need for such functions and if needed also suggest some other implementation of Replace function.
Originally posted by @ChetanKarwa in https://github.com/fortran-lang/stdlib/issues/364#issuecomment-808800508