Closed: dave-csc closed this issue 1 day ago
I am not sure what the ask here is. It's definitely not a bug. What is the goal you are trying to achieve @dave-csc ?
The best way I have found to evaluate Java regex is the following site. https://www.regexplanet.com/advanced/java/index.html
When I use your first example with the second regex we get the following result.
The transform checks the result of Java's matches(), which succeeds only when the whole field matches the pattern, and returns the capture groups if requested.
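For reference, here is a minimal sketch of that behaviour in plain Java, using the first example from the report. The class and method names are made up for illustration and are not taken from the Hop source; only `Pattern` and `Matcher` are real JDK classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not Hop code.
class FullMatchDemo {
    // Returns capture group 1 if the pattern matches the WHOLE input, otherwise null.
    static String fullMatchGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "THIS IS A TITLE <PROCESSING_TAG>";
        // Regex 1 covers the whole string, so matches() succeeds.
        System.out.println(fullMatchGroup(".*<(.*)>", input)); // PROCESSING_TAG
        // Regex 2 only describes the tail of the string, so matches() fails.
        System.out.println(fullMatchGroup("<(.*)>", input));   // null
    }
}
```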
Hi @hansva,
for my tests I used https://regex101.com/ and it gave the results described in the first post.
The difference is in what we intend as a "match" (if the input matches the pattern in whole, or just contains it), and even Java itself offers various methods to determine the different types of match (cfr. https://www.baeldung.com/java-matcher-find-vs-matches). In this case, probably a slight documentation improvement is needed.
I don't know which implementation is "safer" for data analysis, but to specify a full string match I would explicitly supply a RegEx with markers like ^...$. Without those markers I'd expect to check if the pattern is just contained in the string.
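To make the distinction concrete, this is roughly how "contains" semantics look in plain Java with find(), and how the ^...$ markers turn a search back into a full-string check. It is only a sketch: the class and helper names are invented for the example and say nothing about how the transform is implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not Hop code.
class FindVsAnchorsDemo {
    // True if the pattern occurs anywhere in the input (substring search).
    static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Returns capture group 1 of the first occurrence, or null if there is none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "THIS IS A TITLE <PROCESSING_TAG>";
        System.out.println(contains("<(.*)>", input));   // true: found inside the string
        System.out.println(contains("^<(.*)>$", input)); // false: anchors demand a full match
        System.out.println(firstGroup("<(.*)>", input)); // PROCESSING_TAG
    }
}
```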
BTW, my goal was actually to extract the PROCESSING_TAG text (without the angle brackets). I managed it with Regex 1 above; when using Regex 2 I got the unexpected (for me) no match, hence the bug report.
Correct, I think the documentation could use some improvements. In this case the transform is an evaluation, and the assumption is that it evaluates the whole field. The same way as a filter would behave. I think the original mistake was to add capture groups to this transform. It would have been cleaner to have a Regex Extract transform or something similar.
Another option would be to add a setting to the transform that makes it use find(), something like "allow partial matches".
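Just to illustrate the idea, such a switch could look roughly like the sketch below. Everything here is hypothetical: `allowPartialMatches`, the class and the method are invented names, and this is not how the transform is currently written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an "allow partial matches" option, not Hop code.
class PartialMatchOptionSketch {
    // Returns group 0 (the matched text) followed by the capture groups,
    // or an empty Optional when the input does not match.
    static Optional<List<String>> evaluate(Pattern pattern, String input, boolean allowPartialMatches) {
        Matcher m = pattern.matcher(input);
        // find() accepts a match anywhere in the input, matches() requires the whole input.
        boolean matched = allowPartialMatches ? m.find() : m.matches();
        if (!matched) {
            return Optional.empty();
        }
        List<String> groups = new ArrayList<>();
        for (int i = 0; i <= m.groupCount(); i++) {
            groups.add(m.group(i));
        }
        return Optional.of(groups);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("<(.*)>");
        String input = "THIS IS A TITLE <PROCESSING_TAG>";
        System.out.println(evaluate(p, input, false)); // Optional.empty
        System.out.println(evaluate(p, input, true));  // Optional[[<PROCESSING_TAG>, PROCESSING_TAG]]
    }
}
```

Keeping the flag off by default would preserve the current whole-field behaviour for existing pipelines.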
I'm proposing a documentation update (see pull request #4590).
[side note: writing regular expressions in AsciiDoc and making them correctly readable is a P-A-I-N 😓]
Apache Hop version?
2.10.0
Java version?
17.0.2
Operating system
Linux
What happened?
The Regex evaluation transform seems to report a match only if the regular expression matches the entire input field, not just part of it.
Example 1:
THIS IS A TITLE <PROCESSING_TAG>
.*<(.*)>
-> correctly returns a match for the whole string and the group PROCESSING_TAG
<(.*)>
-> returns no match, should instead match the string <PROCESSING_TAG> and the group PROCESSING_TAG
Example 2:
RSSMRA70A01H501X
[A-Z]{6}.*
-> correctly returns a match for the whole string
[A-Z]{6}
-> returns no match, should instead match the first 6 letters RSSMRA
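For comparison, a small plain-Java sketch of Example 2. It assumes the transform behaves like Matcher.matches(), and shows one possible way to still get the first 6 letters under whole-string matching: put them in a capture group and let `.*` consume the rest. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not Hop code.
class PrefixExtractionDemo {
    // Whole-string semantics: the pattern must cover the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Whole-string match that returns capture group 1, or null if there is no match.
    static String groupFromFullMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Substring semantics: returns the first occurrence of the pattern, or null.
    static String firstOccurrence(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "RSSMRA70A01H501X";
        System.out.println(matchesWhole("[A-Z]{6}.*", input));         // true
        System.out.println(matchesWhole("[A-Z]{6}", input));           // false
        System.out.println(groupFromFullMatch("([A-Z]{6}).*", input)); // RSSMRA
        System.out.println(firstOccurrence("[A-Z]{6}", input));        // RSSMRA
    }
}
```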
Issue Priority
Priority: 3
Issue Component
Component: Transforms