[Bug]: Regex evaluation - wrong results

apache / hop

Hop Orchestration Platform

https://hop.apache.org/

Apache License 2.0

[Bug]: Regex evaluation - wrong results #4585

Closed dave-csc closed 1 day ago

dave-csc commented 3 days ago

Apache Hop version?

2.10.0

Java version?

17.0.2

Operating system

Linux

What happened?

The Regex evaluation transform seems to report a match only when the regular expression matches the entire input field, and not when it matches just part of it.

Example 1:

Input field: THIS IS A TITLE <PROCESSING_TAG>
Regex 1: .*<(.*)> -> correctly returns a match for the whole string and the group PROCESSING_TAG
Regex 2: <(.*)> -> returns no match; it should instead match the substring <PROCESSING_TAG> and the group PROCESSING_TAG

Example 2:

Input field: RSSMRA70A01H501X
Regex 1: [A-Z]{6}.* -> correctly returns a match for the whole string
Regex 2: [A-Z]{6} -> returns no match; it should instead match the first 6 letters, RSSMRA

Issue Priority

Priority: 3

Issue Component

Component: Transforms

hansva commented 3 days ago

I am not sure what the ask here is. It's definitely not a bug. What is the goal you are trying to achieve @dave-csc ?

The best way I have found to evaluate Java regex is the following site. https://www.regexplanet.com/advanced/java/index.html

When I use your first example with the second regex we get the following result.

What the transform checks is the result of Java's matches() method, which requires the whole field to match the pattern; if requested, it also returns the capture groups.
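To make the distinction concrete, here is a minimal Java sketch using the two examples from the report. It is not the transform's actual code; the class and method names (MatchesVsFind, fullMatch, partialMatch) are made up for illustration, and only the standard java.util.regex calls are real.

```java
import java.util.regex.Pattern;

final class MatchesVsFind {

    /** True only if the regex matches the entire input (Matcher.matches()). */
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** True if the regex matches any substring of the input (Matcher.find()). */
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String title = "THIS IS A TITLE <PROCESSING_TAG>";
        System.out.println(fullMatch(".*<(.*)>", title));   // true
        System.out.println(fullMatch("<(.*)>", title));     // false
        System.out.println(partialMatch("<(.*)>", title));  // true

        String code = "RSSMRA70A01H501X";
        System.out.println(fullMatch("[A-Z]{6}.*", code));  // true
        System.out.println(fullMatch("[A-Z]{6}", code));    // false
        System.out.println(partialMatch("[A-Z]{6}", code)); // true
    }
}
```

Tools such as regex101 search for the pattern anywhere in the input, which corresponds to partialMatch here, so the same pattern can give different answers in the two settings.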

dave-csc commented 3 days ago

Hi @hansva,

for my tests I used https://regex101.com/ and it gave the results described in the first post.

The difference is in what we mean by a "match" (whether the input matches the pattern as a whole, or just contains it), and Java itself offers different methods for the different kinds of match (cf. https://www.baeldung.com/java-matcher-find-vs-matches). In this case, a slight documentation improvement is probably needed.

I don't know which implementation is "safer" for data analysis, but to specify a full string match I would explicitly supply a RegEx with markers like ^...$. Without those markers I'd expect to check if the pattern is just contained in the string.
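As a rough illustration of that expectation, the sketch below uses Matcher.find() throughout and lets the ^...$ markers decide whether the whole string must match. The class and method names (AnchoredFind, contains) are hypothetical. One caveat: in Java, $ can also match just before a final line terminator, so an anchored find() is close to, but not strictly identical to, matches().

```java
import java.util.regex.Pattern;

final class AnchoredFind {

    /** "Contains" semantics: the regex may match anywhere in the input. */
    static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String code = "RSSMRA70A01H501X";
        // Unanchored: the pattern only needs to occur somewhere in the string.
        System.out.println(contains("[A-Z]{6}", code));     // true
        // Anchored: the six letters would have to be the whole string.
        System.out.println(contains("^[A-Z]{6}$", code));   // false
        // Anchored, with the rest of the string consumed by .*
        System.out.println(contains("^[A-Z]{6}.*$", code)); // true
    }
}
```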

BTW, my goal was actually to extract the PROCESSING_TAG text (without the angle brackets). I achieved it with Regex 1 above; when using Regex 2 I got the (for me) unexpected no match, hence the bug report.

hansva commented 3 days ago

Correct, I think the documentation could use some improvements. In this case the transform is an evaluation, and the assumption is that it evaluates the whole field, the same way a filter would behave. I think the original mistake was to add capture groups to this transform. It would have been cleaner to have a Regex Extract transform or something similar.

Another option would be to add an option to the transform to use find(), something like "allow partial matches".
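A rough sketch of what such a switch could look like in plain Java follows. Everything here is hypothetical: the class RegexEvalSketch, the method firstGroup, and the allowPartialMatches flag are invented for illustration and do not exist in Hop; only the java.util.regex and java.util.Optional calls are real.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexEvalSketch {

    /**
     * Returns the first capture group if the input matches, or empty otherwise.
     * allowPartialMatches is a hypothetical flag: when true the pattern may
     * match anywhere (find()); when false it must match the whole input (matches()).
     */
    static Optional<String> firstGroup(String regex, String input, boolean allowPartialMatches) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean matched = allowPartialMatches ? m.find() : m.matches();
        if (!matched || m.groupCount() < 1) {
            return Optional.empty();
        }
        // group(1) can be null if the group did not participate in the match.
        return Optional.ofNullable(m.group(1));
    }

    public static void main(String[] args) {
        String title = "THIS IS A TITLE <PROCESSING_TAG>";
        System.out.println(firstGroup("<(.*)>", title, false)); // Optional.empty
        System.out.println(firstGroup("<(.*)>", title, true));  // Optional[PROCESSING_TAG]
    }
}
```

Keeping the whole-field behaviour as the default would leave existing pipelines unchanged, with the partial match being opt-in.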

dave-csc commented 2 days ago

I'm proposing a documentation update (see pull request #4590).

[side note: writing regular expressions in AsciiDoc and making them render readably is a P-A-I-N 😓]