Out of spec behavior of the xforms regex function

getodk / javarosa

The core library that many of the ODK tools are built around. It's written in Java, implements the ODK XForms spec, and runs on mobile devices and cloud servers. ✨🏗✨

Other

54 stars 106 forks source link

Out of spec behavior of the xforms regex function #531

Open ggalmazor opened 4 years ago

ggalmazor commented 4 years ago

Problem description

This is what the ODK XForms specs say about the regex(string value, string expression) --> boolean

Returns result of regex test on provided value. The regular expression is created from the provided expression string ('[0-9]+' becomes /[0-9]+/).

This means that the expression should match any part of the provided value unless ^ and $ anchors are used in the provided expression.

However, JavaRosa matches the whole value, as if the provided expression always included the ^ and $ anchors.

Steps to reproduce the problem

Apply this patch odk_xforms_regex_regression.zip and run XFormsRegexTest:

@Test
public void name() throws IOException {
    Scenario scenario = Scenario.init("Some form", html(
        head(
            title("Some form"),
            model(
                mainInstance(t("data id=\"some-form\"",
                    t("regex-result-1"),
                    t("regex-result-2")
                )),
                bind("/data/regex-result-1").type("boolean").calculate("regex('cocotero', 'cocotero')"),
                bind("/data/regex-result-2").type("boolean").calculate("regex('cocotero', 'te')")
            )
        )
    ));
    assertThat(scenario.answerOf("/data/regex-result-1"), is(booleanAnswer(true)));
    assertThat(scenario.answerOf("/data/regex-result-2"), is(booleanAnswer(true)));
}

Expected behavior

Both boolean fields should contain true after form initialization.

Other information

JavaRosa implements the regex function in XPathFunxExpr.regex(Object, Object) (line 1267):

public static Boolean regex(Object o1, Object o2) {
    String str = toString(o1);
    String re = toString(o2);

    return Pattern.matches(re, str);
}

The used Pattern.matches(String, CharSequence) method matches against the whole string. Changing it to return Pattern.compile(re).matcher(str).find(); should do the trick.

ggalmazor commented 4 years ago

We need to discuss whether it would be safe to fix this bug in case users rely on JavaRosa's current behavior, even if it's not up to spec.

@tiritea commented in Slack:

I'd like to bring @lognaturel and @martijnr into the loop here. My concern is that 'fixing' this bug could potentially break a lot of existing forms out there. Specifically, a newbie may well write a constraint "IDs must be 10 alphanumeric characters" as regex(., '[A-Za-z0-9]{10}'). And by and large this would appear to work, but unless you thoroughly tested it you wouldn't see that your regex would also match 11+ characters too!

As regex() is non-standard in any case, a possible option is to deprecate it, in favor of the XPath2.0 matches() function: https://www.w3.org/TR/xquery-operators/#func-matches which as specified does the proper substring match. [and add a HUGE warning that Collect's regex() performs a full string match only!]

But end of day, we currently have a highly undesirable different behavior between Collect and Enketo when it comes to regex() :tired_face:

I do think javaRosa's regex() is broken, but I'm not convinced fixing it - ie using java Pattern.find() instead - is necessarily the best solution. Thoughts y'all?

tiritea commented 4 years ago

additional background from slack:

so it looks like there might be a 'bug' in javarosa regex, in that your provided regex expression must match the entire string - ie implicitly adding a ^...$ around the expresion - as opposed to finding any substring that matches (which is the correct behavior, and which Enketo does). Specifically, in the code https://github.com/opendatakit/javarosa/blob/master/src/main/java/org/javarosa/xpath/expr/XPathFuncExpr.java:public

static Boolean regex(Object o1, Object o2) {  
 String str = toString(o1);    
 String re = toString(o2);    
 return Pattern.matches(re, str);  
}

but java.util.regex.Pattern.matches() is defined as:

public boolean matches()

Attempts to match the *entire* region against the pattern.

So I think we should be using public java.util.regex.Pattern.find() instead.

public boolean find()

Attempts to find the next subsequence of the input sequence that matches the pattern.

lognaturel commented 4 years ago

I have indeed never noticed this inconsistency. The historical context is that when JavaRosa was created, there wasn't really any consideration to compatible engines and a number of decisions it made are fairly Java-centric (down to the name). As far as I can tell, the form specification part of the OpenRosa standards was not particularly well-documented. @martijnr will need to confirm or deny this since I wasn't there but my understanding is that when he started building Enketo, he largely worked off of JR source and bits and pieces of documentation here and there and then he's the one who wrote the first version of https://opendatakit.github.io/xforms-spec/ (thank you!).

My guess is that the header comment on XPathFuncExpr.regex uses "matches" in the Java sense but @martijnr understandably took it differently. I believe the intent was as implemented -- to only consider entire matches. This is what users typically want and having to remember to add anchors is confusing for non-technical users. As @tiritea notes, leaving the possibility for an unintended partial match is dangerous. Since the spec was written based on an existing implementation in that case, I would consider the specification to be incorrectly expressed. If possible, my preference would be for modifying the spec for regex. If we really believe that there's a need for a partial match -- do you actually have a need for it, @tiritea? -- we could add the XPath 2.0 matches with a very explicit warning about how it differs from regex.

MartijnR commented 4 years ago

Ah sorry about my mistake. I had no idea about the difference in implementation either and years back probably would have written the spec for regex() differently if I had realized that Javarosa's implementation was like this (also in Commcare's spec @ctsims ).

I do actually prefer the spec as currently written though since it is more powerful and I don't have any problem at all with requiring a ^, $ to be added manually by users (since regexes are complex anyway). If starting from scratch I would have advocated for an implementation like matches().

However, the issue is changing behavior for either Enketo or Collect. Not sure what the best solution is atm.

tiritea commented 4 years ago

I’d probably say we should make the non-standard regex() perform a full match, as Collect does today (sorry, more work for you @MartijnR), fix the ODK spec to reflect that the expression will be effectively anchored with ^ and $ whether you like it or not. And implement XPath matches() function on both platforms [its a real PITA to do a substring match right now if that is actually what you need].

What say thee?

tiritea commented 4 years ago

And yes, there are situations when a partial match is required; this all came about from a Kobo user trying to figure out why his Enketo constraint was working but didn’t in KoboCollect [so no, it wasn’t just me being overly pedantic, per usual... :-) ]

tiritea commented 4 years ago

I must add, to acknowledge @MartijnR 's hard work in 'correctly' implementing regex() in Enketo - and now having to potentially 'break it' to maintain compatibility with Collect/javaRosa - that...

No good deed goes unpunished.

😁

ctsims commented 4 years ago

For clarity on our side, our implementation matches the expected behavior, and I believe always has.

It looks like there was drift in the ODK fork at some point historically (Github's history for the repo only goes back April 2016, so hard to say ) when transitioning from the wonky "regexme" microedition regular expression library onto core Java regex support in which the behavior was changed from a pattern match to a full text match.

As such, we're always happy for changes which bring the two library's functioning closer together! In the short term thanks for flagging me in case we want to migrate functionality explicitly for imported forms.

lognaturel commented 4 years ago

Thanks for reporting back, @ctsims! Well, that's frustrating. https://github.com/opendatakit/javarosa/commit/389f5abeb26c1ad35a5e5d29cdd83e26673994ce is the bad commit. So yes, indeed, the JavaRosa behavior is not as it was initially intended to be. I apologize for speculating incorrectly on when the deviation may have come in.

The change to the incorrect behavior dates back to 2014. Perhaps we just say we're fixing a bug and hope that users read the documentation and anchored their expressions as needed. The risk is that we'd be letting through matches with spurious characters around them but that's something that can be recovered from in analysis in at least some cases.

EDIT: no, it's too dangerous to modify this behavior. We'd at the very least need to have to have some kind of warning period, I think. Perhaps introducing matches as an alternative would be reasonable.

adrianocunha3000 commented 3 years ago

I think one new function with the correct behavior (requiring "^$" for full match) should be added and another new function with the current behavior should also be offered as an alternative (with warnings that it may be deprecated [1]) and the current bugged regex() function should be deprecated (transition period and all) and the name regex() retired.

You have to pay the piper - the sooner the better...

[1] Or not - it may be a new feature.