OpenRefine / OpenRefine

OpenRefine is a free, open source power tool for working with messy data and improving it
https://openrefine.org/
BSD 3-Clause "New" or "Revised" License

P304 as reference breaks the Wikidata schema #3274

Closed olea closed 3 years ago

olea commented 3 years ago

Hi:

Here using OR v3.4 in Linux. Uploading data to Wikidata.

I found that using P304 breaks the schema: the UI doesn't provide any clear feedback, and when looking at the logs I got this:

21:18:01.030 [                  command] Exception caught (15ms)
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 6
(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?(, ?(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?)*
      ^
    at java.util.regex.Pattern.error(Pattern.java:1969)
    at java.util.regex.Pattern.group0(Pattern.java:2908)
    at java.util.regex.Pattern.sequence(Pattern.java:2065)
    at java.util.regex.Pattern.expr(Pattern.java:2010)
    at java.util.regex.Pattern.group0(Pattern.java:2919)
    at java.util.regex.Pattern.sequence(Pattern.java:2065)
    at java.util.regex.Pattern.expr(Pattern.java:2010)
    at java.util.regex.Pattern.compile(Pattern.java:1702)
    at java.util.regex.Pattern.<init>(Pattern.java:1352)
    at java.util.regex.Pattern.compile(Pattern.java:1028)
    at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.getPattern(FormatScrutinizer.java:68)
    at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.scrutinize(FormatScrutinizer.java:80)
    at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinizeSnakSet(SnakScrutinizer.java:71)
    at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinize(SnakScrutinizer.java:64)
    at org.openrefine.wikidata.qa.scrutinizers.StatementScrutinizer.scrutinize(StatementScrutinizer.java:36)
    at org.openrefine.wikidata.qa.EditInspector.inspect(EditInspector.java:107)
    at org.openrefine.wikidata.commands.PreviewWikibaseSchemaCommand.doPost(PreviewWikibaseSchemaCommand.java:92)
    at com.google.refine.RefineServlet.service(RefineServlet.java:187)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
    at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
    at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I'm not sure if this happens with any value for P304; in my case the precise value was 69-91.

An example of the reference I intend to add is exactly the one for P31 at Q100407249.

Without using P304 everything seems to work as expected.

Hope this helps.

wetneb commented 3 years ago

Looking at the stack trace that's a clear bug: since these regular expressions are user-supplied, we should make sure that they are ignored if they are invalid. They should be ignored rather than reported to the user since the user is not responsible for maintaining Wikidata constraints.

The regular expression currently present on P304 is

(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?(, ?(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?)*

which is incorrect in all regex variants known to https://regex101.com/
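
For what it's worth, the failure is easy to reproduce with java.util.regex alone. A minimal standalone sketch (not project code; the shortened fragment of the constraint is only for brevity):

import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ReproduceP304Regex {
    public static void main(String[] args) {
        // "(?" followed by '[' is not a valid group opener in java.util.regex:
        // the parser expects an inline flag or one of ':', '<', '=', '!' there.
        String fragment = "\\d+(?[A-Za-z]+)?";
        try {
            Pattern.compile(fragment);
        } catch (PatternSyntaxException e) {
            // Prints "Unknown inline modifier near index ..." as in the stack trace above.
            System.out.println(e.getMessage());
        }
    }
}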

wetneb commented 3 years ago

Note that we should also evaluate whether there are security vulnerabilities here: there might be options in Pattern.compile to reject regular expressions whose compilation/evaluation would be too expensive (DoS-like attack).
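
As far as I know, Pattern.compile itself does not expose any option to bound compilation or matching cost. One workaround sometimes used (a hedged sketch with hypothetical names, not project code) is to cap matching time by wrapping the input CharSequence so that it aborts once a deadline has passed:

import java.util.regex.Pattern;

// Hypothetical helper: the matcher calls charAt() constantly while backtracking,
// so an exceeded deadline is detected quickly even for pathological regexes.
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    static DeadlineCharSequence withTimeout(CharSequence inner, long timeoutMillis) {
        return new DeadlineCharSequence(inner, System.nanoTime() + timeoutMillis * 1_000_000L);
    }

    private DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    @Override public char charAt(int index) {
        if (System.nanoTime() > deadlineNanos) {
            throw new IllegalStateException("regex evaluation timed out");
        }
        return inner.charAt(index);
    }

    @Override public int length() { return inner.length(); }

    @Override public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    public static void main(String[] args) {
        // Usage sketch: a catastrophic-backtracking pattern aborts instead of hanging.
        Pattern risky = Pattern.compile("(a+)+$");
        CharSequence input = withTimeout("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab", 200);
        try {
            System.out.println(risky.matcher(input).matches());
        } catch (IllegalStateException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}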

VDK commented 3 years ago

I have the same issue though I was using page(s) as a qualifier. The preview tab states "Your schema is incomplete, fix it to see the preview."

wetneb commented 3 years ago

@VDK sorry about this! You should be able to temporarily fix this by fixing the regular expression in the property constraint of "page(s)".

olea commented 3 years ago

Just for the record:

The bug is reported at Wikidata P304 discussion page.

verdy-p commented 3 years ago

Something was changed in the support of non-capturing groups (in the past the colon : was not always needed after (?, but now it is).

This may have been caused by a more recent change in the regexp engine used by Wikidata (now compatible with PCRE 7.7 or higher, and supporting the "PCRE v2" syntax, where (? can be followed by more flags, notably since it now supports named subroutines and other newer features of PCRE v2).

I commented, and fixed that in the P304 talk page.

wetneb commented 3 years ago

Thanks! Let's leave this issue open since we can improve how such cases are handled in OpenRefine.

verdy-p commented 3 years ago

Note: the bot that updates the constraint report has still not run, so the statistics are not updated yet. So yes, we can leave it open for now, until we get fresher statistics. In the meantime we can still check using the "live" SPARQL request.

When I was pinged, I supposed this was caused by [-–] with the en-dash in the character class (I was wrong: that test did not change the error). I then realized that the syntax (? ... ), which was accepted in the past for non-capturing groups, is no longer valid without the colon, (?: ... ). I think this occurred after the upgrade of PCRE to support extended PCRE (aka "PCRE v2"), even if this was already supported in PCRE 7.7+ when it added more flags after (?, and even required at least one flag character in [:P'<] or longer "named" flags like (?(DEFINE) ... ): it then became impossible to delimit arbitrary regexps after the opening (? for groups, and the explicit colon became required, or could otherwise sometimes generate validation errors, while in the past it was not necessary and the "non-capturing group" semantic was implied by default.

Anyway, this means that the regexp validator is still not correct in Wikidata: the error is detected a long time after, when the regexp starts being used to check actual data. Even if the regexp is now invalid, it is still accepted "as is", and it seems that the validator just checks that parentheses, braces or brackets are correctly paired; it does not check operators with more complex structure and their flags.

wetneb commented 3 years ago

@verdy-p thanks for the analysis! That might be worth reporting to the Wikidata team then.

lydiapintscher commented 3 years ago

Does one of you have a regular expression and a test case that shows what doesn't work? Sorry it's a bit hard for me to piece together what the problem is now.

adding @lucaswerkmeister and @addshore

lucaswerkmeister commented 3 years ago

Anyway, this means that the regexp validator is still not correct in Wikidata: the error is detected a long time after, when the regexp starts being used to check actual data. Even if the regexp is now invalid, it is still accepted "as is", and it seems that the validator just checks that parentheses, braces or brackets are correctly paired; it does not check operators with more complex structure and their flags.

There is no regexp validator in Wikidata. format as a regular expression is a string property – for all Wikidata cares, you can put Malbolge code in there and it’ll still be saved. Different users can interpret the values differently – for example, according to the format constraints documentation, KrBot uses PCRE whereas WikibaseQualityConstraints effectively uses java.util.regex.

I’m guessing that OpenRefine should catch the error when parsing the regex, and proceed as if no pattern had been defined on the property (possibly after showing a warning to the user, if possible). Meanwhile, for P304 specifically I suppose there’s no harm, and some gain, in changing the regex on Wikidata from (?…) to (?:…).

(For what it’s worth, WikibaseQualityConstraints currently reports that (?'r'(?'p'(?'d'\d+)(?'w'[A-Za-z]+)?|\g'w'(?:\g'n'\g'w'?)?)(?:[-–]\g'p')?)(?:,\s?\g'r')* is not a valid regular expression.)

wetneb commented 3 years ago

Yes, there is clearly a bug to fix on our side. I am not sure which validation @verdy-p was referring to in this part:

Anyway, this means that the regexp validator is still not correct in Wikidata: the error is detected a long time after, when the regexp starts being used to check actual data. Even if the regexp is now invalid, it is still accepted "as is", and it seems that the validator just checks that parentheses, braces or brackets are correctly paired; it does not check operators with more complex structure and their flags.

If that validation is not in Wikibase itself then we need to figure out where this is coming from! (In that case, very sorry for the false alarm, @lydiapintscher and @lucaswerkmeister.)

ostephens commented 3 years ago

@lucaswerkmeister I updated the P304 "format as a regular expression" property over a week ago to fix the problem so it's now (?'r'(?'p'(?'d'\d+)(?'w'[A-Za-z]+)?|\g'w'(?:\g'd'\g'w'?)?)(?:[-–]\g'p')?)(?:,\s?\g'r')*

Is there any reason WikibaseQualityConstraints is still reporting on the older expression?

lucaswerkmeister commented 3 years ago

I don’t follow – I still see a bunch of (?…) in that regex, presumably WBQC is complaining about those. Or is (?'…) special syntax? (I didn’t find anything like it in the Pattern javadoc.)

ostephens commented 3 years ago

Those are subroutine call groups, which are supported in PCRE. You can see it validates on regex101.

The correction I made was to fix a typo in the expression (\g'n' instead of \g'd') which made it invalid whether subroutine call groups are supported or not.

verdy-p commented 3 years ago

I don’t follow – I still see a bunch of (?…) in that regex, presumably WBQC is complaining about those. Or is (?'…) special syntax? (I didn’t find anything like it in the Pattern javadoc.)

Have you read the talk page of P304? It is explained there, and yes, it is a PCRE feature (named subroutines). The Pattern javadoc is about the more limited patterns in Java, not the PCRE patterns used in Wikidata (or in a PCRE extension library for Java).

Also care was taken to flag non-capturing groups, instead of using simple parentheses for alternates (they would create captures by default without the flags).

Yes, there was a typo: the name 'n' (undefined) instead of 'd' (correctly defined) was referenced by a subroutine call, but it was fixed later. Note that I recently added the acceptance of semicolons in addition to commas for lists of page numbers or page ranges.

And nobody ever tried to "ping" me on Wikidata; I would have seen the notification and could have solved it much sooner! The regex101 site does accept this PCRE syntax (once the 'n' was fixed into 'd'; it was wrong in only one of the two copies of the regexp shown in the property: the one for the base property was correct, while the one for the constraint had the wrong, undefined subroutine name 'n' instead of 'd', which was defined correctly).

lucaswerkmeister commented 3 years ago

Well, if you want the regex to be understood by WikibaseQualityConstraints and OpenRefine, you’ll need to stick to the common subset of PCRE and java.util.regex.
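
To illustrate the "common subset" point with a quick standalone sketch (not project code; these shortened patterns are invented for the example): the subroutine form is rejected by java.util.regex, while an expanded form that only uses (?:...) groups compiles and matches in both flavours.

import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CommonSubsetDemo {
    public static void main(String[] args) {
        // PCRE-style named subroutine definition and call: not supported by java.util.regex.
        String pcreOnly = "(?'d'\\d+)(?:[-–]\\g'd')?";
        // The same idea restricted to the common subset: no subroutines, only (?:...) groups.
        String portable = "\\d+(?:[-–]\\d+)?";

        for (String regex : new String[] { pcreOnly, portable }) {
            try {
                boolean ok = Pattern.compile(regex).matcher("69-91").matches();
                System.out.println(regex + " -> " + ok);
            } catch (PatternSyntaxException e) {
                System.out.println(regex + " -> rejected by java.util.regex");
            }
        }
    }
}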

wetneb commented 3 years ago

Which suggests that we should consider using a different regex engine whose features match Wikibase's better.

verdy-p commented 3 years ago

But Wikibase uses PCRE (in fact it is SPARQL which uses PCRE, according to the W3C definition of SPARQL, which indicates which version of PCRE it matches), and Wikibase should use the same choice as SPARQL. Wikibase and SPARQL both support "named capture groups" the way they are used and described in Wikidata's P304. So the regex101 website shows that this syntax is valid, and it works. It was correct in Wikidata in one of the properties and in the description; only one Wikidata property used the wrong name 'n' instead of 'd' because of a typo.

Note that I have also added '^...$' anchors to the regexp to avoid partial matches: the regexp must match the full string to avoid false positives (where there could be some number anywhere in the string).

Also, I recently added the acceptance of the semicolon, like the comma, for lists, even if I think that we could avoid comma-separated lists by using separate properties (so that each range could have its own "usage" or "applies to" qualifiers, instead of accepting padded annotations, often in parentheses, after a page number, such as "(printed edition)" or "(digital edition)", or different paginations depending on the publishing format or a reedition with extra illustrations or different design layouts changing the pagination; the regexp does not tolerate such annotations in free text, which would also need translation, whilst qualifiers to the property are more convenient and offer better reusability of this data in different contexts of presentation or with languages needing other punctuation or ordering).

With these two additions, the number of violations has been reduced again (and it allowed me to see that some "pages" values contained dates or years, sometimes without even parentheses, so these were ambiguous in terms of data)...

verdy-p commented 3 years ago

Also note that Wikidata really HAS a validator for regexps; technically the reference to \g'n' was not syntactically incorrect, it was just not working because the name 'n' of the subroutine was not defined in the regexp itself before its use by " \g'n' ". Wikidata validates the syntax, but does not check that named subroutines are defined. And in regex101 a subroutine call to an undefined subroutine name is still valid, but it matches nothing, as expected. You could use \g'dummy' to force a regexp to not match, for example in one of the alternatives, to exclude some forbidden prefixes or some prefixes that are kept "reserved" and not to be used in matches. However such a thing is not recommended; we should just reference subroutine names that are defined.

All this description was already in the comment I posted in the talk page of P304, describing that feature of PCRE and giving enough hints about how to interpret them, and why it was used (allowing a very compact regexp and generalization).

But the recent addition of the '^...$' anchors made the regexp match faster (notably the first anchor), so that it can process many more Wikidata properties by avoiding many partial false positives and retries from other positions to scan the entire value (initially Wikidata used SPARQL by implicitly adding these '^...$' anchors around the specified regexp; this is no longer the case).

lucaswerkmeister commented 3 years ago

But Wikibase uses PCRE (in fact it is the SPARQL which uses PCRE according to the definition of SPARQL by the W3C, which indicates which version of PCRE it matches).

Where do you get that from? The SPARQL 1.1 REGEX function is defined in terms of XPath fn:matches, using XPath and XQuery regular expression syntax, which in turn is based on XML Schema regular expressions. None of these standards mention PCRE, as far as I can tell; the latter two have a few mentions of Perl, but only of an informational nature (e.g. based on the established conventions of languages such as Perl).

But that’s kind of beside the point, since Blazegraph doesn’t implement XPath regexes – it uses java.util.regex's Pattern class. You can test this e.g. using the pattern \p{javaLowerCase} (Virtuoso confirms that’s not a valid SPARQL regex).

Note that I have also added '^...$' anchors to the regexp to avoid partial matches: the regexp must match the full string to avoid false positives (where there could be some number anywhere in the string).

The format is already interpreted to cover the entire string, at least by WikibaseQualityConstraints (and I believe by KrBot as well). Adding explicit anchors to the pattern is unnecessary.
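
(A quick standalone illustration of the difference in java.util.regex, not WBQC code: whole-string matching makes explicit anchors redundant.)

import java.util.regex.Pattern;

public class AnchoringDemo {
    public static void main(String[] args) {
        Pattern range = Pattern.compile("\\d+[-–]\\d+");
        // matches() requires the whole input to conform, with no explicit ^...$ needed...
        System.out.println(range.matcher("69-91").matches());           // true
        System.out.println(range.matcher("see pages 69-91").matches()); // false
        // ...whereas find() would accept a partial match anywhere in the string.
        System.out.println(range.matcher("see pages 69-91").find());    // true
    }
}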

Also note that Wikidata really HAS a validator for regexps

Where? And if it exists (I’m pretty sure it doesn’t), why did it let me save (((\\\(\((\ as a format?

verdy-p commented 3 years ago

Maybe it has been disabled for you, but I have a validation running that blocks such an insertion when submitting. At least it checks that enclosing delimiters are correctly paired. Maybe it does not run for everyone, or there's a setting.

As well, using unanchored regexps gives more false positives with partial matches (and slower execution of the parser, giving more timeout errors), so anchors are visibly needed. Wikidata should work on stabilizing the regexp engines it uses (including in the tools used in its documentation and in the existing scheduled parsers and reports).

emu-at commented 3 years ago

So if I understand correctly, due to limitations in OpenRefine this problem can’t be fixed unless the project switches to a different regex engine? That’s really sobering … Is there any possibility to add an option in OpenRefine that circumvents the regex check altogether? I mean, it’s not really a problem for Wikidata per se because you can add all sorts of unchecked data through QuickStatements anyway …

ostephens commented 3 years ago

So if I understand correctly, due to limitations in OpenRefine this problem can’t be fixed unless the project switches to a different regex engine?

No I don't think that's the case. As @wetneb mentions right at the top of this thread:

Looking at the stack trace that's a clear bug: since these regular expressions are user-supplied, we should make sure that they are ignored if they are invalid. They should be ignored rather than reported to the user since the user is not responsible for maintaining Wikidata constraints.

While it might also be appropriate for OpenRefine to look at alternative approaches for regular expression parsing, I think the key thing in this case is for there to be a graceful failure when encountering regular expressions that can't be parsed for some reason. This should (in my opinion) include telling the user that the condition can't be checked, but not prevent the user from uploading the data in this case (note that there is a difference here between data that definitely doesn't match a regular expression and data that cannot be checked against a regular expression - I'd suggest we are strict with the former and relaxed with the latter).

wetneb commented 3 years ago

This issue is really easy to solve as far as OpenRefine is concerned: just by ignoring regexes which our regex engine cannot parse.
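
Roughly something like this (a minimal self-contained sketch of the idea; the class and method names are hypothetical, not the actual FormatScrutinizer code):

import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LenientFormatCheck {

    // Compile a user-supplied constraint regex, returning null when it cannot be parsed.
    static Pattern compileOrNull(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            // Constraint regexes on Wikidata are user-edited and may use PCRE-only syntax:
            // skip the format check for that property instead of failing the whole preview.
            return null;
        }
    }

    static boolean conformsToFormat(String value, String regex) {
        Pattern pattern = compileOrNull(regex);
        if (pattern == null) {
            return true; // cannot be checked, so do not report a violation
        }
        return pattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        // The invalid P304 constraint no longer aborts the preview; "69-91" simply goes unchecked.
        System.out.println(conformsToFormat("69-91", "(\\d+(?[A-Za-z]+)?)"));
    }
}

Caching the compiled (or rejected) patterns per property would avoid recompiling on every statement, but that is an optimization detail.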

verdy-p commented 3 years ago

As far as I can tell, the only thing your engine does not support is the PCRE subroutine syntax, which is used here only as a "macro"-like facility (i.e. without recursion), so an equivalent is easy to produce. In fact the transform from the expanded form to the compressed form using subroutines is straightforward and fully described in the talk page. It was used to "compress" the regexp into a shorter form, which is then easier to maintain.

Supporting basic subroutines that are "macro-like" (i.e. not containing any self-recursion) is straightforward in a regexp engine, notably because I had also taken care to use non-capturing groups with "(?: ... )", so there are no side effects on the capture-group numbering, and there are likewise no self-recursive references "\1" ... "\9" to numbered capturing groups in the search pattern (which are ALSO subroutines, except that they are "named" implicitly with numbers assigned to unflagged "( ... )" capture groups instead of symbolic names).

For some more complex regexps elsewhere (e.g. OpenRefine), it is necessary to define and use these subroutines, due to length constraints in Wikidata, but also because of side effects on capture-group numbering: it would force all these common subpatterns to flag internal non-capturing groups explicitly, and the regexp rapidly explodes in length, becoming longer than Wikidata's length constraint for a single property (and such a long regexp becomes difficult for humans to edit).

Named subroutines in regexps are a great addition in PCRE (and in some other engines, sometimes with an alternate syntax; note that PCRE also supports their alternate equivalent syntax like "?DEFINE:" or using other delimiters for the subroutine name such as "").

Wikidata has full support of PCRE in its implemented tools. Other external tools may need to see how they can "translate" the regexp engine dialects, but PCRE is the best documented and supported engine and one of the fastest, also with full support of Unicode (not limited to the 7-bit or 8-bit character classes of many basic engines, which are not suitable at all for Wikidata).

verdy-p commented 3 years ago

Note: could Wikidata indicate which regexp syntax (possibly with a version, for PCRE) it requires, and allow providing alternate regexps demanding fewer features or using an alternate syntax, even if they don't validate the value as strictly as the preferred one used in Wikidata's own built-in tools, or are more memory-intensive when compiling, less performant while scanning, and more difficult for humans to maintain?

lucaswerkmeister commented 3 years ago

Wikidata has full support of PCRE in its implemented tools.

Which tools, specifically, are you talking about? Because, again, neither WikibaseQualityConstraints nor the Wikidata Query Service support PCRE.

verdy-p commented 3 years ago

Well, all the validation tools and bots running and serving quality reports on Wikidata support these named subroutines (I've made multiple runs to show that they are effective and don't generate any additional errors when processing, nor false positives, nor the false negatives we would see if these subroutines were not working; and they DO have the same syntax as PCRE, even if PCRE is not the engine actually used).

I've tested them with the existing constraint templates used at the top of talk pages for Wikidata elements, posting links to the scheduled generated reports, or showing links to run a SPARQL request (this works best with the "new" SPARQL link, which is much faster) and returning results (cached for some hours) in a browsable data table.

tfmorris commented 3 years ago

Please move the Wikidata discussion someplace more appropriate. Let's keep this issue focused on the solution to handling illegal regular expressions in OpenRefine, which, as @wetneb said, is as simple as adding a try/catch block, thus the good first issue label. All this extraneous discussion is just going to confuse/discourage newcomers who are trying to figure out what they need to code for a pull request.

verdy-p commented 3 years ago

But this discussion really started with a bug report on Wikidata (even if it causes a break in OpenRefine), so it was on topic. And I still don't see why or what is broken in Wikidata itself, when we can conclude that this is a problem of an external tool using a different regexp engine than those working on Wikidata and its locally supported tools for data validation.

As OpenRefine is independent of Wikidata (and has a very different proprietary origin, from the former Google Refine, before it was open sourced), and is supposed to be used to validate many other data sources (including but not only Wikidata), the fact that it still does not support PCRE's syntax for named capture groups and subroutines is a bug in OpenRefine, which is not as "universal" as it should be. It will need work to make it use a more powerful regexp engine (PCRE or another), and possibly support translating several regexp dialects (like on the regex101.com website).

OpenRefine is supposed to use plugins/extensions to support custom validators, but supporting PCRE should be integrated by default (because it is one of the best-performing regexp libraries, used in so much software and even integrated into several programming languages, but also because it is the world's reference for all advanced features and leads in terms of performance and stability).

You need advanced support, for example, to clean up and normalize a database of customer addresses or phone numbers (with lots of conventional formats depending on countries or regions). The basic regexps supported in the standard GREL extension of OpenRefine are clearly insufficient for such cases, unless you force many databases to be excessively cleaned up by rejecting valid data or transforming it in an ambiguous or lossy way (thus making your database or data source much less valuable).

Sometimes it's even best to not validate such data and allow free form (with just minimal cleanup such as whitespace compression, like in XML, and not even any attempt to change the letter case, or to remove diacritics and format controls for texts in a language you don't really understand but that should then be left "as is", if you don't have advanced support of Unicode): you may attempt to normalize what you understand, and add some flag data in your database for things that are left in free form as exceptions to be handled manually or validated by other tools.

Alternatives to PCRE are possible (such as ICU, which also has its own engine with excellent, reference-quality support for safe Unicode text handling); the old legacy regexp engine in the Java JRE is also no longer a reference, and works "as is" with its limitations, just like the "pattern" engine in Lua, which is even more limited (no support for alternatives, and very basic support for Unicode with only a few character properties, just extended from legacy POSIX character classes, e.g. in the "mw.ustring" Lua package built into the Scribunto extension for MediaWiki: it works too, yes, but it is very slow, very memory- and CPU-intensive, and lacks many features).

Those legacy engines are actually not meant to handle large volumes of data and work poorly today in a very internationalized environment; they exist only to validate small input forms, with active interaction with the final user so that they can solve whatever may be blocking their input in a meaningful way (better than what an automated cleaner would propose, with data truncation, ambiguous abbreviations or false reinterpretations; this is the same situation as when choosing between an automatic translator and the services of an actual human translator: one solution is cheap but has low-quality results, it solves fewer problems than a more universal solution, and sometimes it may also be much slower than modern solutions with useful extensions that are now well supported and handle "border" cases not handled correctly by legacy solutions).
