regex identification - weird problems

When identifying an identifier using the tests stored in the index, IDFind seems to fail rather strangely.

If we take 10.1186/1758-2946-3-47, the example from the front page, this is successfully matched by the ^((http:\/\/){0,1}dx.doi.org/|(http:\/\/){0,1}hdl.handle.net\/|doi:|info:doi:){0,1}(?P10..+\/.+) regex we wrote at DevXS.

If we then try to match 10.1186/1758-2946-3-49 (last digit changed to 9), that works too. However, if we try to match 10.1186/1755-2946-3-49 (1758 changed to 1755, so just one digit), it fails!

This shouldn't happen according to my reading of the regex, which allows a .+ at that point, so changing the digit from 8 to 5 shouldn't stop it from matching.

I'll investigate this later by compiling the regex with re in the REPL to figure out what's wrong.
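A REPL check along those lines might look roughly like the sketch below. The named group in the pattern as quoted above has lost its name, so the group name `id` here is a placeholder assumption, not necessarily what the index stores; the rest of the pattern is copied as written.

```python
import re

# Pattern copied from the issue text. ASSUMPTION: the group name "id" is a
# placeholder, since the name is missing from the pattern as quoted.
DOI_PATTERN = re.compile(
    r"^((http:\/\/){0,1}dx.doi.org/|(http:\/\/){0,1}hdl.handle.net\/|doi:|info:doi:){0,1}"
    r"(?P<id>10..+\/.+)"
)

candidates = [
    "10.1186/1758-2946-3-47",  # front-page example
    "10.1186/1758-2946-3-49",  # last digit changed to 9
    "10.1186/1755-2946-3-49",  # 1758 changed to 1755
]

# Record whether the pattern alone accepts each candidate.
results = {c: DOI_PATTERN.match(c) is not None for c in candidates}

for candidate, matched in results.items():
    print(candidate, "matches" if matched else "does not match")
```

If all three candidates match here, the regex itself is not what rejects the third identifier.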

Wow, my mind is surprisingly vacant at certain times. The regex identification works just fine - the problem is that we've ALSO got a feature which, given a URL to resolve against, checks whether the identifier is actually what we think it is.

E.g. the Digital Object Identifier regex may match a given identifier (such as the last two examples above), but only ONE of them is ACTUALLY an assigned DOI. If you try resolving them both at http://dx.doi.org/, you will find that 10.1186/1758-2946-3-49 resolves fine. Changing an arbitrary digit (the 8 in 1758 to a 5) results in a technically valid, but non-assigned DOI: 10.1186/1755-2946-3-49.

So our service (correctly) decides that string is NOT a DOI.
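To make the two stages concrete, here is a minimal sketch of "shape check, then resolution check". This is not IDFind's actual code: the function names, the simplified pattern, and the injectable `resolver` argument are all assumptions for illustration. The default resolver simply asks http://dx.doi.org/ and treats any HTTP or network error as "not assigned", which is a rough heuristic (a target site that rejects the request would also be reported as unresolved).

```python
import re
import urllib.request

# ASSUMPTION: simplified DOI shape check, not the pattern stored in the index.
DOI_SHAPE = re.compile(r"^(doi:|info:doi:){0,1}(?P<id>10\..+\/.+)")


def resolves_at_doi_org(doi, timeout=10):
    """Hypothetical helper: True if http://dx.doi.org/<doi> answers without error."""
    try:
        with urllib.request.urlopen("http://dx.doi.org/" + doi, timeout=timeout):
            return True
    except OSError:
        # urllib.error.URLError and HTTPError (e.g. a 404 for an unassigned
        # DOI) are subclasses of OSError, as are socket timeouts.
        return False


def looks_like_doi(candidate):
    """Stage 1: does the string have the shape of a DOI?"""
    return DOI_SHAPE.match(candidate) is not None


def is_assigned_doi(candidate, resolver=resolves_at_doi_org):
    """Stage 2: shape check first, then ask the resolver about the bare DOI."""
    match = DOI_SHAPE.match(candidate)
    if match is None:
        return False
    return resolver(match.group("id"))
```

Passing the resolver in as an argument keeps the network call swappable, so the logic can be exercised offline with a stand-in such as `lambda doi: doi in known_dois`.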

Well, good to confirm that feature's working properly...
