CottageLabs / idfind

An identifier identifier

regex identification - weird problems #27

Closed: emanuil-tolev closed this issue 12 years ago

emanuil-tolev commented 12 years ago

When identifying an identifier using the tests stored in the index, IDFind seems to fail rather strangely.

Take 10.1186/1758-2946-3-47, the example from the front page. It is successfully matched by the regex we wrote at DevXS: ^((http:\/\/){0,1}dx.doi.org/|(http:\/\/){0,1}hdl.handle.net\/|doi:|info:doi:){0,1}(?P10..+\/.+)

If we then try to match 10.1186/1758-2946-3-49 (last digit changed to 9), that works too. However, if we then try 10.1186/1755-2946-3-49 (1758 changed to 1755, so just one more digit changed), this fails!

This shouldn't happen according to my reading of the regex, which allows a .+ at that point, so changing the digit from 8 to 5 shouldn't stop it from matching.

I'll investigate this later by compiling the regex with re in the REPL and trying to figure out what's wrong.
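
A minimal sketch of that REPL check, for anyone who wants to reproduce it. The name of the capturing group is not visible in the regex as pasted above, so "id" here is an assumption; the rest of the pattern is copied as-is.

```python
import re

# Pattern from this issue. The group name "id" is an assumption:
# the original name is not visible in the pasted regex.
DOI_PATTERN = re.compile(
    r"^((http:\/\/){0,1}dx.doi.org/|(http:\/\/){0,1}hdl.handle.net\/|doi:|info:doi:){0,1}"
    r"(?P<id>10..+\/.+)"
)

candidates = [
    "10.1186/1758-2946-3-47",  # front-page example
    "10.1186/1758-2946-3-49",  # last digit changed to 9
    "10.1186/1755-2946-3-49",  # 1758 changed to 1755
]

# Record whether the pattern alone accepts each candidate.
results = {c: DOI_PATTERN.match(c) is not None for c in candidates}

for candidate, matched in results.items():
    print(candidate, "->", "match" if matched else "no match")
```

If all three candidates match here, the regex itself is not what rejects the third identifier.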

emanuil-tolev commented 12 years ago

Wow, my mind is surprisingly vacant at certain times. The regex identification works just fine - the problem is that we've ALSO got a feature which, given a URL, will check whether the identifier is actually what we think it is.

E.g. the Digital Object Identifier regex may match a given identifier (as it does for both examples above), but only ONE of them is ACTUALLY an assigned DOI. If you try resolving them both at http://dx.doi.org/, you will find that 10.1186/1758-2946-3-49 resolves fine. Changing an arbitrary digit (the 8 in 1758 to a 5) results in a syntactically valid but unassigned DOI: 10.1186/1755-2946-3-49.

So our service (correctly) decides that string is NOT a DOI.

Well, good to confirm that feature's working properly...
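
To make the two stages explicit, here is a rough sketch of the behaviour described above: first the regex match, then a resolution check. This is not IDFind's actual code. The function names, the simplified pattern, and the stub resolver are all hypothetical, and the HTTP helper is only one possible way to test resolution (it follows redirects, so a publisher site that rejects HEAD requests could produce a false negative).

```python
import re
import urllib.request

# Simplified stand-in for the DOI regex; the group name "id" is an assumption.
DOI_PATTERN = re.compile(r"^(doi:|info:doi:)?(?P<id>10\..+/.+)")


def identify_doi(candidate, resolves):
    """Return the DOI only if it matches the pattern AND resolves."""
    match = DOI_PATTERN.match(candidate)
    if match is None:
        return None  # stage 1 failed: not even shaped like a DOI
    doi = match.group("id")
    # Stage 2: a well-formed string still has to be an assigned DOI.
    return doi if resolves(doi) else None


def resolves_via_http(doi, base="http://dx.doi.org/"):
    """Hypothetical live check: does the resolver answer without an error?"""
    request = urllib.request.Request(base + doi, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status < 400
    except OSError:  # covers HTTPError, URLError and socket timeouts
        return False


# Offline stub that mirrors what this thread reports about resolution.
ASSIGNED = {"10.1186/1758-2946-3-49"}


def stub_resolver(doi):
    return doi in ASSIGNED


print(identify_doi("10.1186/1758-2946-3-49", stub_resolver))
print(identify_doi("10.1186/1755-2946-3-49", stub_resolver))
```

Passing the resolver in as an argument keeps the pattern check testable offline; swapping stub_resolver for resolves_via_http would make the second stage hit the network.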