cjlee112 / spnet

selected papers network web engine
http://thinking.bioinformatics.ucla.edu/2011/07/02/open-peer-review-by-a-selected-papers-network/
GNU General Public License v2.0

DOI parsing fails when DOI is not followed by whitespace #98

ketch opened this issue 10 years ago (status: open)

ketch commented 10 years ago

The regex for DOI references can fail if the DOI is not followed by whitespace. This happens, for instance, when the DOI sits at the end of an HTML line, so that it is immediately followed by a tag such as `<br/>` rather than by whitespace.

Currently, this doesn't break anything; the error must be getting corrected somewhere in the DOI lookup.
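
A minimal sketch of the failure mode (the pattern below is a hypothetical reconstruction of the reported behavior, not spnet's actual regex):

```python
import re

# Hypothetical pattern that anchors the end of a DOI on whitespace:
NAIVE_DOI = re.compile(r'doi:(\S+)\s')

print(NAIVE_DOI.findall('see doi:10.1038/nphys1170 for details'))
# ['10.1038/nphys1170']  -- trailing whitespace present, match works

print(NAIVE_DOI.findall('see doi:10.1038/nphys1170'))
# []  -- DOI at the end of the string: no trailing whitespace, no match

print(NAIVE_DOI.findall('see doi:10.1038/nphys1170<br/> and more'))
# ['10.1038/nphys1170<br/>']  -- the <br/> tag is swallowed into the DOI
```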

cjlee112 commented 10 years ago

It seems to me that DOI recognition is turning into a bit of a nightmare, because it consists of TWO incompatible standards: the permissive pre-2009 syntax, in which a DOI suffix may contain almost any printable character, and the restricted post-2009 syntax.

The problem is that when reading a string that starts doi:..., there's no way to know whether it's going to be a pre- or post-2009 DOI, so no single parsing rule is safe for both.

Conundrum 1: the only real test of whether a candidate string is actually a DOI is to try to retrieve the paper for it (more on this below).
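
For concreteness, a hedged sketch of the two regimes as regex character classes; the restricted set below is one reading of the post-2009 CrossRef recommendation (an assumption, not something spnet currently encodes):

```python
import re

# Post-2009 suffixes (assumed restricted set: letters, digits, -._;()/:)
# end wherever the character class ends, so parsing is easy:
POST_2009 = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

# Pre-2009 suffixes may contain nearly any printable character (<, >, #,
# parentheses...), so no character class can tell where the DOI stops:
PRE_2009 = re.compile(r'\b10\.\d{4,9}/\S+')  # inevitably over-matches
```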

One more thought: we need a way to report to users that spnet hit a problem while trying to index their post. Right now people get frustrated because their post vanishes into a black hole (it never gets indexed) and they have no way to know why. Unfortunately we have no direct way to communicate with them: G+ gives us no mechanism for contacting a user, and we can't even send them email (though we could check whether G+ exposes a user's email address). Indeed, we have no way even to know whether they'll ever log in to our site.

cjlee112 commented 10 years ago

Continuing the thought on how to communicate with G+ users: I guess we could automatically generate a list of "problem reports", then paste those as comments on the original G+ posts that had a problem.

cjlee112 commented 10 years ago

@ketch Another thought on conundrum 1: this creates a strong coupling between (A) actually retrieving a paper by its DOI and (B) deciding that a particular string in the post is actually a DOI. The problem is that the only test of B is to do A. This has implications for the indexing design. For example, I liked your design change separating paper-ID recognition from actually getting the paper; this conundrum threatens that design change, though... unless we can find a bulletproof way of extracting DOIs from text.
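
One hedged way to make A serve as the test for B: extract generous candidate strings, then accept only those that dx.doi.org actually resolves. The helper below is hypothetical, not spnet code:

```python
import requests

def is_real_doi(candidate):
    # dx.doi.org answers a HEAD request with a 3xx redirect for a
    # registered DOI and 404 for an unknown one, so resolution
    # doubles as validation.
    url = 'https://dx.doi.org/' + requests.utils.quote(candidate, safe='/')
    resp = requests.head(url, allow_redirects=False, timeout=10)
    return 300 <= resp.status_code < 400
```

With a verifier like this, the recognizer can afford to over-match: pull out generous candidates and let resolution decide, which keeps recognition and retrieval separate in the code while respecting the coupling.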

cjlee112 commented 10 years ago

Seems like we have to just impose a limit on pre-2009 DOIs, e.g. no whitespace or HTML tags, like you did. This seems reasonable in the context of text documents (and HTML in particular).
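
A sketch of that limit as a regex, terminating the DOI at whitespace or an angle bracket (illustrative only, not the actual spnet pattern):

```python
import re

# Assumed rule: "10.", a 4-9 digit registrant prefix, "/", then anything
# up to whitespace or an HTML tag delimiter:
DOI_PATTERN = re.compile(r'\b(10\.\d{4,9}/[^\s<>]+)')

print(DOI_PATTERN.findall('nice result: doi:10.1016/j.cell.2013.09.034<br/>'))
# ['10.1016/j.cell.2013.09.034']
```

The known cost is that genuine pre-2009 DOIs containing < or > get truncated, which is where the next comments pick up.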

semorrison commented 10 years ago

I have a list of all DOIs in (the top 100 or so) math journals. I'll go look for weird characters. My memory is that only World Scientific had <. My suggestion would just be to fail to parse those, and declare it a feature, not a bug.


cjlee112 commented 10 years ago

Hmm, @ketch linked to an SO discussion reporting that a variety of DOIs (e.g. Wiley's) use < and >: http://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page

cjlee112 commented 10 years ago

Now I'm wondering what happens if a user types a < or > sign in a G+ post, tweet, etc. Clearly G+ has to re-encode the < or > so it won't muck with the HTML display of the post. So right out of the gate we know that DOIs containing < or > are going to get stomped on by G+, Twitter, etc. (i.e., they will no longer match the string the user pasted into the post). On the one hand this is good news: we can terminate on < and >. On the other hand it's bad news: we can only handle these DOIs reliably if we can find a universal de-munging function that will recover the original DOI string from whatever changes G+, Twitter, and the rest of the universe applied.
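
For the entity-encoding half, a minimal de-munging step could be plain HTML-unescaping, here using Python 3's standard library; whether that covers everything G+ and Twitter actually do to a post is exactly what would need testing:

```python
import html

# A DOI as it might come back entity-encoded from an HTML post body:
munged = '10.1002/(SICI)1099-1506(199603/04)3:2&lt;173::AID-NLA69&gt;3.3.CO;2-3'

print(html.unescape(munged))
# 10.1002/(SICI)1099-1506(199603/04)3:2<173::AID-NLA69>3.3.CO;2-3
```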

cjlee112 commented 10 years ago

We're going to have to carefully test our recovery of DOIs from URLs. When a user pastes a DOI right after "http://dx.doi.org/" and submits that as part of a G+ post, a WordPress post, or whatever, can we be assured that the post content will give us a urlencoded string that's guaranteed to back-transform to the original DOI string? Hopefully yes, but we should write a lot of unit tests that verify this, e.g. against all the hard DOIs in the SO post.
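
A starting point for those tests, assuming the back-transform is the quote/unquote pair from urllib (the test data are hard cases from this thread and the SO post; the test class is ours, not spnet's):

```python
import unittest
from urllib.parse import quote, unquote

HARD_DOIS = [
    '10.1002/(SICI)1099-1506(199603/04)3:2<173::AID-NLA69>3.3.CO;2-3',
    '10.2168/LMCS-6(4:3)2010',
    '10.1093/acprof:oso/9780199534920.003.0012',
    '10.1023/B:ALLO.0000048828.44523.94',
]

class TestDoiRoundTrip(unittest.TestCase):
    def test_urlencode_round_trip(self):
        for doi in HARD_DOIS:
            url = 'http://dx.doi.org/' + quote(doi, safe='/')
            # Recover the DOI from the URL path and compare:
            self.assertEqual(unquote(url[len('http://dx.doi.org/'):]), doi)

if __name__ == '__main__':
    unittest.main()
```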

cjlee112 commented 10 years ago

Ouch. I just checked what happens when a user tries to paste a DOI after http://dx.doi.org/: G+ screws it up. Specifically, for a DOI that contains a <, G+ makes a link, but only for the part up to the <. Gack -- I would have assumed they'd apply urlencoding! So users actually won't be able to paste DOIs into G+ posts the way we were hoping. Sigh. Here's the example: https://plus.google.com/b/111368023899233259117/111368023899233259117

semorrison commented 10 years ago

Many Springer journals, e.g. "Journal of Algebraic Combinatorics", "The Ramanujan Journal", "Algebra Logika", "Sibirsk. Mat. Zh.", and "Geometriae Dedicata", use DOIs with a colon, e.g.

http://dx.doi.org/10.1023/B:ALLO.0000048828.44523.94
http://dx.doi.org/10.1023/B:SIMJ.0000048923.81718.a5
http://dx.doi.org/10.1023/B:GEOM.0000049122.75284.06
http://dx.doi.org/10.1023/A:1022433314190

The "Journal of the London Mathematics Society", and some others from OUP, also, e.g.

http://dx.doi.org/10.1093/acprof:oso/9780199534920.003.0012

"Logical Methods in Computer Science" uses parentheses and colons:

http://dx.doi.org/10.2168/LMCS-6(4:3)2010

"Mathematical Modelling of Natural Phenomena":

http://dx.doi.org/10.1051/mmnp:2008041

Many Wiley journals (sorry, I'd misremembered this as World Scientific earlier), including "Communications on Pure and Applied Mathematics", "Numerical Linear Algebra with Applications", "Journal of Graph Theory", and "Journal of Combinatorial Designs", used to use DOIs with all sorts of characters: parentheses, colons, semicolons, and hashes, e.g.

http://dx.doi.org/10.1002/(SICI)1099-1506(199603/04)3:2<173::AID-NLA69>3.3.CO;2-3
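
Running these against the whitespace/angle-bracket termination rule sketched above shows the trade-off concretely (illustrative code again, not spnet's parser):

```python
import re

DOI_PATTERN = re.compile(r'\b(10\.\d{4,9}/[^\s<>]+)')

for doi in ['10.1023/B:ALLO.0000048828.44523.94',
            '10.2168/LMCS-6(4:3)2010',
            '10.1051/mmnp:2008041',
            '10.1002/(SICI)1099-1506(199603/04)3:2<173::AID-NLA69>3.3.CO;2-3']:
    match = DOI_PATTERN.search(doi)
    print(match.group(1) == doi, match.group(1))

# The colon/parenthesis DOIs survive intact; only the Wiley SICI DOI is
# truncated at '<', losing the '::AID-...' tail.
```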
