cjlee112 / spnet

selected papers network web engine
http://thinking.bioinformatics.ucla.edu/2011/07/02/open-peer-review-by-a-selected-papers-network/
GNU General Public License v2.0

DOI parsing fails when DOI is not followed by whitespace #98

ketch opened this issue 10 years ago (status: open)

ketch commented 10 years ago

The regex for DOI references can fail if the DOI is not followed by whitespace. This happens, for instance, when the DOI sits at the end of an HTML line, so that it is immediately followed by a tag such as `<br/>` rather than by whitespace.

Currently, this doesn't break anything; the error must be getting corrected somewhere in the DOI lookup.
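
A minimal sketch of the failure mode (the pattern below is a hypothetical reconstruction of the reported behavior, not spnet's actual regex):

```python
import re

# Hypothetical pattern that anchors the end of a DOI on whitespace:
NAIVE_DOI = re.compile(r'doi:(\S+)\s')

print(NAIVE_DOI.findall('see doi:10.1038/nphys1170 for details'))
# ['10.1038/nphys1170']  -- trailing whitespace present, match works

print(NAIVE_DOI.findall('see doi:10.1038/nphys1170'))
# []  -- DOI at the end of the string: no trailing whitespace, no match

print(NAIVE_DOI.findall('see doi:10.1038/nphys1170<br/> and more'))
# ['10.1038/nphys1170<br/>']  -- the <br/> tag is swallowed into the DOI
```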

cjlee112 commented 10 years ago

It seems to me that DOI recognition is turning into a bit of a nightmare, because it consists of TWO incompatible standards: the permissive pre-2009 syntax, in which a DOI suffix may contain almost any printable character, and the restricted post-2009 syntax.

The problem is that when reading a string that starts doi:..., there's no way to know whether it's going to be a pre- or post-2009 DOI, so no single parsing rule is safe for both.

Conundrum 1: the only real test of whether a candidate string is actually a DOI is to try to retrieve the paper for it (more on this below).
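
For concreteness, a hedged sketch of the two regimes as regex character classes; the restricted set below is one reading of the post-2009 CrossRef recommendation (an assumption, not something spnet currently encodes):

```python
import re

# Post-2009 suffixes (assumed restricted set: letters, digits, -._;()/:)
# end wherever the character class ends, so parsing is easy:
POST_2009 = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

# Pre-2009 suffixes may contain nearly any printable character (<, >, #,
# parentheses...), so no character class can tell where the DOI stops:
PRE_2009 = re.compile(r'\b10\.\d{4,9}/\S+')  # inevitably over-matches
```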

One more thought: we need a way to report to users that spnet hit a problem while trying to index their post. Right now people get frustrated because their post vanishes into a black hole (it never gets indexed) and they have no way to know why. Unfortunately we have no direct way to communicate with them: G+ gives us no mechanism for contacting a user, and we can't even send them email (though we could check whether G+ exposes a user's email address). Indeed, we have no way even to know whether they'll ever log in to our site.

cjlee112 commented 10 years ago

Continuing the thought on how to communicate with G+ users: I guess we could automatically generate a list of "problem reports", then paste those as comments on the original G+ posts that had a problem.

cjlee112 commented 10 years ago

@ketch Another thought on conundrum 1: this creates a strong coupling between (A) actually retrieving a paper by its DOI and (B) deciding that a particular string in the post is actually a DOI. The problem is that the only test of B is to do A. This has implications for the indexing design. For example, I liked your design change separating paper-ID recognition from actually getting the paper; this conundrum threatens that design change, though... unless we can find a bulletproof way of extracting DOIs from text.
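
One hedged way to make A serve as the test for B: extract generous candidate strings, then accept only those that dx.doi.org actually resolves. The helper below is hypothetical, not spnet code:

```python
import requests

def is_real_doi(candidate):
    # dx.doi.org answers a HEAD request with a 3xx redirect for a
    # registered DOI and 404 for an unknown one, so resolution
    # doubles as validation.
    url = 'https://dx.doi.org/' + requests.utils.quote(candidate, safe='/')
    resp = requests.head(url, allow_redirects=False, timeout=10)
    return 300 <= resp.status_code < 400
```

With a verifier like this, the recognizer can afford to over-match: pull out generous candidates and let resolution decide, which keeps recognition and retrieval separate in the code while respecting the coupling.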

cjlee112 commented 10 years ago

Seems like we have to just impose a limit on pre-2009 DOIs, e.g. no whitespace or HTML tags, like you did. This seems reasonable in the context of text documents (and HTML in particular).
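
A sketch of that limit as a regex, terminating the DOI at whitespace or an angle bracket (illustrative only, not the actual spnet pattern):

```python
import re

# Assumed rule: "10.", a 4-9 digit registrant prefix, "/", then anything
# up to whitespace or an HTML tag delimiter:
DOI_PATTERN = re.compile(r'\b(10\.\d{4,9}/[^\s<>]+)')

print(DOI_PATTERN.findall('nice result: doi:10.1016/j.cell.2013.09.034<br/>'))
# ['10.1016/j.cell.2013.09.034']
```

The known cost is that genuine pre-2009 DOIs containing < or > get truncated, which is where the next comments pick up.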

semorrison commented 10 years ago

I have a list of all DOIs in (the top 100 or so) math journals. I'll go look for weird characters. My memory is that only World Scientific had <. My suggestion would just be to fail to parse those, and declare it a feature, not a bug.


cjlee112 commented 10 years ago

Hmm, @ketch linked to an SO discussion reporting that a variety of DOIs (e.g. Wiley's) use < and >: http://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page

cjlee112 commented 10 years ago

Now I'm wondering what happens if a user types a < or > sign in a G+ post, tweet, etc. Clearly G+ has to re-encode the < or > so it won't muck with the HTML display of the post. So right out of the gate we know that DOIs containing < or > are going to get stomped on by G+, Twitter, etc. (i.e., they will no longer match the string the user pasted into the post). On the one hand this is good news: we can terminate on < and >. On the other hand it's bad news: we can only handle these DOIs reliably if we can find a universal de-munging function that will recover the original DOI string from whatever changes G+, Twitter, and the rest of the universe applied.
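
For the entity-encoding half, a minimal de-munging step could be plain HTML-unescaping, here using Python 3's standard library; whether that covers everything G+ and Twitter actually do to a post is exactly what would need testing:

```python
import html

# A DOI as it might come back entity-encoded from an HTML post body:
munged = '10.1002/(SICI)1099-1506(199603/04)3:2&lt;173::AID-NLA69&gt;3.3.CO;2-3'

print(html.unescape(munged))
# 10.1002/(SICI)1099-1506(199603/04)3:2<173::AID-NLA69>3.3.CO;2-3
```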

cjlee112 commented 10 years ago

We're going to have to carefully test our recovery of DOIs from URLs. When a user pastes a DOI right after "http://dx.doi.org/" and submits that as part of a G+ post, a WordPress post, or whatever, can we be assured that the post content will give us a urlencoded string that's guaranteed to back-transform to the original DOI string? Hopefully yes, but we should write a lot of unit tests that verify this, e.g. against all the hard DOIs in the SO post.
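
A starting point for those tests, assuming the back-transform is the quote/unquote pair from urllib (the test data are hard cases from this thread and the SO post; the test class is ours, not spnet's):

```python
import unittest
from urllib.parse import quote, unquote

HARD_DOIS = [
    '10.1002/(SICI)1099-1506(199603/04)3:2<173::AID-NLA69>3.3.CO;2-3',
    '10.2168/LMCS-6(4:3)2010',
    '10.1093/acprof:oso/9780199534920.003.0012',
    '10.1023/B:ALLO.0000048828.44523.94',
]

class TestDoiRoundTrip(unittest.TestCase):
    def test_urlencode_round_trip(self):
        for doi in HARD_DOIS:
            url = 'http://dx.doi.org/' + quote(doi, safe='/')
            # Recover the DOI from the URL path and compare:
            self.assertEqual(unquote(url[len('http://dx.doi.org/'):]), doi)

if __name__ == '__main__':
    unittest.main()
```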

cjlee112 commented 10 years ago

Ouch. I just checked what happens when a user tries to paste a DOI after http://dx.doi.org/: G+ screws it up. Specifically, for a DOI that contains a <, G+ makes a link, but only for the part up to the <. Gack -- I would have assumed they'd apply urlencoding! So users actually won't be able to paste DOIs into G+ posts the way we were hoping. Sigh. Here's the example: https://plus.google.com/b/111368023899233259117/111368023899233259117

semorrison commented 10 years ago

Many Springer journals, e.g. "Journal of Algebraic Combinatorics", "The Ramanujan Journal", "Algebra Logika", "Sibirsk. Mat. Zh.", and "Geometriae Dedicata", use DOIs with a colon, e.g.

http://dx.doi.org/10.1023/B:ALLO.0000048828.44523.94
http://dx.doi.org/10.1023/B:SIMJ.0000048923.81718.a5
http://dx.doi.org/10.1023/B:GEOM.0000049122.75284.06
http://dx.doi.org/10.1023/A:1022433314190

The "Journal of the London Mathematics Society", and some others from OUP, also, e.g.

http://dx.doi.org/10.1093/acprof:oso/9780199534920.003.0012

"Logical Methods in Computer Science" uses parentheses and colons:

http://dx.doi.org/10.2168/LMCS-6(4:3)2010

"Mathematical Modelling of Natural Phenomena":

http://dx.doi.org/10.1051/mmnp:2008041

Many Wiley journals (sorry, I'd misremembered this as World Scientific earlier), including "Communications on Pure and Applied Mathematics", "Numerical Linear Algebra with Applications", "Journal of Graph Theory", and "Journal of Combinatorial Designs", used to use DOIs with all sorts of characters: parentheses, colons, semicolons, and hashes, e.g.

http://dx.doi.org/10.1002/(SICI)1099-1506(199603/04)3:2<173::AID-NLA69>3.3.CO;2-3
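
Running these against the whitespace/angle-bracket termination rule sketched above shows the trade-off concretely (illustrative code again, not spnet's parser):

```python
import re

DOI_PATTERN = re.compile(r'\b(10\.\d{4,9}/[^\s<>]+)')

for doi in ['10.1023/B:ALLO.0000048828.44523.94',
            '10.2168/LMCS-6(4:3)2010',
            '10.1051/mmnp:2008041',
            '10.1002/(SICI)1099-1506(199603/04)3:2<173::AID-NLA69>3.3.CO;2-3']:
    match = DOI_PATTERN.search(doi)
    print(match.group(1) == doi, match.group(1))

# The colon/parenthesis DOIs survive intact; only the Wiley SICI DOI is
# truncated at '<', losing the '::AID-...' tail.
```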
