cjlee112 / spnet

selected papers network web engine
http://thinking.bioinformatics.ucla.edu/2011/07/02/open-peer-review-by-a-selected-papers-network/
GNU General Public License v2.0

More flexible paper ID #23

Open fgdorais opened 11 years ago

fgdorais commented 11 years ago

It's much easier to cut and paste a full paper URL than to manually format a proper ID. I would like to be able to use #spnetwork http://arxiv.org/abs/1234.6789 (and variants) as an equivalent for the proper #spnetwork arXiv:1234.6789.

cjlee112 commented 11 years ago

Sadly, this is one of those good intentions that paves the way to hell. When you set a clear standard, everybody at least knows what to do; better still, when you use a pre-existing standard, everybody already knows what to do. But if we drop that standard and instead say "we try to recognize whatever URLs people stick in", trouble starts. Your innocuous little phrase "(and variants)" in practice means that people will expect us to read their minds and will be mad when we guess wrong (which will happen very often). Consider just a few of the infinite scenarios:

cjlee112 commented 11 years ago

OK, I have a "modest proposal": the only URL that we will automatically recognize is the selectedpapers.net URL for the paper (i.e. https://selectedpapers.net/arxiv/1234.5678). That's the only way I can think of to eliminate the slippery slope of "(and variants)" that will inevitably occur if we allow any other URL.

Would you consider that an improvement, worth implementing?

fgdorais commented 11 years ago

That's not what I meant: the variants are http://arxiv.org/abs/1234.6789, http://arxiv.org/pdf/1234.6789v2.pdf, http://arxiv.org/ps/1234.6789v2,...

fgdorais commented 11 years ago

[Sorry, comments are out of order: this is a reply to the second cjlee112 comment; the above is a reply to the first cjlee112 comment]

Sure, you should recognize the selectedpapers.net URLs, that's natural. The point is that (as I did on my first attempt) I wrote a G+ post linking to http://arxiv.org/abs/1234.6789 and then added #spnetwork in front. That was the natural thing for me to do, but it didn't work. I wouldn't expect you to parse a URL that doesn't immediately follow #spnetwork, and I wouldn't expect you to parse a URL other than the standard arxiv.org and dx.doi.org URL formats.
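As an illustration, recognizing just those two canonical URL forms and converting them to proper IDs could be a couple of regexes. This is a sketch, not spnet's actual parser; the function name and the "arXiv:"/"DOI:" output formats are assumptions for illustration:

```python
import re

# Hypothetical sketch: map the two canonical URL forms to proper paper IDs.
ARXIV_URL = re.compile(r'https?://arxiv\.org/abs/(?P<id>\d{4}\.\d{4,5}(?:v\d+)?)')
DOI_URL = re.compile(r'https?://dx\.doi\.org/(?P<doi>\S+)')

def extract_paper_id(text):
    """Return 'arXiv:...' or 'DOI:...' for the first recognized URL, else None."""
    m = ARXIV_URL.search(text)
    if m:
        return 'arXiv:' + m.group('id')
    m = DOI_URL.search(text)
    if m:
        return 'DOI:' + m.group('doi')
    return None
```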

cjlee112 commented 11 years ago

People are going to copy what they see other people doing (and they aren't going to read the manual). If they see someone else stick in a URL, they'll think they can stick in any URL, and then they'll be mad when it doesn't work. So allowing URLs in the standard #spnetwork tagging is sure to become a "slippery slope" that leads to user errors, user anger, and pain all around. I know there are users who would employ it correctly, but the slippery slope will kill us with other users.

The one place where I can imagine parsing arXiv URLs is from old blog content:

Maybe we should create a new issue for "automatic indexing of blog archives". And meanwhile close this issue.

fgdorais commented 11 years ago

I'm not convinced by the "slippery slope" argument, but I think the right course of action will become clear after the posting interface is improved (#44).

fgdorais commented 10 years ago

I just want to resurrect this one with an example here:

https://plus.google.com/110765980098077923527/posts/2HUgnh5owJh

I still think the full arxiv.org URLs should be supported (and so should dx.doi.org URLs). The main reason is that using the full canonical URL is the natural thing to do.

(This is tangentially related to #93.)

semorrison commented 10 years ago

I'm with François. I don't really see the slippery slope argument here. There's a "central area" of plausible identifier schemes (e.g. everything Andrew Stacey tried in the post linked above), and then a long tail of stupid things. On MathOverflow, we just grepped and grepped and grepped our database looking for more variants of arXiv identifiers until we got bored looking for more variations, and I don't really regret it. People are not going to learn solely from seeing others, but will often do whatever is easiest. What is easiest is invariably copying and pasting a URL from another tab. That means accepting http://arxiv.org/abs/ and http://arxiv.org/pdf/ URLs is pretty important.

cjlee112 commented 10 years ago

@fgdorais @semorrison I hear you. A couple thoughts:

So what are you proposing? That we give different users different guidelines? arXiv users: don't bother telling us the official arXiv ID, just paste a URL... Everybody else: you must supply a DOI in the format DOI:{DOI} or a pubmed ID in PMID:{PMID}?

Another possibility: I could imagine pursuing two distinct strategies at the same time:

Any thoughts on this?

cjlee112 commented 10 years ago

@fgdorais RE: "the standard arxiv.org and dx.doi.org url formats". I hate to disappoint you, but there's a world of difference between arxiv.org URLs (manageable) vs. DOIs (ugly mess). This is one of those places where a seemingly simple idea breaks down in practice due to technical details in the implementation. Example: the DOI format has almost no rules (each DOI issuer can make up whatever mad rules they want); in particular, DOIs frequently contain characters that are illegal in URLs. Hence, strictly speaking a dx.doi.org URL for such a DOI cannot use the DOI verbatim. Instead it has to be transformed to a "URL encoded" form that replaces those characters with codes. Valid DOIs can themselves contain those codes... So when you see such a code in a dx.doi.org URL, you should probably reverse transform it... but on the other hand maybe you shouldn't? $*($%! The whole idea of a unique ID is that it "just works". This does not "just work".

fgdorais commented 10 years ago

@cjlee112 I don't understand the problem. We should do exactly what the dx.doi.org servers do. I don't see what they would do other than URL decoding everything to get the intended DOI.

marcharper commented 10 years ago

I don't see why we shouldn't try to understand user input as best we can, even if it's not in the preferred syntax. Roughly in order of complexity, for any post with #spnetwork (and possibly other tags) lacking an ID reference, we can in principle:

1) Look for arXiv URLs specifically; if there is only one, assume that the post is about that paper, and use a regex to extract the ID.

2) Look for other special URLs, e.g. shortDOI refs or pubmed links, follow the URL, and look for a unique DOI on that page (again with a few regexes, e.g. maybe as simple as "doi:.*?\s").

3) If there are any other URLs and we've yet to find an identifier, follow them (up to some reasonable limit) and look for DOIs. If we find a unique DOI reference, it's probably the paper. Such URLs could come from URL shorteners (common on Twitter, of course) or from uncommon journals where we don't know where to look for the DOI, etc.
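Step 1 of the list above might look something like this sketch (the function name is hypothetical, and the regex covers only the common modern-ID /abs/ and /pdf/ URL forms):

```python
import re

# Matches /abs/ and /pdf/ URLs for modern-style arXiv IDs, e.g. 1234.6789v2.
ARXIV_RE = re.compile(r'https?://arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5}(?:v\d+)?)')

def infer_arxiv_id(post_text):
    """Return the arXiv ID if the post cites exactly one arXiv paper, else None."""
    ids = set(ARXIV_RE.findall(post_text))
    if len(ids) == 1:
        return ids.pop()
    return None  # zero or ambiguous: fall through to the later steps
```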

From there, we can flag the submission as having an inferred / deduced identifier, just in case we occasionally pick up a bad ID, and so the user knows that they are not using the preferred syntax, and others can report it if mis-indexed. We could also find the "right" URL from the post itself -- some users seem to be linking specifically to the paper in G+ (I assume we get this from the post).

This would seem to cover a huge proportion of the current use cases, and users could simply write

#spnetwork #recommend #other_grammatical_tags URL

or put the URL elsewhere in the post, without needing to know the preferred syntax for paper identification.

Is there a substantial downside?

semorrison commented 10 years ago

The issue of illegal characters in DOIs is unfortunate, but not actually that bad in practice. I've done various bits of automatic processing over a very large sample of mathematics paper DOIs, and the only really disastrous case was World Scientific sometimes using < and > in their DOIs. From memory, it was only a small (older, less interesting) subset of World Scientific articles anyway.

Insofar as mathematics is concerned, I think we can pretend things just work.

Scott


cjlee112 commented 10 years ago

@semorrison @fgdorais @marcharper If you didn't believe in the "slippery slope", just look at how fast the proposed goal is being expanded with each person who joins the conversation. Now we're supposed to reverse-engineer DOIs from random URLs, shortened URLs etc.? That means everything that looks like a URL we're supposed to follow and screen-scrape, scan the resulting HTML (robots.txt violations, anybody?), and if some part of this process fails to divine what the user actually wanted, then it's considered a selectedpapers.net bug (user concludes "selectedpapers.net doesn't work")?

It doesn't make sense to promulgate a policy that doesn't work, i.e. user follows the policy, but spnet still can't figure out what paper he wanted to cite. If we did that, people would have every justification to be mad at us. We should only issue a policy if we know that it's technically implementable, i.e. "user followed the policy" = "spnet correctly indexed the paper citation". And then we should fix any errors where our implementation doesn't follow the policy perfectly.

On the other hand, I suspect we need a secondary, "search engine indexing" process whose goal is simply "do your best to recognize paper URLs" perhaps using the kind of tricks Marc described. I think such an indexing effort is probably unavoidable, because most people are unaware of selectedpapers.net. As I outlined in my previous comment, such indexing could be displayed on selectedpapers.net in a truncated "search engine results" format (i.e. not full-text). However, because such a process is highly imperfect, we must not recommend it to users as "here's how to write your posts" for full-text display on selectedpapers.net.

semorrison commented 10 years ago

You make a good point about the slippery slope! Pretty obviously, screen-scraping (and even resolving short URLs) is out of the question.

I'm still strongly in favour of recognizing more rather than less. The last couple of #spnetwork tags I've seen in my Google+ stream have basically turned into "How, exactly, am I meant to do this?" threads.

For the arxiv, I propose we recognize precisely these:

An arXiv identifier is one of the following four forms:

[0-9]{4}\.[0-9]{4,5}
[0-9]{4}\.[0-9]{4,5}v[0-9]+
[A-Za-z.-]+/[0-9]{7}
[A-Za-z.-]+/[0-9]{7}v[0-9]+

An arXiv reference is one of the following four forms, where {id} is an identifier as above:

arxiv:{id}
arXiv:{id}
http://arxiv.org/abs/{id}
http://arxiv.org/pdf/{id}
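As a sketch, those forms could be compiled into a single Python pattern. This is my reading of the proposal (modern IDs as YYMM.NNNN, old-style IDs as archive/NNNNNNN), not committed spnet code:

```python
import re

# Old-style ID: archive (possibly with subject class) / 7 digits, e.g. gr-qc/9401010
OLD_ID = r'[A-Za-z.-]+/[0-9]{7}'
# New-style ID: YYMM.NNNN, e.g. 1234.6789
NEW_ID = r'[0-9]{4}\.[0-9]{4,5}'
ID = r'(?:%s|%s)(?:v[0-9]+)?' % (NEW_ID, OLD_ID)

# The four proposed prefix/URL forms, anchored on the identifier.
ARXIV_REF = re.compile(
    r'(?:arxiv:|arXiv:|https?://arxiv\.org/abs/|https?://arxiv\.org/pdf/)(%s)' % ID)

def find_arxiv_ids(text):
    """Return all arXiv IDs cited in one of the four proposed forms."""
    return ARXIV_REF.findall(text)
```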


cjlee112 commented 10 years ago

@fgdorais "We should do exactly what the dx.doi.org servers do"??? dx.doi.org and selectedpapers.net indexing have completely different goals:

  • dx.doi.org simply forwards the user to some random URL, whatever the journal has provided to crossref.org as the destination page for that DOI.
  • our indexing is trying to extract a paper ID (DOI) from a piece of text. Note in particular that the URL handed back by dx.doi.org is not a paper ID. The whole reason the DOI system exists is that these destination URLs are not designed to be paper identifiers. We are far better off sticking with the string sent to dx.doi.org (a quasi-DOI, perhaps roughed up a bit by urlencoding) than with the URL it sends back.

semorrison commented 10 years ago

I suspect @fgdorais' intent was "accept anything the dx.doi.org servers would accept". I'm not exactly sure what that would mean in practice.

Corresponding to my suggestion for handling arXiv URLs, I propose we recognize exactly three URL schemes for DOIs:

doi:.*
DOI:.*
http://dx.doi.org/.*
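A sketch of those three proposed DOI forms, assuming (per the discussion of dx.doi.org's behavior) that the URL variant should be percent-decoded back to a DOI; the function name is hypothetical:

```python
import re
from urllib.parse import unquote  # urllib.unquote in the Python 2 of this era

DOI_REF = re.compile(r'(?:doi:|DOI:)(\S+)')
DOI_URL = re.compile(r'https?://dx\.doi\.org/(\S+)')

def find_doi(text):
    """Return the first DOI cited in one of the three proposed forms, else None."""
    m = DOI_REF.search(text)
    if m:
        return m.group(1)
    m = DOI_URL.search(text)
    if m:
        return unquote(m.group(1))  # undo urlencoding in the URL path
    return None
```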

cjlee112 commented 10 years ago

@semorrison Can you elaborate a bit on those "How, exactly, am I meant to do this?" threads, please... I'm under the impression that the major cause of trouble is Google+ simply refusing to index #spnetwork hashtags for about a third of posts (i.e. searching on that hashtag even in the Google+ web interface will fail to return those posts, same as the G+ search API fails to return those posts). If you click "Get updates" on that person's page on selectedpapers.net, it will successfully index those posts (because this doesn't use a #spnetwork hashtag search). And the right advice to such users is to go click their "Get updates" button.

Are you saying there's some other problem, specifically with people being unsure how to paste in an arXiv ID?

semorrison commented 10 years ago

Here are the two instances I had in mind:

https://plus.google.com/110765980098077923527/posts/2HUgnh5owJh
https://plus.google.com/109098098298652828653/posts/ZkqF2ZXWCMw

(I forget the exact edit history on the second one, I can ask him what he tried first if you like.)


cjlee112 commented 10 years ago

Ah, I see, he just assumed the URL was the identifier...

fgdorais commented 10 years ago

@cjlee112 Sorry, that wasn't very clear. My understanding of how the dx.doi.org servers work is as follows. They get an HTTP request which consists of a string. This string may have been urlencoded to meet the requirements of the protocol. In fact, it could even be unnecessarily urlencoded, as in the following:

http://dx.doi.org/%31%30%2e%31%30%30%37%2f%73%30%30%31%35%33%2d%30%31%32%2d%30%32%39%37%2d%34

They urldecode the string, which should then match a registered DOI; otherwise you hit "DOI not found". (That's a user error, so it's not our problem.)

As far as I can tell, urldecoding the string in a dx.doi.org URL always gives a valid DOI, because that's exactly what the dx.doi.org servers do to process a request.

cjlee112 commented 10 years ago

hmm, this DOI de-urlencoding business scares me; can anybody point to Python code that does de-urlencoding in an officially mandated way? (so we can at least say "Not Our Bug" if there turn out to be issues with DOI munging)...

fgdorais commented 10 years ago

The official documentation is RFC 3986 and errata, if that helps.

cjlee112 commented 10 years ago

If someone wants to implement this based on well-tested, well-documented library code for de-urlencoding, we'd certainly consider merging it to master.

fgdorais commented 10 years ago

The standard library function urllib.unquote has the desired functionality.
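For instance, applied to the percent-encoded dx.doi.org URL from the earlier comment (urllib.parse.unquote is the Python 3 spelling of the same function):

```python
from urllib.parse import unquote  # Python 2 of the era: urllib.unquote

url = ('http://dx.doi.org/%31%30%2e%31%30%30%37%2f%73'
       '%30%30%31%35%33%2d%30%31%32%2d%30%32%39%37%2d%34')
# Strip the resolver prefix, then percent-decode the path to recover the DOI.
doi = unquote(url.split('http://dx.doi.org/', 1)[1])
print(doi)  # 10.1007/s00153-012-0297-4
```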

ketch commented 10 years ago

I went to try implementing this, but I'm a bit confused. In spnet/incoming.py, lines 41-61, a wide range of patterns (including full arXiv urls) are recognized. Am I looking in the wrong place?

cjlee112 commented 10 years ago

Hi David, I just added that code but haven't deployed it to the production website yet. I also wasn't quite sure whether it really covers all the cases. Perhaps you and Scott can comment...

Yours,

Chris

ketch commented 10 years ago

I guess the "truly old school" URLs mentioned in #71 are still missing:

http://xxx.lanl.gov/abs/gr-qc/9401010

@cjlee112 I suggest that when anyone commits code related to an issue, they mention the issue in the commit message. That will avoid this kind of confusion, since the commit will then show up in the discussion thread here.

semorrison commented 10 years ago

I think we should not bother with xxx.lanl.gov URLs. There are quite a few different arXiv mirrors out there in the wild; if someone is really keen on them, I can find a list of regexes we used over at MathOverflow to try to catch more, but I think it is sufficiently diminishing returns that we should stop at just the 'standard' http://arxiv.org/abs/ and http://arxiv.org/pdf/ URLs.

semorrison commented 10 years ago

I sent pull requests for minor bugs in the regexes for arxiv URLs. I'd be inclined not to use the [abspdf]{3} hack, and just copy-and-paste an extra regex.

cjlee112 commented 10 years ago

@ketch You're absolutely right. I try to do that, and will try to be more consistent about it, and also to add a comment in the issue tracker at the same time. In this case, I actually did mention this issue number in my commit, e.g. see

https://github.com/cjlee112/spnet/commits/master

ketch commented 10 years ago

@cjlee112 Ah, I do see where you mentioned it. The reason it did not show up in the tracker thread here is that you must put the pound sign in front of the number: #23. And thanks for fixing this already!

cjlee112 commented 10 years ago

@ketch I'll put a pound sign before issue numbers in the future. Thanks for pointing that out!

cjlee112 commented 10 years ago

Here's one case where we should recommend that users give their DOI reference as a dx.doi.org URL (or shortDOI): older DOIs can include any legal graphic Unicode character, hence there is no strictly correct way of determining where the DOI terminates. By contrast, the urlencoded DOI can only include characters allowed by RFC 1738, and hence is terminated by any non-allowed character.

Note this is not an issue for post-2009 DOIs issued by CrossRef, which they say are limited to the following characters: "a-z", "A-Z", "0-9" and "-._;()/", see http://www.crossref.org/02publishers/doi-guidelines.pdf
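For post-2009 CrossRef DOIs, the restricted character set means the termination point can be pinned down mechanically. A sketch (the function name is hypothetical; the character class encodes exactly the characters listed in the CrossRef guidelines above):

```python
import re

# Post-2009 CrossRef DOIs use only a-z, A-Z, 0-9 and "-._;()/" after the prefix,
# so the DOI ends at the first character outside that set.
CROSSREF_DOI = re.compile(r'10\.[0-9]+/[A-Za-z0-9._;()/-]+')

def clip_doi(text):
    """Return the first modern-charset DOI in text, clipped at any other character."""
    m = CROSSREF_DOI.search(text)
    return m.group(0) if m else None
```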

cjlee112 commented 10 years ago

@fgdorais @semorrison @ketch @pkra Here's a proposal for how people should cite DOIs, based in part on basic problems with the DOI format described in issue #98:

  • urge users to give a shortDOI, as this is guaranteed to work (see http://shortdoi.org for details); spnet uses shortDOIs internally for that reason.
  • dx.doi.org URLs are also allowed;
  • pasting DOI:... strings is allowed but not recommended, and in particular may fail for DOIs issued before 2009 (whose character set is incompatible with the modern DOI rules... give them a link for details if they want to know).

Please let me know your thoughts.

semorrison commented 10 years ago

Looks good to me.

It catches all the cases that will happen in practice (dx.doi.org URLs, and DOIs copied and pasted from somewhere else). In my experience the bad character sets of old DOIs are a vanishingly rare problem (in mathematics).

I'm not sure what "urging users" will consist of. I doubt that more than epsilon of #spnetwork users will ever read an FAQ or instructions; they'll just copy what they see someone else doing on their preferred social network.

cjlee112 commented 10 years ago

@semorrison "Urging users" might mean notifying a user that spnet hit an error when it tried to index his post, and giving helpful advice about how to resolve the problem a la new issue #101 . E.g. if it failed to resolve a putative DOI, it could suggest "please give a shortDOI instead".

cjlee112 commented 10 years ago

Unfortunately... trying to paste DOI URLs into a post doesn't seem to fix the problem as we hoped, at least not on G+. I just checked what happens when a user tries to paste a DOI after http://dx.doi.org/: G+ screws it up. Specifically, for a DOI that contains a <, G+ makes a link but only for the part up to the <. I expected them to apply urlencoding but instead they just turned it into an ugly mess, from which I strongly doubt we could recover the correct DOI. Here's the example: https://plus.google.com/b/111368023899233259117/111368023899233259117

For these ugly pre-2009 DOIs I'm afraid shortDOI is looking like the only reliable solution, at least that I can find.