cjlee112 / spnet

selected papers network web engine
http://thinking.bioinformatics.ucla.edu/2011/07/02/open-peer-review-by-a-selected-papers-network/
GNU General Public License v2.0
40 stars 11 forks

New hashtag parsing, for posts with multiple references #99

Closed ketch closed 10 years ago

ketch commented 10 years ago

This PR implements new routines for parsing hashtags in posts, with the following features:

Implicit in the PR is a language for referencing papers in SPnet posts. This language follows my understanding of what has been decided by the developers so far. Specifically:

As an example, the new code correctly parses this (from my post here):

#spnetwork #recommend doi:10.1007/BF01389633

#discuss doi:10.1147/rd.112.0215
#discuss doi:10.1007/BF01932030
#numericalAnalysis

Note that G+ renders this as HTML, so HTML tags appear in the raw string that must be processed. Also note that this post was not processed correctly by the current code; see https://selectedpapers.net/posts/z12xvr0qokfvwtbvd22sjl5ogtjyjz2c4 where the comment is assigned to the wrong paper (it is not assigned to the first reference following #spnetwork).
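For concreteness, here is a rough sketch of the problem (the raw markup is my guess at how G+ might serialize the post body, not a verbatim capture; the DOIs are the ones from the example above):

# Illustration only: the <br /> markup is an assumption about how G+
# serializes the post body; the DOIs are taken from the example above.
raw = ('#spnetwork #recommend doi:10.1007/BF01389633<br />'
       '#discuss doi:10.1147/rd.112.0215<br />'
       '#discuss doi:10.1007/BF01932030<br />'
       '#numericalAnalysis')

# Naive whitespace splitting produces tokens like 'doi:10.1007/BF01389633<br'
# and '/>#discuss', so it both mangles DOIs and misses hashtags; the HTML
# has to be handled before (or while) hashtags and references are extracted.
print([t for t in raw.split() if t.startswith('#') or 'doi:' in t])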

I think there should be a discussion of the syntax for tagging, so this PR should be viewed as a starting point for that conversation. Also, I touched a significant amount of code and this is my first major contribution, so I'd appreciate a close review by the experts. I did add some tests, including cases that fail under the old code.

ketch commented 10 years ago

@cjlee112 Thanks for the criticism.

ketch commented 10 years ago

Looks like there is a nice solution for a regex that matches DOIs here:

http://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page

In fact, the '<' symbol is discussed specifically in the comments on the first answer, and a solution is proposed. I think what we want is this:

'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'])\S)+)\b'

Indeed, this passes the tests in my branch.
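For reference, here is a minimal sketch of how that pattern could be used (the helper name find_dois and the sample string are mine, not code from the branch):

import re

# DOI pattern from the Stack Overflow answer linked above: '10.', a
# registrant code of 4+ digits, optional dotted sub-codes, a slash, then
# any run of non-space characters that are not '"', '&' or "'".
DOI_RE = re.compile(r'''\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&'])\S)+)\b''')

def find_dois(text):
    """Return all DOI strings found in a block of text."""
    return DOI_RE.findall(text)

print(find_dois('#spnetwork #recommend doi:10.1007/BF01389633'))
# prints: ['10.1007/BF01389633']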

cjlee112 commented 10 years ago

What about the case where the post is already in the db, but its content has changed? It looks like updating is now broken, since code that used to sit outside the "if post is None" test has been moved inside it.
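To spell out the concern with a toy example (a sketch with a made-up in-memory store, not the actual spnet code):

# Sketch only: 'posts' is a toy in-memory store standing in for the db.
posts = {}

def process_post(post_id, content):
    post = posts.get(post_id)
    if post is None:
        # First time we see this post: create the record.
        post = {'id': post_id}
        posts[post_id] = post
    # The parsing/update step has to run out here, outside the
    # 'if post is None' branch; if it only runs inside that branch,
    # a post that is already stored but whose content has since been
    # edited is never re-parsed.
    post['hashtags'] = [w for w in content.split() if w.startswith('#')]

process_post('post1', '#spnetwork doi:10.1007/BF01389633')
process_post('post1', '#spnetwork #recommend doi:10.1007/BF01389633')
print(posts['post1']['hashtags'])
# prints: ['#spnetwork', '#recommend'] -- the edit is picked up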

cjlee112 commented 10 years ago

I think we're better off analyzing the full content that we're given, rather than running it through dehtml(). For example, we're being pushed to recognize DOI URLs. For that we want to see the <A HREF="http://dx.doi.org/..."> HTML, because that href should be urlencoded and hence decodable in a standard way (to recover the original DOI). By contrast, the text version of the URL (assuming there is one!) has no clear standard for how it will have been munged. For example, if the DOI contains < or >, presumably G+ (or whoever is serving the post) will have encoded those (to prevent them mucking up its HTML), but how can we know all vendors will follow a universal standard for that munging (such that we can de-munge it in an ironclad way)?
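To illustrate the idea (a sketch only; the class name, markup, and example DOI with angle brackets are my own, not spnet code):

from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

class DoiLinkFinder(HTMLParser):
    """Collect DOIs from <a href="http://dx.doi.org/..."> anchors."""
    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href', '')
        parsed = urlparse(href)
        if parsed.netloc.endswith('dx.doi.org'):
            # The path is urlencoded in a standard way, so characters
            # such as < and > in the DOI can be recovered with unquote().
            self.dois.append(unquote(parsed.path.lstrip('/')))

finder = DoiLinkFinder()
finder.feed('<a href="http://dx.doi.org/10.1234/example%3Cfoo%3Ebar">paper</a>')
print(finder.dois)
# prints: ['10.1234/example<foo>bar']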

ketch commented 10 years ago

I've opened a new PR (#102) with a minimal set of changes to implement the new parsing. The DOI parsing issue can be tackled separately.