cjlee112 / spnet

selected papers network web engine
http://thinking.bioinformatics.ucla.edu/2011/07/02/open-peer-review-by-a-selected-papers-network/
GNU General Public License v2.0
40 stars 11 forks

New hashtag parsing, for posts with multiple references #99

Closed ketch closed 10 years ago

ketch commented 10 years ago

This PR implements new routines for parsing hashtags in posts, with the following features:

Implicit in the PR is a language for referencing papers in SPnet posts. This language follows my understanding of what has been decided by the developers so far. Specifically:

As an example, the new code correctly parses this (from my post here):

#spnetwork #recommend doi:10.1007/BF01389633

#discuss doi:10.1147/rd.112.0215
#discuss doi:10.1007/BF01932030
#numericalAnalysis

Note that G+ renders this as HTML, so HTML tags appear in the raw string that must be processed. Also note that this post was not processed correctly by the current code; see https://selectedpapers.net/posts/z12xvr0qokfvwtbvd22sjl5ogtjyjz2c4 where the comment is assigned to the wrong paper (it is not assigned to the first reference following #spnetwork).
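For concreteness, here is a rough sketch of the problem (the raw markup is my guess at how G+ might serialize the post body, not a verbatim capture; the DOIs are the ones from the example above):

# Illustration only: the <br /> markup is an assumption about how G+
# serializes the post body; the DOIs are taken from the example above.
raw = ('#spnetwork #recommend doi:10.1007/BF01389633<br />'
       '#discuss doi:10.1147/rd.112.0215<br />'
       '#discuss doi:10.1007/BF01932030<br />'
       '#numericalAnalysis')

# Naive whitespace splitting produces tokens like 'doi:10.1007/BF01389633<br'
# and '/>#discuss', so it both mangles DOIs and misses hashtags; the HTML
# has to be handled before (or while) hashtags and references are extracted.
print([t for t in raw.split() if t.startswith('#') or 'doi:' in t])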

I think there should be a discussion of the syntax for tagging, so this PR should be viewed as a starting point for that conversation. Also, I touched a significant amount of code and this is my first major contribution, so I'd appreciate a close review by the experts. I did add some tests, including cases that fail under the old code.

ketch commented 10 years ago

@cjlee112 Thanks for the criticism.

ketch commented 10 years ago

Looks like there is a nice solution for a regex that matches DOIs here:

http://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page

In fact, the '<' symbol is discussed specifically in the comments on the first answer, and a solution is proposed. I think what we want is this:

'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'])\S)+)\b'

Indeed, this passes the tests in my branch.
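For reference, here is a minimal sketch of how that pattern could be used (the helper name find_dois and the sample string are mine, not code from the branch):

import re

# DOI pattern from the Stack Overflow answer linked above: '10.', a
# registrant code of 4+ digits, optional dotted sub-codes, a slash, then
# any run of non-space characters that are not '"', '&' or "'".
DOI_RE = re.compile(r'''\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&'])\S)+)\b''')

def find_dois(text):
    """Return all DOI strings found in a block of text."""
    return DOI_RE.findall(text)

print(find_dois('#spnetwork #recommend doi:10.1007/BF01389633'))
# prints: ['10.1007/BF01389633']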

cjlee112 commented 10 years ago

What about the case where the post is already in the db, but its content has changed? It looks like updating is now broken, since code that used to sit outside the "if post is None" test has been moved inside it.
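To spell out the concern with a toy example (a sketch with a made-up in-memory store, not the actual spnet code):

# Sketch only: 'posts' is a toy in-memory store standing in for the db.
posts = {}

def process_post(post_id, content):
    post = posts.get(post_id)
    if post is None:
        # First time we see this post: create the record.
        post = {'id': post_id}
        posts[post_id] = post
    # The parsing/update step has to run out here, outside the
    # 'if post is None' branch; if it only runs inside that branch,
    # a post that is already stored but whose content has since been
    # edited is never re-parsed.
    post['hashtags'] = [w for w in content.split() if w.startswith('#')]

process_post('post1', '#spnetwork doi:10.1007/BF01389633')
process_post('post1', '#spnetwork #recommend doi:10.1007/BF01389633')
print(posts['post1']['hashtags'])
# prints: ['#spnetwork', '#recommend'] -- the edit is picked up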

cjlee112 commented 10 years ago

I think we're better off analyzing the full content that we're given, rather than running it through dehtml(). For example, we're being pushed to recognize DOI URLs. For that we want to see the <A HREF="http://dx.doi.org/..."> HTML, because that href should be urlencoded and hence decodable in a standard way (to recover the original DOI). By contrast, the text version of the URL (assuming there is one!) has no clear standard for how it will have been munged. For example, if the DOI contains < or >, presumably G+ (or whoever is serving the post) will have encoded those (to prevent them mucking up its HTML), but how can we know all vendors will follow a universal standard for that munging (such that we can de-munge it in an ironclad way)?
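To illustrate the idea (a sketch only; the class name, markup, and example DOI with angle brackets are my own, not spnet code):

from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

class DoiLinkFinder(HTMLParser):
    """Collect DOIs from <a href="http://dx.doi.org/..."> anchors."""
    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href', '')
        parsed = urlparse(href)
        if parsed.netloc.endswith('dx.doi.org'):
            # The path is urlencoded in a standard way, so characters
            # such as < and > in the DOI can be recovered with unquote().
            self.dois.append(unquote(parsed.path.lstrip('/')))

finder = DoiLinkFinder()
finder.feed('<a href="http://dx.doi.org/10.1234/example%3Cfoo%3Ebar">paper</a>')
print(finder.dois)
# prints: ['10.1234/example<foo>bar']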

ketch commented 10 years ago

I've opened a new PR (#102) with a minimal set of changes to implement the new parsing. The DOI parsing issue can be tackled separately.