cjlee112 / spnet

selected papers network web engine
http://thinking.bioinformatics.ucla.edu/2011/07/02/open-peer-review-by-a-selected-papers-network/
GNU General Public License v2.0
40 stars 11 forks source link

Implement new hashtag parsing for multiple citations. #102

Closed ketch closed 10 years ago

ketch commented 10 years ago

This PR does two things:

This also updates the relevant tests, without changing their meaning -- it just calls the new parsing function. All tests pass for me.

This is meant to replace most of https://github.com/cjlee112/spnet/pull/99, while sticking to a minimum set of changes that should be easier to review. It does not modify the regex strings used to match DOIs.

ketch commented 10 years ago

@cjlee112 I've updated the branch to reflect the convention that we discussed today. All tests still pass, and I'm only using '\n' as line breaks, so we should next write tests involving
and

and make it work for those. Any other "line breaks" we want to handle?

cjlee112 commented 10 years ago

@ketch You need to merge in my separate_parsing branch in order to see all the new tests I committed last week. You can run the tests via nosetests -v (for further details see my email of Oct. 15).

cjlee112 commented 10 years ago

@ketch: since you didn't yet merge in my separate_parsing commits of Oct. 15-19, just a heads-up on some minor issues:

ketch commented 10 years ago

@cjlee112 I've merged and now two tests are failing. One of them tests the text

this is text #spnetwork #recommend arXiv:1302.4871 PMID: 22291635 #cosmology

According to the rules above, this is a user error because there is a citation type tag on a line with two citations. How do you want to handle this? The test wants the citation type to be applied to the first reference after the citation type tag, I guess. That makes sense to me.

The other failing test is trying to update a paper and the line that fails is

assert core.Paper(paper1._id).posts == []

I guess this is checking to make sure that the updated post is not already in the database. I think this failure may not be related to my parsing function. Can you help me out?

One other note: while debugging, I made the parsing function a bit ugly. I will refactor it to be prettier once the tests are passing.

cjlee112 commented 10 years ago

@ketch When more than one citation is present on a line with a citationType, I think the citationType should apply to all the citations in that line. Otherwise we get into troublesome questions about which one to associate it with (by proximity? by order?)

The second test failure reflects a basic updating problem: the code isn't checking whether it needs to update the paper links as a result of the updated post text. The test catches that by seeing that the "old" paper still seems to be linked to (this) post, whereas it should actually have no posts linked to it.

cjlee112 commented 10 years ago

@ketch At this point all you need to do is apply citationType to all citations in the same line, and all tests should pass.

ketch commented 10 years ago

@cjlee112 I've updated the parsing function to reflect your suggestion. Now we just find the first citationType tag in each line and apply it to all citations in that line. Note that one test had an assertion that contradicted this behavior, so I changed the test.

I also wrote a bit of documentation for this, and removed the documentation of a bunch of other tags that would right now (under the current code or this PR) simply be interpreted as topics.

I would like to add #announce as a valid citationType, but will request that in a separate PR once this is merged.

cjlee112 commented 10 years ago

@ketch This code is now in production on selectedpapers.net. Note that I had to update the DOI regexp (switched it to post-2009 character rules) to prevent a crash on a post that included a DOI citation. I added a new test for that specific case. We can try to improve DOI handling in a separate issue.