Closed ketch closed 10 years ago
@cjlee112 I've updated the branch to reflect the convention that we discussed today. All tests still pass, and I'm only using '\n' as line breaks, so we should next write tests involving and
and make it work for those. Any other "line breaks" we want to handle?@ketch You need to merge in my separate_parsing
branch in order to see all the new tests I committed last week. You can run the tests via nosetests -v
(for further details see my email of Oct. 15).
@ketch: since you didn't yet merge in my separate_parsing
commits of Oct. 15-19, just a heads-up on some minor issues:
separate_parsing
soon, since various indexing initiatives (MathOverflow; blogs etc.) are waiting on that.@cjlee112 I've merged and now two tests are failing. One of them tests the text
this is text #spnetwork #recommend arXiv:1302.4871 PMID: 22291635 #cosmology
According to the rules above, this is a user error because there is a citation type tag on a line with two citations. How do you want to handle this? The test wants the citation type to be applied to the first reference after the citation type tag, I guess. That makes sense to me.
The other failing test is trying to update a paper and the line that fails is
assert core.Paper(paper1._id).posts == []
I guess this is checking to make sure that the updated post is not already in the database. I think this failure may not be related to my parsing function. Can you help me out?
One other note: while debugging, I made the parsing function a bit ugly. I will refactor it to be prettier once the tests are passing.
@ketch When more than one citation is present on a line with a citationType, I think the citationType should apply to all the citations in that line. Otherwise we get into troublesome questions about which one to associate it with (by proximity? by order?)
The second test failure reflects a basic updating problem: the code isn't checking whether it needs to update the paper links as a result of the updated post text. The test catches that by seeing that the "old" paper still seems to be linked to (this) post, whereas it should actually have no posts linked to it.
@ketch At this point all you need to do is apply citationType to all citations in the same line, and all tests should pass.
@cjlee112 I've updated the parsing function to reflect your suggestion. Now we just find the first citationType tag in each line and apply it to all citations in that line. Note that one test had an assertion that contradicted this behavior, so I changed the test.
I also wrote a bit of documentation for this, and removed the documentation of a bunch of other tags that would right now (under the current code or this PR) simply be interpreted as topics.
I would like to add #announce as a valid citationType, but will request that in a separate PR once this is merged.
@ketch This code is now in production on selectedpapers.net. Note that I had to update the DOI regexp (switched it to post-2009 character rules) to prevent a crash on a post that included a DOI citation. I added a new test for that specific case. We can try to improve DOI handling in a separate issue.
This PR does two things:
This also updates the relevant tests, without changing their meaning -- it just calls the new parsing function. All tests pass for me.
This is meant to replace most of https://github.com/cjlee112/spnet/pull/99, while sticking to a minimum set of changes that should be easier to review. It does not modify the regex strings used to match DOIs.