blekhmanlab / rxivist

API providing access to papers and authors scraped from biorxiv.org
https://rxivist.org
GNU Affero General Public License v3.0

Spider doesn't have to bail if it sees an unrecognized paper in category search #228

Closed rabdill closed 5 years ago

rabdill commented 5 years ago

During the step that searches the category lists to assign categories to preprints that have already been recorded, the crawler currently exits completely if it finds a paper in one of these lists that it doesn't already know about. This was done to make it more obvious that something fishy was going on: if it finds a bunch of known papers in a category and THEN an unrecognized paper, this is indeed fishy. However, if the first few papers in a category haven't been recorded yet, this isn't actually a problem, because they'll be picked up on the next run. We should allow unrecognized papers at the beginning of a category.
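
A rough sketch of the proposed rule, in Python. This is not the spider's actual code: `assign_categories`, `UnrecognizedPaperError`, `listing`, and `known_ids` are hypothetical names chosen for illustration, and the sketch assumes a category page can be reduced to an ordered list of paper IDs, newest first.

```python
class UnrecognizedPaperError(Exception):
    """Raised when an unknown paper shows up after known ones (hypothetical)."""


def assign_categories(listing, known_ids):
    """Return the known paper IDs in `listing` that should get this category.

    listing   -- paper IDs in the order the category page shows them
    known_ids -- set of paper IDs already recorded in the database
    """
    seen_known = False  # flips to True once the first recorded paper appears
    to_update = []
    for paper_id in listing:
        if paper_id in known_ids:
            seen_known = True
            to_update.append(paper_id)
        elif seen_known:
            # Unknown paper AFTER known ones: this is the fishy case, so bail.
            raise UnrecognizedPaperError(
                f"unrecognized paper {paper_id!r} after known papers"
            )
        # Otherwise the unknown paper is at the head of the list: skip it,
        # on the assumption that the next crawl will record it.
    return to_update
```

With this rule, a listing such as `["new1", "new2", "a", "b"]` with `known_ids = {"a", "b"}` yields `["a", "b"]`, while `["a", "new1", "b"]` still stops the run.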

rabdill commented 5 years ago

https://github.com/blekhmanlab/rxivist/commit/eb8fb6184679525372e5503c77d51ebf246a4fc8
https://github.com/blekhmanlab/rxivist/commit/ab1f4f49ab6edd6b1588d0d058223ad27cdf3a02