Spider doesn't have to bail if it sees an unrecognized paper in category search

blekhmanlab / rxivist

API providing access to papers and authors scraped from biorxiv.org

GNU Affero General Public License v3.0

59 stars 11 forks

During the step that searches the category lists to assign categories to preprints that have already been recorded, the crawler currently exits completely if it finds a paper in the list that it doesn't already know about. This was done to make it more obvious that something fishy was going on: if it finds a bunch of known papers in a category and THEN an unrecognized paper, that is indeed fishy. However, if the first few papers in a category haven't been recorded yet, this isn't actually a problem, because they'll be picked up on the next run. We should allow unrecognized papers at the beginning of a category.
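
A rough sketch of the proposed behavior, not the spider's actual code: every name here (`assign_categories`, `UnexpectedPaperError`, `listing`, `known_ids`) is hypothetical, and it assumes the category listing is walked in page order. Unrecognized papers are skipped until the first known paper is seen; after that, an unrecognized paper still stops the run.

```python
class UnexpectedPaperError(Exception):
    """Hypothetical error: an unrecognized paper appeared after known ones."""


def assign_categories(listing, known_ids, category):
    """Map each known paper in one category listing to that category.

    listing   -- paper IDs in the order the category page shows them (assumed)
    known_ids -- set of paper IDs that have already been recorded
    category  -- name of the category being searched
    """
    assignments = {}
    seen_known = False  # flips to True at the first recognized paper
    for paper_id in listing:
        if paper_id in known_ids:
            seen_known = True
            assignments[paper_id] = category
        elif seen_known:
            # Known papers followed by an unknown one: this is the fishy case.
            raise UnexpectedPaperError(
                f"unrecognized paper {paper_id!r} after known papers in {category!r}"
            )
        # Otherwise: an unrecognized paper at the start of the category.
        # Skip it; it should be picked up on the next run.
    return assignments
```

For example, `assign_categories(["new1", "new2", "a", "b"], {"a", "b"}, "genomics")` skips `new1` and `new2` and assigns the category to `a` and `b`, while a listing of `["a", "x", "b"]` with the same known set raises `UnexpectedPaperError`.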

Spider doesn't have to bail if it sees an unrecognized paper in category search #228