MaterialEyes / exsclaim

A toolkit for the automatic construction of self-labeled materials imaging datasets from scientific literature
GNU General Public License v3.0

Repeated running of pipeline on same query has unexpected behavior #1

Closed trevorspreadbury closed 4 years ago

trevorspreadbury commented 4 years ago

If you run the same query twice in a row, the second run reports no articles found, no captions found, and an error for each figure found by the first run. After the second run finishes, exsclaim.json is empty.

This might be fixed by storing information in a more sophisticated database (MongoDB or a SQL database).
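As a rough illustration of the database idea, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout and the helper names (`open_store`, `save_figures`, `load_figures`) are hypothetical and are not part of EXSCLAIM; the point is only that a rerun which finds nothing new writes nothing, so earlier results stay intact.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a results store. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS figures ("
        " figure_name TEXT PRIMARY KEY,"
        " query TEXT NOT NULL,"
        " data TEXT NOT NULL)"
    )
    return conn


def save_figures(conn, query, figures):
    """Insert or overwrite the given figures; rows from earlier runs are untouched."""
    rows = [(name, query, json.dumps(record)) for name, record in figures.items()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO figures (figure_name, query, data) "
            "VALUES (?, ?, ?)",
            rows,
        )


def load_figures(conn, query):
    """Return every stored figure for a query as {figure_name: record}."""
    cursor = conn.execute(
        "SELECT figure_name, data FROM figures WHERE query = ?", (query,)
    )
    return {name: json.loads(data) for name, data in cursor}
```

With this layout, a second run would call `save_figures` with an empty dict if nothing new was found, which leaves the table unchanged.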

A further enhancement: when the same query is rerun after the maximum_scraped limit has already been reached, each run could scrape maximum_scraped additional articles, without removing data from previous runs of course.
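The incremental behaviour could look something like the sketch below. It assumes that exsclaim.json is a single JSON object keyed by figure name and that articles can be identified by a string; both are assumptions for illustration. The helpers `select_new_articles` and `merge_into_json` are hypothetical names, not existing EXSCLAIM functions.

```python
import json
import os
import tempfile


def select_new_articles(candidates, already_scraped, maximum_scraped):
    """Pick up to maximum_scraped candidates that were not scraped before."""
    seen = set(already_scraped)
    fresh = []
    for article in candidates:
        if len(fresh) >= maximum_scraped:
            break
        if article in seen:
            continue
        seen.add(article)
        fresh.append(article)
    return fresh


def merge_into_json(path, new_records):
    """Merge new records into the JSON file at path, keeping earlier entries."""
    previous = {}
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r", encoding="utf-8") as f:
            previous = json.load(f)
    merged = {**previous, **new_records}
    # Write to a temporary file first so a failed run cannot empty the results.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2)
    os.replace(tmp_path, path)
    return merged
```

Because the merge starts from whatever is already on disk, a rerun that finds no new articles rewrites the same contents instead of emptying the file.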

re-query_bug