Improving code fault-tolerance

fliqper / GAP-Citations-Analyzator

GAP Citations project


Improving code fault-tolerance #12

Open olexandr-konovalov opened 3 years ago

olexandr-konovalov commented 3 years ago

What can possibly go wrong, and which situations should the code be able to handle:

- The MR number can lead to a page for a newly added entry whose article is still being processed, so the page has no citations at all yet.
  - It may be revisited later to check whether citations have appeared.
- There may be citations on MathSciNet, but none of them contains the GAP string.
  - These may be worth inspecting manually; hopefully there are not many. There may be GAP mentions in the review or in the full text of the paper. The database also included MR numbers for which people reported to us "We used GAP for this paper".
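
The two situations above can be told apart mechanically before anything is discarded. The following is only a minimal sketch, assuming the citations for one MR number arrive as a list of plain strings; the function name `classify_citations`, the status labels, and the regular expression are hypothetical and not taken from the project.

```python
import re

# Hypothetical status labels, not taken from the project's code.
NO_CITATIONS = "no_citations"    # page exists but lists no citations yet
NO_GAP_STRING = "no_gap_string"  # citations exist, but none mentions GAP
HAS_GAP = "has_gap"              # at least one citation mentions GAP

# Assumption: "GAP" is matched case-sensitively as a whole word, optionally
# followed by digits (e.g. "GAP4"), so "gap" in ordinary prose is ignored.
GAP_PATTERN = re.compile(r"\bGAP\d*\b")


def classify_citations(citations):
    """Return a status label for the citations scraped for one MR number."""
    if not citations:  # covers both None and an empty list
        return NO_CITATIONS
    if any(GAP_PATTERN.search(text) for text in citations):
        return HAS_GAP
    return NO_GAP_STRING
```

Entries labelled `no_citations` are candidates for a later re-scrape, while `no_gap_string` entries are the ones that may deserve a manual look.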

fliqper commented 3 years ago

OK, so in that case I take it that we will assume every input MR must have GAP citations. Therefore, I will store pages returning no results (for whatever reason) in a separate list, review_later, which can be used as input to the scraper after a period of time, or inspected manually if they still produce no citations after the second scrape.
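
A rough sketch of how such a `review_later` list might be collected is given here. It assumes a callable `fetch_citations` that stands in for the real scraper and returns a list of citation strings, or `None` or an empty list when the page yields nothing; both that callable and the name `scrape_all` are hypothetical.

```python
def scrape_all(mr_numbers, fetch_citations):
    """Split MR numbers into those with results and those to revisit.

    `fetch_citations` is a hypothetical stand-in for the real scraper.
    """
    results = {}
    review_later = []
    for mrn in mr_numbers:
        try:
            citations = fetch_citations(mrn)
        except Exception:
            # Assumption: any failure (network, parsing) is treated
            # the same way as an empty page.
            citations = None
        if citations:
            results[mrn] = citations
        else:
            review_later.append(mrn)
    return results, review_later
```

A second pass after some time could then reuse the same function, for example `results2, still_empty = scrape_all(review_later, fetch_citations)`, with `still_empty` left for manual inspection.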

fliqper commented 3 years ago

I think it is good to include citations containing "We used GAP for this paper" or similar in the analysis, and perhaps filter them out later as incorrect, if you agree.

olexandr-konovalov commented 3 years ago

It is important not to throw away too early data that may still be relevant to some things in #9 and maybe in #10. I think the code should deal with missing values properly. Pandas has some useful functionality for that, such as https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
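
As an illustration of that point, the sketch below keeps every row in the main table and drops missing values only in the view that a particular analysis needs. The column names and sample rows are invented for the example and are not the project's actual schema.

```python
import pandas as pd

# Hypothetical columns and rows; the real dataset's schema may differ.
df = pd.DataFrame({
    "mrn": ["MR0000001", "MR0000002", "MR0000003"],
    "citation": ["GAP 4.10.2, The GAP Group", None, "We used GAP for this paper"],
    "version": ["4.10.2", None, None],
})

# Keep every row, but record which ones have no citation text.
df["has_citation"] = df["citation"].notna()

# Drop rows only for the analysis that needs a given column,
# leaving the full table intact for other questions.
with_version = df.dropna(subset=["version"])
with_citation = df.dropna(subset=["citation"])
```

Using `dropna(subset=[...])` rather than a bare `dropna()` avoids discarding a row just because an unrelated column is empty.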

olexandr-konovalov commented 3 years ago

P.S. @fliqper why is this closed? It might be too early to consider this implemented...

fliqper commented 3 years ago

Yes, this is my idea: to deal with citations such as "We used GAP for this paper" using Pandas, in the second stage, during pre-processing. But I don't see how we can process empty matches, which return None, any more properly, because they have no text at all. This is why I put them aside for a manual check or for scraping to be retried after a certain period of time.

fliqper commented 3 years ago

I added a few lines, and now the script produces another CSV file containing a list of all MRNs that did not contain GAP, so they can be manually checked or re-scraped after some time. I hope this is good enough, but if not, please let me know.
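
For readers following along, writing and re-reading such a file could look roughly like the sketch below. It is not the code that was actually added to the script; the function names, the `mrn` column header, and the file layout are assumptions.

```python
import csv


def write_review_list(mr_numbers, path):
    """Write MR numbers that produced no GAP citation to a one-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["mrn"])  # header; the column name is an assumption
        for mrn in mr_numbers:
            writer.writerow([mrn])


def read_review_list(path):
    """Read the MR numbers back so they can be fed to the scraper again."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row["mrn"] for row in csv.DictReader(handle)]
```

Keeping the file to a single named column means it can also be opened directly with `pandas.read_csv` during the pre-processing stage.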

olexandr-konovalov commented 3 years ago

Useful idea, I like it!