fliqper / GAP-Citations-Analyzator

GAP Citations project

Improving code fault-tolerance #12

Open olexandr-konovalov opened 3 years ago

olexandr-konovalov commented 3 years ago

What can possibly go wrong, and which situations the code should be able to handle:

fliqper commented 3 years ago

OK, so in that case I take it that we assume every input MR must have GAP citations. I will therefore store pages that return no results (for whatever reason) in a separate list, review_later. That list can be fed back to the scraper after some time, or inspected manually if the pages still produce no citations after the second scrape.
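A minimal sketch of that split, not the project's actual code. Only the name review_later comes from this thread; `collect_citations` and the `fetch_citation` callable are hypothetical stand-ins for whatever the scraper really does, and a failed request is assumed to be treated the same as an empty result.

```python
from typing import Callable, Dict, List, Optional, Tuple


def collect_citations(
    mr_numbers: List[str],
    fetch_citation: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Split MR numbers into those with a citation and those to review later.

    `fetch_citation` is a hypothetical callable that returns the GAP
    citation text for one MR number, or None when nothing was found.
    """
    found: Dict[str, str] = {}
    review_later: List[str] = []
    for mrn in mr_numbers:
        try:
            citation = fetch_citation(mrn)
        except Exception:
            # Assumption: a failed request is handled like an empty result,
            # so one bad page does not stop the whole run.
            citation = None
        if citation:
            found[mrn] = citation
        else:
            # Nothing usable came back: keep the MR number for a re-scrape
            # or a manual check.
            review_later.append(mrn)
    return found, review_later
```

The returned review_later list can be passed back in as `mr_numbers` for a second run later on.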

fliqper commented 3 years ago

I think it is good to include citations containing "We used GAP for this paper" or similar in the analysis, and perhaps filter them out later as incorrect, if you agree.

olexandr-konovalov commented 3 years ago

It is important not to throw away too early any data that may still be relevant to some things in #9 and maybe in #10. I think the code should deal with missing values properly. Pandas has some useful functionality for that, such as https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
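For illustration, a small sketch of how missing values could be handled without discarding rows up front. The column names and sample rows are made up for this example and are not taken from the project's data.

```python
import pandas as pd

# Hypothetical sample data: one row per MR number.
df = pd.DataFrame({
    "mrn": ["MR0000001", "MR0000002", "MR0000003"],
    "citation": ["GAP, Version 4.10.2", None, "We used GAP for this paper"],
    "version": ["4.10.2", None, None],
})

# Rows with no citation text at all: set aside, not deleted.
to_review = df[df["citation"].isna()]

# Drop rows only where the column needed for a given analysis is missing.
# `subset` limits the check to the listed columns, so a row with a citation
# but no version survives the first filter.
with_citation = df.dropna(subset=["citation"])
with_version = df.dropna(subset=["citation", "version"])
```

Because `dropna` returns a new DataFrame by default, the full table in `df` stays available for other analyses.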

olexandr-konovalov commented 3 years ago

P.S. @fliqper why is this closed? It might be too early to consider this implemented...

fliqper commented 3 years ago

Yes, this is my idea: to deal with citations such as "We used GAP for this paper" using Pandas, in the second stage, during pre-processing. But I don't see how we can process empty matches, which return None, any better, because they have no text at all. This is why I put them aside for a manual check, or for scraping to be retried after a certain period of time.
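One possible way to mark such citations during pre-processing, as a rough sketch. The rule used here (a citation counts as informal if it has no "Version x.y" string) is only an assumed heuristic, and `flag_informal` is a hypothetical helper, not something in the repository.

```python
import re
from typing import Iterable, List, Tuple

# Assumed heuristic: a proper GAP citation mentions "Version <number>".
VERSION_PATTERN = re.compile(r"\bVersion\s+\d+(\.\d+)+", re.IGNORECASE)


def flag_informal(citations: Iterable[str]) -> List[Tuple[str, bool]]:
    """Pair each citation with True if it looks informal (no version found).

    Nothing is dropped here; the flag only lets a later stage filter.
    """
    return [(text, VERSION_PATTERN.search(text) is None) for text in citations]
```

Keeping the flag next to the text means the informal citations stay in the data set and can be filtered, or counted, at analysis time.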

fliqper commented 3 years ago

I added a few lines and now the script produces another .CSV file containing a list of all MRNs that did not contain GAP, so they can be manually checked or re-scraped after some time. I hope this is good enough, but if not please let me know.
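Roughly, the extra output could look like the sketch below. The function names, the single `mrn` column and the file layout are assumptions for illustration; the actual script may write the file differently.

```python
import csv
from typing import Iterable, List


def write_review_list(mr_numbers: Iterable[str], path: str) -> int:
    """Write MR numbers needing review to a one-column CSV; return the row count."""
    rows = sorted(set(mr_numbers))  # de-duplicate and keep a stable order
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["mrn"])  # header row
        writer.writerows([mrn] for mrn in rows)
    return len(rows)


def read_review_list(path: str) -> List[str]:
    """Read the MR numbers back, e.g. to feed them to the scraper again."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row["mrn"] for row in csv.DictReader(handle)]
```

Reading the file back with `read_review_list` gives the input for a later re-scrape.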

olexandr-konovalov commented 3 years ago

Useful idea, I like it!