Open olexandr-konovalov opened 3 years ago
OK, so in that case I take it that we will assume every input MR must have GAP citations. Pages returning no results (for whatever reason) I will therefore store in a separate list, review_later, which can be fed back into the scraper after a period of time, or inspected manually if they still produce no citations after the second scrape.
I think it is good to include citations such as "We used GAP for this paper" (or similar) in the analysis and perhaps filter them out later as incorrect, if you agree.
It is important not to throw away too early any data that may still be relevant to some things in #9 and maybe in #10. I think the code should deal with missing values properly. Pandas has some useful functionality for that, such as https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
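To illustrate the point about missing values, here is a minimal sketch, not the project's actual code. The column names mrn and citation and the placeholder MR numbers are assumptions made for the example. It marks missing citations first and only drops them at the point where an analysis really needs the text, so nothing is thrown away early.

```python
import pandas as pd

# Hypothetical scraped records; column names and MR numbers are placeholders.
records = pd.DataFrame(
    {
        "mrn": ["MR0000001", "MR0000002", "MR0000003"],
        "citation": [
            "The GAP Group, GAP, Version 4.4",
            None,  # the scrape returned nothing for this MRN
            "We used GAP for this paper",
        ],
    }
)

# Mark missing values instead of discarding the rows straight away.
records["missing_citation"] = records["citation"].isna()

# Drop rows only for the step that needs citation text; `records` is unchanged.
with_text = records.dropna(subset=["citation"])
```

Because dropna returns a new DataFrame by default, the full table with its missing_citation flag stays available for the other analyses.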
P.S. @fliqper why is this closed? It might be too early to consider this implemented...
Yes, this is my idea: to deal with citations such as "We used GAP for this paper" using Pandas, in the second stage, during pre-processing.
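A rough sketch of what that pre-processing step might look like is below. The rule used here (a citation counts as formal only if it mentions a version number) is purely an assumed heuristic for illustration, and the function name flag_informal, the column names, and the sample rows are hypothetical. The informal citations are flagged rather than removed, so they can still be filtered out later.

```python
import pandas as pd

# Hypothetical input; column names and MR numbers are placeholders.
citations = pd.DataFrame(
    {
        "mrn": ["MR0000001", "MR0000002", "MR0000003"],
        "citation": [
            "The GAP Group, GAP - Groups, Algorithms, and Programming, Version 4.4",
            "We used GAP for this paper",
            None,
        ],
    }
)


def flag_informal(frame):
    """Add an 'informal' column; rows are kept so they can be filtered later."""
    out = frame.copy()
    # Assumed heuristic: a formal citation mentions "Version <digit>".
    has_version = out["citation"].str.contains(r"version\s*\d", case=False, na=False)
    # Missing citations are not informal; they are simply missing.
    out["informal"] = out["citation"].notna() & ~has_version
    return out


flagged = flag_informal(citations)
```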
But for empty matches, which return None, I don't see how we can process them any better, because they have no text at all. This is why I put them aside for a manual check, or for scraping to be retried after a certain period of time.
I added a few lines, and now the script produces another .CSV file containing a list of all MRNs that did not contain GAP, so they can be manually checked or re-scraped after some time. I hope this is good enough, but if not please let me know.
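For anyone reading along, the approach can be sketched roughly as follows. This is not the actual script: the function names split_results and write_review_list, the dict input mapping each MRN to its scraped citation (or None), and the single mrn column are all assumptions made for the example.

```python
import csv


def split_results(scraped):
    """Separate MRNs with a citation from those where the scrape returned None."""
    found, review_later = {}, []
    for mrn, citation in scraped.items():
        if citation is None:
            review_later.append(mrn)  # keep for a manual check or a later re-scrape
        else:
            found[mrn] = citation
    return found, review_later


def write_review_list(mrns, path):
    """Write the MRNs to a one-column CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["mrn"])
        writer.writerows([mrn] for mrn in mrns)
```

The resulting file can be read back and passed to the scraper again as its list of input MRNs.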
Useful idea, I like it!
What can possibly go wrong, and which situations should the code be able to handle: