earlng / academic-pdf-scrap

Code that scrapes the contents of the PDF papers submitted for NeurIPS 2020
MIT License

Pull no more than one impact statement per paper #5

Closed by paulsedille 3 years ago

paulsedille commented 3 years ago

Is your feature request related to a problem? Please describe.
Currently, the code pulls every section of a paper whose title includes the word "impact," even if that section is not an "Impact Statement." This also means that if a paper includes an impact statement AND another section with "impact" in its title, the code will output more than one "impact statement" for that paper.

Describe the solution you'd like
To minimise these problems, I would like the code to pull only the last section whose title includes "impact" in cases where there is more than one such section. By "last" I mean the one that appears latest in the body of the XML/paper. Impact statements are typically placed at the end of a paper, since they do not count toward the 8-page limit imposed by NeurIPS; therefore, if more than one section has "impact" in the title, the correct one to pull is most likely the latest one.
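For illustration, a minimal sketch of that "keep the last match" logic, assuming the parser already yields (title, text) pairs in document order (the `sections` iterable and the function name here are hypothetical, not the repo's actual API):

```python
def last_impact_section(sections):
    """Return the text of the last section whose title mentions 'impact'.

    `sections` is assumed to be an iterable of (title, text) pairs in
    document order; returns None when no title matches.
    """
    match = None
    for title, text in sections:
        if "impact" in title.lower():
            match = text  # a later match overwrites any earlier one
    return match
```

Overwriting inside the loop like this is effectively the "copy over each old section" alternative described below: the last match in document order is the one that survives.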

Describe alternatives you've considered
Not sure; perhaps we could "copy over" each old section in the dataframe as a new one is found, so that the last match is the one that is kept?

paulsedille commented 3 years ago

There is a problem with this method: some of the sections with "impact" in the title that are erroneously pulled in fact appear after the proper impact statement (as in 90599c8fdd2f6e7a03ad173e2f535751). This happens because the code pulls "sections" from the bibliography whenever a reference entry containing "impact" ends up under what the XML labelled "h1."
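One hedged heuristic for those bibliography false positives might be to stop considering headings once the references section begins. A sketch under the same assumptions as the earlier snippet (the heading-prefix check is a guess at how reference headings are titled, not confirmed against the repo's data):

```python
def sections_before_references(sections):
    """Yield (title, text) pairs up to, but not including, the bibliography."""
    for title, text in sections:
        if title.strip().lower().startswith(("references", "bibliography")):
            break  # anything after this point is likely a reference entry
        yield title, text
```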

earlng commented 3 years ago

The current logic is that the code will continue parsing through a PDF document until it finds all instances of what it considers to be impact statements. The rough criterion being: an "h1" heading in the XML whose text contains the word "impact."

I could change the logic so that once it finds the first instance that satisfies the above requirement, it just breaks out of the loop and moves on to the next file.
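Concretely, that break-on-first-match variant might look like this (a sketch under the same hypothetical `sections` assumption as the earlier snippets):

```python
def first_impact_section(sections):
    """Return the text of the first section whose title mentions 'impact'."""
    for title, text in sections:
        if "impact" in title.lower():
            return text  # stop at the first hit and move on to the next file
    return None
```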

Would that work?

paulsedille commented 3 years ago

No, best not to do that. I just sifted through the current output.csv and there seem to be only a handful of erroneous impact statements (i.e. texts pulled by the code that are not actually impact statements), something like 8 of them, so I'll just go through those manually. What I would like, though, is for the code to include every single paper in the output.csv, regardless of whether or not an impact statement is found. If nothing is found, the code can include only the paper title, identifier, and link; the other columns can stay empty. We can consider this issue closed once the code does that.
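A minimal sketch of that output behaviour, assuming a hypothetical find_impact_statement helper and a list of paper dicts (all names and column headers here are illustrative, not the repo's actual schema):

```python
import pandas as pd

def write_output(papers, path="output.csv"):
    rows = []
    for paper in papers:
        statement = find_impact_statement(paper)  # hypothetical helper; may return None
        rows.append({
            "title": paper["title"],
            "identifier": paper["id"],
            "link": paper["link"],
            # leave the statement column empty when nothing was found
            "impact_statement": statement or "",
        })
    pd.DataFrame(rows).to_csv(path, index=False)
```

The key point is that the row is appended unconditionally; only the statement column varies with whether a match was found.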