earlng / academic-pdf-scrap

Code that scrapes the contents of the PDF papers submitted for NeurIPS 2020
MIT License

Improve sentence count #21

Open paulsedille opened 3 years ago

paulsedille commented 3 years ago

Currently, the code counts BIS (broader impact statement) sentences by simply counting the number of final punctuation markers (. ! ?). This is not perfect because strings like "e.g." or "1.5 gallons" incorrectly add to the sentence count.

Ideally, the script would take these exceptional cases into account and reflect this in the final count.

There is an easy fix for the two most common cases, "e.g." and "i.e.": subtract 2 from the sentence count for every occurrence of either substring in the BIS text, since each one contains two full stops. More complex solutions might be (1) to automatically dismiss any sentence that is shorter than X characters (around 3-10 seems appropriate) and/or (2) to count ".", "!", or "?" only if they are followed by a blank space (that is, count ". ", "! " and "? "). This would help exclude rarer false positives, for example tables, lists or numerical values that include full stops (such as "934.2" or "1. Computation Cost, 2. Training Data").
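A minimal sketch of the easy fix, assuming the current count is a plain regex count of ".", "!" and "?". The function names and the `ABBREVIATIONS` tuple are made up for illustration and are not taken from the repository.

```python
import re

# Abbreviations whose two internal full stops should not count as sentence ends.
# (Hypothetical list; extend as needed.)
ABBREVIATIONS = ("e.g.", "i.e.")


def naive_count(text: str) -> int:
    """Count every '.', '!' or '?' in the text, as described in this issue."""
    return len(re.findall(r"[.!?]", text))


def adjusted_count(text: str) -> int:
    """Naive count minus 2 for each occurrence of a known abbreviation."""
    lowered = text.lower()
    occurrences = sum(lowered.count(abbr) for abbr in ABBREVIATIONS)
    return naive_count(text) - 2 * occurrences


sample = "We use large models, e.g. transformers. This costs 1.5 gallons of fuel."
print(naive_count(sample), adjusted_count(sample))
```

On this two-sentence sample the naive count is 5 and the adjusted count is 3: the "e.g." is corrected, but the decimal point in "1.5" still adds one.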

paulsedille commented 3 years ago

I've realised that counting ".", "!", or "?" only if they are followed by a blank space might skip sentences at the end of paragraphs (depending on how paragraph ends are encoded in the XML?).
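One way around the end-of-paragraph problem, sketched under the assumption that the BIS text is available as a single string: accept a terminator when it is followed by any whitespace (including a newline) or by the end of the text. The sketch also applies the minimum-length idea; the `min_chars` default of 4 is an arbitrary pick from the 3-10 range, and `count_sentences` is a hypothetical name.

```python
import re

# A terminator counts only if whitespace or the end of the text follows it,
# so "934.2" is ignored but a sentence closing a paragraph is still counted.
SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")


def count_sentences(text: str, min_chars: int = 4) -> int:
    """Count terminators followed by whitespace/end, skipping tiny fragments."""
    count = 0
    start = 0
    for match in SENTENCE_END.finditer(text):
        candidate = text[start:match.end()].strip()
        # Fragments such as the list marker "1." are too short to be sentences.
        if len(candidate) >= min_chars:
            count += 1
        start = match.end()
    return count


print(count_sentences("The value is 934.2 in total. It works!"))
```

This is still a heuristic: an "e.g." in the middle of a sentence is followed by a space, so it continues to be counted as a sentence end.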

earlng commented 3 years ago

Following a suggestion here, I think we can use the nltk package instead of re for the sentence count. (Documentation here.)

The issue is that it still double counts cases that include e.g. or i.e. But it does take decimal points into consideration, so I consider it an improvement.
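A possible workaround for the remaining double count, sketched here as an assumption rather than as what the script does: remove the dots from "e.g." and "i.e." before the text reaches the tokenizer. The helper names are hypothetical, and `simple_split` is only a stand-in so that the example runs without extra packages. If nltk and its Punkt data are installed, `nltk.tokenize.sent_tokenize` could be passed as `tokenizer` instead.

```python
import re
from typing import Callable, List

# Matches "e.g." or "i.e." (any case) at a word boundary.
ABBREVIATION = re.compile(r"\b(e\.g|i\.e)\.", flags=re.IGNORECASE)


def strip_abbreviation_dots(text: str) -> str:
    """Rewrite 'e.g.' as 'eg' and 'i.e.' as 'ie' so they cannot end a sentence."""
    return ABBREVIATION.sub(lambda m: m.group(1).replace(".", ""), text)


def simple_split(text: str) -> List[str]:
    """Stand-in tokenizer: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def count_sentences(text: str,
                    tokenizer: Callable[[str], List[str]] = simple_split) -> int:
    """Count sentences after neutralising the two common abbreviations."""
    return len(tokenizer(strip_abbreviation_dots(text)))


sample = "Models can be misused, e.g. for surveillance. The cost was 1.5 gallons."
print(len(simple_split(sample)), count_sentences(sample))
```

With the stand-in tokenizer, the sample splits into 3 pieces without the preprocessing step and 2 with it.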

earlng commented 3 years ago

It's generally ok. But if an "e.g." or a particularly messy sentence is involved, it could be off by about 1-2 sentences.