Closed by wkiri 3 years ago
To fix these problems, we should expand the range of years matched by the regular expression (see the code above).
I have confirmed that these code updates solve the header issue and UTF-8 degree symbol issue, and that parse_all.py
runs with our current CoreNLP 4.2.0.
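For anyone following along, here is a minimal sketch of what a header-stripping regular expression with an expanded year range could look like. This is not the actual code from the repository: the header format (for example "Lunar and Planetary Science XXIX (1998) 1234.pdf"), the name `HEADER_RE`, and the helper `strip_page_header` are all assumptions made for illustration.

```python
import re

# Hypothetical pattern, not the project's actual regex. It assumes page
# headers of the form
#   "Lunar and Planetary Science XXIX (1998) 1234.pdf"
#   "45th Lunar and Planetary Science Conference (2014) 2345.pdf"
# The year group accepts any year from 1900 to 2099 rather than a fixed
# list of years, which is one way to "expand the years".
HEADER_RE = re.compile(
    r"\s*(?:\d{2}(?:st|nd|rd|th) )?"                # optional ordinal, e.g. "45th "
    r"Lunar and Planetary Science(?: Conference)? "
    r"(?:[IVXLC]+ )?"                               # optional Roman numeral, e.g. "XXIX "
    r"\((?:19|20)\d{2}\) "                          # year in parentheses
    r"\d{4}\.pdf\s*"                                # abstract number, e.g. "1234.pdf"
)


def strip_page_header(text: str) -> str:
    """Replace each assumed page header with a single space, then trim."""
    return HEADER_RE.sub(" ", text).strip()
```

A pattern like this would be applied to the extracted text before sentence splitting, so that a header in the middle of a page does not end up inside a sentence.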
Some LPSC PDFs accidentally include the page header in the text content.
From PHX sentences.csv:
There are some cases of this in the MPF sentences.csv as well: