kwalcock closed 2 years ago
@kwalcock I am not sure how bad it is that we remove numbers from tables. Granted, we do not read from tables, so it's probably fine as long as it helps to assemble paragraphs?
Now I am curious. Processors does not know about paragraphs, but does something else?
Yeah, I don't think we use paragraph information for anything in processors at the moment, so it's probably fine.
I apparently did not press the button while the information below was current. It seems to be working. I don't think the page numbers are being removed very often. Usually it's some number from a table instead.
This is still being tested locally, but @maxaalexeeva may want to take a look, particularly at https://github.com/clulab/pdf2txt/blob/kwalcock/fancyTrim/scienceparse/src/main/scala/org/clulab/pdf2txt/scienceparse/ParagraphPreprocessor.scala.