clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0

Assimilate fancy trim of ScienceParse conversion #42

Closed kwalcock closed 2 years ago

kwalcock commented 2 years ago

I apparently did not press the button while the information below was current. It seems to be working. I don't think page numbers are being removed very often. Usually what gets removed is some number from a table instead.

This is still being tested locally, but @maxaalexeeva may want to take a look, particularly at https://github.com/clulab/pdf2txt/blob/kwalcock/fancyTrim/scienceparse/src/main/scala/org/clulab/pdf2txt/scienceparse/ParagraphPreprocessor.scala.

maxaalexeeva commented 2 years ago

@kwalcock I am not sure how bad it is that we remove numbers from tables. Given that we do not read from tables, it's probably fine as long as it helps to assemble paragraphs?

kwalcock commented 2 years ago

Now I am curious. Processors does not know about paragraphs, but does some other component?

MihaiSurdeanu commented 2 years ago

Yeah, I don't think we use paragraph information for anything in processors at the moment, so it's probably fine.