clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0

Fix formatting issues #20

Closed kwalcock closed 2 years ago

kwalcock commented 2 years ago

One document, the attached, gets converted by Tika to a text file with each line (or at least many of them) separated by blank lines, as if they were separate paragraphs. The preprocessor then adds periods to the ends of these supposed paragraphs. Instead, it should join the lines into paragraphs. It is difficult to decide which of the two it should do.

1-s2.0-S0378429001001289-main.pdf 1-s2.0-S0378429001001289-main-nopre.txt

MihaiSurdeanu commented 2 years ago

Thanks @kwalcock !

I think we can implement the following heuristic: merge two lines separated by an empty line if the second one starts with a lowercase character. This would merge lines such as:

2.1. Experimental site, plant materials and initial

growth conditions

but would not merge:

growth conditions

All experiments were conducted at the research

or

2. Materials and methods

2.1. Experimental site, plant materials and initial
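
For illustration, the heuristic above could be sketched roughly as follows. This is only a minimal Python sketch, not the project's actual preprocessor; the function name `merge_split_lines` is hypothetical, and it assumes the converted text arrives as a single string with `\n` line endings.

```python
def merge_split_lines(text: str) -> str:
    """Join a line to the previous one when the two are separated by a
    single empty line and the later line starts with a lowercase letter.
    Hypothetical helper; not part of pdf2txt."""
    out = []
    for line in text.split("\n"):
        stripped = line.strip()
        if (
            stripped[:1].islower()        # later line starts in lower case
            and len(out) >= 2
            and out[-1].strip() == ""     # directly preceded by an empty line
            and out[-2].strip() != ""     # which follows a non-empty line
        ):
            out.pop()                     # drop the separating empty line
            out[-1] = out[-1].rstrip() + " " + stripped
        else:
            out.append(line)
    return "\n".join(out)
```

Under these assumptions, the heading split across `initial` and `growth conditions` would be joined into one line, while lines that start with a digit or an upper-case letter stay separate.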

maxaalexeeva commented 2 years ago

Hi @kwalcock, are you working on this or shall I?

kwalcock commented 2 years ago

I'm planning to do it (make one pass) tomorrow along with the science parse.

kwalcock commented 2 years ago

This has been addressed in #22. GitHub is having problems, though, and not showing the most recent commits. However, they don't change the output anyway.