kwalcock closed this issue 2 years ago
Thanks @kwalcock !
I think we can implement the following heuristic: merge two lines separated by an empty line if the second line starts with a lowercase character. This would merge lines such as:
2.1. Experimental site, plant materials and initial
growth conditions
but would not merge:
growth conditions
All experiments were conducted at the research
or
2. Materials and methods
2.1. Experimental site, plant materials and initial
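A minimal sketch of this heuristic in Python, for illustration only. The function name `merge_broken_lines` is hypothetical and is not taken from the preprocessor; the sketch assumes that lines are separated by `\n` and that exactly one empty line sits between the two lines to be merged.

```python
def merge_broken_lines(text: str) -> str:
    """Join a line with the line after an empty line when that
    following line starts with a lowercase character (hypothetical helper)."""
    lines = text.split("\n")
    merged = []
    i = 0
    while i < len(lines):
        line = lines[i]
        # Keep absorbing continuations: non-empty line, one empty line,
        # then a line whose first non-space character is lowercase.
        while (
            i + 2 < len(lines)
            and line.strip()
            and lines[i + 1].strip() == ""
            and lines[i + 2].lstrip()[:1].islower()
        ):
            line = line.rstrip() + " " + lines[i + 2].lstrip()
            i += 2  # skip the empty line and the absorbed continuation
        merged.append(line)
        i += 1
    return "\n".join(merged)
```

With this sketch, the heading split before "growth conditions" is joined, while a following line that starts with an uppercase letter or a digit is left as a separate paragraph.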
Hi @kwalcock, are you working on this or shall I?
I'm planning to do it (make one pass) tomorrow along with the science parse.
This has been addressed in #22. GitHub is having problems, though, and not showing the most recent commits. However, they don't change the output anyway.
One document (attached) gets converted by Tika to a text file in which each line, or at least many of them, is separated from the next by a blank line, as if the lines were separate paragraphs. The preprocessor then adds periods to the ends of these apparent paragraphs. Instead, it should join the lines into paragraphs. It is difficult to decide which of the two it should do.
1-s2.0-S0378429001001289-main.pdf 1-s2.0-S0378429001001289-main-nopre.txt