Open kwalcock opened 1 year ago
Right now WordBreakByHyphen is getting
population-wide
time-varying
but WordBreakBySpace is messing up
one-to
@enoriega, pdf2txt is not doing well on hyphenated words because the hyphens do not appear at the end of lines. Can you check your pipeline to see if something is removing EOLs (and replacing them with spaces)?
@kwalcock looking into that. Most likely the text is coming out of COSMOS this way.
@kwalcock I asked Ian Ross about this and they can adapt COSMOS to keep the newline characters for us with a toggle. I think this is likely to happen after the Hackathon, so let's circle back to it soon.
That would be great!
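Until such a toggle exists, one way to see how widespread the problem is would be to scan the text for a hyphen followed by a single space. The following is only a minimal sketch in Python, not how pdf2txt handles it: the pattern and the function name `find_split_hyphenations` are hypothetical, and it assumes the lost EOL always shows up as exactly one space before a lowercase continuation.

```python
import re

# Hypothetical pattern (an assumption, not part of pdf2txt):
# a run of letters, a hyphen, exactly one space, then a lowercase run.
SPLIT_HYPHEN = re.compile(r"\b([A-Za-z]+)- ([a-z]+)\b")


def find_split_hyphenations(text):
    """Return (left, right) pairs for every "left- right" candidate in text."""
    return [(m.group(1), m.group(2)) for m in SPLIT_HYPHEN.finditer(text)]


if __name__ == "__main__":
    sample = "The popula- tion is time- varying, not one-to-one."
    for left, right in find_split_hyphenations(sample):
        # Print each suspicious pair so a person can review it.
        print(f"{left}- {right}")
```

The matches are only candidates. Whether the hyphen should be dropped ("popula- tion") or kept ("time- varying") still has to be decided separately, and legitimate suspended hyphens such as "pre- and post-" will also be flagged.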
We're seeing documents in which the parts of hyphenated words are separated not by \n but by a space. Something in a pipeline has tried to put an entire paragraph on a single line and seems to have simply replaced \n with a space without taking the hyphens into account. The converter does not expect this, so a special pass needs to be made over the text to look for these cases. Here is a list of many of the suspicious instances: