clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0

Hyphen space needs to be handled differently #65

Open kwalcock opened 1 year ago

kwalcock commented 1 year ago

We're seeing documents in which the parts of a hyphenated word are separated by a space instead of by \n. Something in the pipeline has apparently tried to put each paragraph on a single line by replacing \n with a space, without taking the hyphens into account. The converter does not expect this, so a special pass over the text is needed to look for these cases. Here is a list of many of the suspicious instances:

| wrong | right | notes |
| --- | --- | --- |
| ac- counting | accounting | |
| appro- priately | appropriately | |
| com- partmental | compartmental | |
| com- partments | compartments | |
| com- putational | computational | |
| cov- erage | coverage | |
| COVID- 19 | COVID-19 | remove space only |
| cu- mulative | cumulative | |
| cumu- lative | cumulative | |
| Death- Only | Death-Only | remove space only |
| develop- ment | development | |
| distri- bution | distribution | |
| Dormand- Prince | Dormand-Prince | remove space only |
| effec- tively | effectively | |
| epi- demic | epidemic | |
| epidemi- ological | epidemiological | |
| ev- idence | evidence | |
| Fixed- Detection | Fixed-Detection | remove space only |
| Fore- cast | Forecast | |
| fore- casts | forecasts | |
| forecast- ing | forecasting | |
| im- plementation | implementation | |
| includ- ing | including | |
| loca- tion | location | |
| log- odds | log-odds | remove space only |
| Maclau- rin | Maclaurin | |
| Mech- Bayes | MechBayes | |
| nonpara- metric | nonparametric | |
| observ- able | observable | |
| one- to | one- to | do not remove even space |
| param- eters | parameters | |
| population- wide | population-wide | remove space only |
| pos- terior | posterior | |
| pre- diction | prediction | |
| Prepa- ration | Preparation | |
| prereq- uisite | prerequisite | |
| prob- abilistic | probabilistic | |
| probabil- ity | probability | |
| proper- ties | properties | |
| ra- tio | ratio | |
| re- productive | reproductive | |
| rea- son | reason | |
| rel- ative | relative | |
| report- ing | reporting | |
| res- piratory | respiratory | |
| respon- sibility | responsibility | |
| set- ting | setting | |
| strate- gies | strategies | |
| time- varying | time-varying | remove space only |
| un- certainty | uncertainty | |
| vari- ables | variables | |
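The three outcomes in the list (join the halves, remove only the space, or leave the text alone) could be decided with a vocabulary lookup. Here is a minimal sketch in Python, not the pdf2txt implementation: `repair_hyphen_space`, the regex, and the tiny `vocabulary` set are all hypothetical and only illustrate the decision order.

```python
import re

# Hypothetical pattern: letters, a hyphen, exactly one space, then a word or number.
HYPHEN_SPACE = re.compile(r"\b([A-Za-z]+)- ([A-Za-z0-9]+)\b")


def repair_hyphen_space(text, vocabulary):
    """Repair 'left- right' sequences using a set of known lowercase words."""

    def fix(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = left + "-" + right
        if joined.lower() in vocabulary:
            # Known unhyphenated word: drop the hyphen and the space.
            return joined
        if hyphenated.lower() in vocabulary:
            # Known hyphenated word: drop only the space.
            return hyphenated
        # Unknown either way (e.g. "one- to"): leave the text untouched.
        return match.group(0)

    return HYPHEN_SPACE.sub(fix, text)


# Assumed toy vocabulary; a real pass would need a much larger word list.
vocabulary = {"accounting", "covid-19", "log-odds", "time-varying"}
sample = "ac- counting for COVID- 19 uses log- odds and one- to many maps"
print(repair_hyphen_space(sample, vocabulary))
```

Checking the joined form first means a word such as "forecasting" is repaired even if a hyphenated variant is also in the vocabulary. Whether that is the right priority depends on the word list.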

kwalcock commented 1 year ago

Right now WordBreakByHyphen is handling these correctly:

population-wide
time-varying

but WordBreakBySpace is messing up this one:

one-to

kwalcock commented 1 year ago

@enoriega, pdf2txt is not doing well on hyphenated words because they do not appear at the end of lines. Can you check your pipeline to see if something is removing EOLs (and replacing them with spaces)?
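One rough way to check a given text for this symptom is to count how often a hyphen is followed by a newline versus by a space. This is only a heuristic sketch in Python; `hyphen_break_counts` is a hypothetical helper and not part of pdf2txt or COSMOS.

```python
import re


def hyphen_break_counts(text):
    """Return (hyphens at end of line, hyphens followed by a space)."""
    # A letter, a hyphen, then a newline: the break the converter expects.
    at_eol = len(re.findall(r"[A-Za-z]-\n", text))
    # A letter, a hyphen, a space, then a letter or digit: the suspicious form.
    before_space = len(re.findall(r"[A-Za-z]- [A-Za-z0-9]", text))
    return at_eol, before_space


sample = "com- putational models\nuse epi-\ndemic data"
print(hyphen_break_counts(sample))
```

A document with many hyphen-space hits and few or no hyphen-newline hits would suggest that the EOLs were replaced upstream.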

enoriega commented 1 year ago

@kwalcock looking into that. Most likely it is coming out of COSMOS this way.

enoriega commented 1 year ago

@kwalcock I asked Ian Ross about this, and they can adapt COSMOS to keep the newline characters for us with a toggle. I think this is likely to happen after the Hackathon, so let's circle back to it soon.

kwalcock commented 1 year ago

That would be great!