clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0

Improve hyphenation and preprocess ligatures #11

Closed kwalcock closed 2 years ago

kwalcock commented 2 years ago

Hyphen processing now takes into account words in the document that already contain hyphens, as well as the document's other words that appear without hyphens. These document words themselves form a DictionaryLanguageModel.
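To illustrate the idea, here is a minimal Python sketch of repairing line-break hyphens with a vocabulary built from the document itself. It is not the project's implementation and does not show how DictionaryLanguageModel works internally. The names `build_document_vocabulary` and `repair_hyphens`, the regular expressions, and the tie-breaking rule (join unless the hyphenated form is more frequent in the document) are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical patterns for this sketch: ASCII words, optionally with internal hyphens.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
# A word fragment ending in a hyphen at a line break, followed by its continuation.
LINE_BREAK_HYPHEN = re.compile(r"([A-Za-z]+(?:-[A-Za-z]+)*)-\n\s*([A-Za-z]+)")


def build_document_vocabulary(text):
    """Count the words seen in the document, ignoring fragments split at line breaks."""
    # Drop the ambiguous line-break candidates so their halves do not enter the vocabulary.
    cleaned = LINE_BREAK_HYPHEN.sub(" ", text)
    return Counter(word.lower() for word in WORD.findall(cleaned))


def repair_hyphens(text):
    """Rejoin words split at line breaks, consulting the document's own vocabulary."""
    vocab = build_document_vocabulary(text)

    def resolve(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = left + "-" + right
        # Keep the hyphen only if the document uses the hyphenated form more often
        # than the joined form elsewhere; otherwise default to joining (an assumption).
        if vocab[hyphenated.lower()] > vocab[joined.lower()]:
            return hyphenated
        return joined

    return LINE_BREAK_HYPHEN.sub(resolve, text)


sample = (
    "The well-known method uses tokenization.\n"
    "A well-\nknown issue is tokeni-\nzation of text.\n"
)
repaired = repair_hyphens(sample)
```

In this sample, "well-known" appears hyphenated elsewhere in the document, so the hyphen is kept, while "tokenization" appears unhyphenated, so the two halves are joined.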

MihaiSurdeanu commented 2 years ago

Very nice. Thanks, @kwalcock!