bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Spurious warnings about sentenceJoin models #51

Closed Proyag closed 4 years ago

Proyag commented 4 years ago

https://github.com/bitextor/pdf-extract/blob/9a3258b1b517ad4fd185f1720a37eb0700d67e28/src/pdfextract/PDFExtract.java#L1179-L1190 checks _hashSentenceJoin for entries for all languages in doc.langList, and complains about the ones it can't find.

The problem is that the code that actually populates _hashSentenceJoin only adds an entry for one language per document:

https://github.com/bitextor/pdf-extract/blob/9a3258b1b517ad4fd185f1720a37eb0700d67e28/src/pdfextract/PDFExtract.java#L1564-L1574

As a result, for every document, we get "No model for language" warnings for all languages except the most common one, even though the models exist.
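
To make the mismatch concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the actual PDFExtract code. The class name, the method names, the AVAILABLE_MODELS list, the warning text format, and the assumption that the first entry of the language list is the most common language are all hypothetical and chosen only for illustration. It contrasts a loader that registers one language per document with one that registers every language in the list, and runs the same "missing model" check against both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every name here is hypothetical, not taken from PDFExtract.java.
public class SentenceJoinWarningSketch {

    // Hypothetical stand-in for the languages that have a sentence-join model available.
    static final List<String> AVAILABLE_MODELS = Arrays.asList("en", "es", "fr");

    // Mimics the reported behaviour: only one language per document gets an entry.
    // Assumption: the first element of langList is the document's most common language.
    static Map<String, String> loadDominantOnly(List<String> langList) {
        Map<String, String> models = new HashMap<>();
        if (!langList.isEmpty()) {
            String dominant = langList.get(0);
            if (AVAILABLE_MODELS.contains(dominant)) {
                models.put(dominant, "model-" + dominant);
            }
        }
        return models;
    }

    // One possible fix: add an entry for every language in the list that has a model.
    static Map<String, String> loadAll(List<String> langList) {
        Map<String, String> models = new HashMap<>();
        for (String lang : langList) {
            if (AVAILABLE_MODELS.contains(lang)) {
                models.put(lang, "model-" + lang);
            }
        }
        return models;
    }

    // Mimics the check: every language in the list without an entry produces a warning.
    static List<String> missingModelWarnings(List<String> langList, Map<String, String> models) {
        List<String> warnings = new ArrayList<>();
        for (String lang : langList) {
            if (!models.containsKey(lang)) {
                warnings.add("No model for language " + lang);
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        List<String> langList = Arrays.asList("en", "es", "fr");
        // Spurious warnings for "es" and "fr", although models exist for both.
        System.out.println(missingModelWarnings(langList, loadDominantOnly(langList)));
        // No warnings once every language in the list is loaded.
        System.out.println(missingModelWarnings(langList, loadAll(langList)));
    }
}
```

In this sketch the first call reports warnings for "es" and "fr" even though both are in AVAILABLE_MODELS, which is the spurious case described in this issue. The second call reports none. Whether loading every language or restricting the check to the loaded language is the better fix for pdf-extract itself depends on how the models are used downstream, which this sketch does not cover.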