clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0
31 stars 5 forks

Evaluate quality of converters #55

Open garanews opened 2 years ago

garanews commented 2 years ago

Is there any comparison of the converters in terms of the quality of the generated output?

kwalcock commented 2 years ago

No, there isn't. The output is highly dependent on the input, particularly on whether there are one or two columns, how many ligatures are present, how prevalent hyphenated words are, how well-behaved the program that built the PDF is, how many equations there are, whether there are embedded images that need OCR, etc. If anyone knows of a standard test set of documents, that would be very helpful. The choice may also depend on what the output is used for. Here we usually can't get anything out of equations because they are too unlike sentences anyway, so it doesn't really matter whether they are right or wrong. If you can find a representative set of the documents you need to convert, I'd suggest giving it a whirl and judging for yourself (and then reporting back).
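For anyone who wants to try that on their own documents, here is a rough sketch of one way to score converters against a hand-corrected reference text. It is not part of pdf2txt. The function names (`normalize`, `word_similarity`, `rank_converters`) and the converter labels in the usage lines are made up for illustration, and word-level similarity from Python's `difflib` is only one possible metric, chosen here as an assumption. It ignores layout and says nothing about equations or OCR quality.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace so line breaks and spacing are ignored."""
    return text.lower().split()


def word_similarity(reference: str, candidate: str) -> float:
    """Return a 0..1 word-level similarity between a reference and a converter output."""
    ref_words = normalize(reference)
    cand_words = normalize(candidate)
    if not ref_words and not cand_words:
        return 1.0  # two empty texts are treated as identical
    # autojunk=False disables the popularity heuristic, which can distort long texts
    return SequenceMatcher(None, ref_words, cand_words, autojunk=False).ratio()


def rank_converters(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Score each converter's output against the reference, best first."""
    scores = {name: word_similarity(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical data: a hand-corrected reference and two made-up converter outputs.
    reference = "The office was efficient and well-known."
    outputs = {
        "converter_a": "The office was efficient and well-known.",
        "converter_b": "The of fice was effi cient and well- known.",  # broken ligatures, hyphen
    }
    for name, score in rank_converters(reference, outputs):
        print(f"{name}: {score:.2f}")
```

Averaging the scores over a representative set of documents gives a rough ranking, but it is still worth reading a few outputs by eye, since a single number hides which kind of error (columns, ligatures, hyphenation) a converter tends to make.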