What exactly is text_best?

ivo-1 commented 2 years ago

First of all, thank you very much for this great dataset and open sourcing it!

I have a question about the text_best column in the input (as seen in in-header.tsv)

text_djvu text_tesseract text_textract text_best

Is the text_best column the "combination of pdf2djvu/djvu2hocr and tesseract tools" that you mention in the README, or is it the Azure CV output mentioned in the paper (Table 3)?

The caption of Table 3 reads: "The detailed results (average F1-scores over 3 runs) of our baselines for Kleister challenges (test sets) for the best PDF processing tool."

This makes it sound like the best text extraction comes from Azure CV.

tstanislawek commented 2 years ago

Hey @ivo-1, you are right: the text_best column contains the text extracted by Azure CV.

Best
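For readers who want to pull text_best out of the data, here is a minimal sketch. It assumes the data rows are tab-separated with no header line and that the column names come from a separate header line such as the one in in-header.tsv. The function name `read_column` and the toy rows are hypothetical and not part of the repository.

```python
def read_column(header_line, data_lines, column="text_best"):
    """Return the values of `column` from header-less, tab-separated rows.

    Assumption: `header_line` holds the tab-separated column names
    (for example the single line stored in in-header.tsv).
    """
    # Map the requested column name to its position in each row.
    names = header_line.rstrip("\n").split("\t")
    index = names.index(column)  # raises ValueError if the column is missing

    # Split each non-empty row on tabs and keep only the requested field.
    return [
        line.rstrip("\n").split("\t")[index]
        for line in data_lines
        if line.rstrip("\n")
    ]


# Toy data standing in for the real files (hypothetical values).
header = "text_djvu\ttext_tesseract\ttext_textract\ttext_best\n"
rows = [
    "djvu one\ttess one\ttextract one\tbest one\n",
    "djvu two\ttess two\ttextract two\tbest two\n",
]
best_texts = read_column(header, rows)
```

Splitting on tabs directly, rather than using a CSV parser, avoids quote handling and field-size limits on very long OCR text, under the assumption that the fields themselves contain no raw tab characters.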

ivo-1 commented 2 years ago

Thanks so much for the quick reply! Appreciate it 🙏

applicaai / kleister-charity
