First of all, thank you very much for this great dataset and open sourcing it!
I have a question about the text_best column in the input (as seen in in-header.tsv):
text_djvu text_tesseract text_textract text_best
Is the text_best column the "combination of pdf2djvu/djvu2hocr and tesseract tools." that you mention in the README? Or is it the result from Azure CV that is mentioned in the paper (Table 3)?
The caption of Table 3 reads: "The detailed results (average F1-scores over 3 runs) of our baselines for Kleister challenges (test sets) for the best PDF processing tool."
This makes it sound like the best text extraction comes from Azure CV.