deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Use tika to convert wide spectrum of text documents #275

Closed · dany-nonstop closed this issue 4 years ago

dany-nonstop commented 4 years ago

**Is your feature request related to a problem? Please describe.** Limited file format support: only txt, docx, and pdf at the moment.

**Describe the solution you'd like** Support extracting text from, and searching within, a wide spectrum of document formats.

**Describe alternatives you've considered** Apache Tika might be a promising candidate, with support for most document formats we could care about. A thin layer could be added to translate BaseConverter calls into Tika REST API calls, perhaps with the help of tika-python. Among the supported formats, HTML, MS Office documents, ODF, PDF, and RTF are particularly interesting for most users. Notably, Tika handles all character encoding internally.
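
For illustration, here is a minimal sketch of what that thin layer could build on, using tika-python against a running Tika server (`example.pdf` is a placeholder):

```python
# Minimal sketch using tika-python (https://github.com/chrismattmann/tika-python).
# tika-python starts (or connects to) a Tika server and proxies its REST API.
from tika import parser

# Parse any Tika-supported format (pdf, docx, html, odf, rtf, ...).
# Returns a dict with "content" (extracted text) and "metadata".
parsed = parser.from_file("example.pdf")

text = parsed["content"]        # plain text; character encoding handled by Tika
metadata = parsed["metadata"]   # e.g. Content-Type, page count, author
print(text[:500])
```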

**Additional context** other benefits

tanaysoni commented 4 years ago

Hi @dany-nonstop, thank you for the proposal and volunteering to contribute! The support for more file formats would be immensely useful for the users.

The suggested approach of having a TikaConverter instance that extends the BaseConverter sounds good to me. In our initial implementation for PDFs, we used the pdftotext library, as it provided good out-of-the-box support for extracting text from multi-column PDFs and removing numeric tabular data. It'd be nice to have that implemented for the Tika converter as well.
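
For reference, the current PDF path boils down to shelling out to the pdftotext CLI, roughly like this (a simplified sketch, not Haystack's actual implementation):

```python
# Sketch of the existing PDF approach: shell out to the pdftotext CLI
# (shipped with xpdf/poppler) and read the extracted text from stdout.
import subprocess

def pdf_to_text(path: str) -> str:
    # "-" sends the output to stdout instead of writing a .txt file
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, check=True, text=True,
    )
    return result.stdout
```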


> retain the page number a paragraph belongs to

From a quick search, I found that Tika adds `<div><p>` at the start and `</p></div>` at the end of each page for PDFs. If that indeed works well, then we can split a file into pages like the other existing file converters do.
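
If that holds, a rough page-splitting sketch could look like this (the `xmlContent` flag is tika-python's; the `div.page` selector is an assumption based on Tika's documented XHTML output; `example.pdf` is a placeholder):

```python
# Hedged sketch: ask Tika for XHTML output and split on the per-page
# <div class="page"> wrappers it emits for PDFs.
from tika import parser
from bs4 import BeautifulSoup  # third-party: beautifulsoup4

parsed = parser.from_file("example.pdf", xmlContent=True)
soup = BeautifulSoup(parsed["content"], "html.parser")

pages = [div.get_text() for div in soup.find_all("div", class_="page")]
for page_number, page_text in enumerate(pages, start=1):
    print(page_number, page_text[:80])
```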

Let me know if there's anything to discuss before you get started. If you prefer, you can create an early work-in-progress pull request.

dany-nonstop commented 4 years ago

I'll start working on it over the weekend.

nsankar commented 4 years ago

@dany-nonstop @tholor Tika is good, but it is Java, and using the Tika server through the Python RPC interface has its own cons... An equivalent Python library is textract: https://textract.readthedocs.io/en/stable/. I have used textract's document converters along with Haystack's cleaner function and it works great! Just thought I would share this info.

dany-nonstop commented 4 years ago

> @dany-nonstop @tholor Tika is good, but it is Java, and using the Tika server through the Python RPC interface has its own cons... An equivalent Python library is textract: https://textract.readthedocs.io/en/stable/. I have used textract's document converters along with Haystack's cleaner function and it works great! Just thought I would share this info.

@nsankar, I somewhat agree with you that textract may be more attractive in terms of the dependencies involved. Tika is more heavyweight and involves setting up a Java environment. But I see two points that make Tika attractive:

I'm now going with Tika first. But feel free to do something with textract if you happen to have some time to spare. ;)

dany-nonstop commented 4 years ago

@tanaysoni I've made a first working version already. It subclasses your file_converters.base.BaseConverter. I use the official Tika Docker container to host the document conversion service; I feel this way we can shield users from installing/building a Java environment themselves or tracking Tika updates. It is also easily scalable in the future (a reverse proxy plus lots of document conversion workers, maybe).
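
For context, a minimal sketch of the shape such a subclass could take (the import path, constructor parameters, and method signature here are assumptions, not the code in the PR). The Tika server itself can be started with e.g. `docker run -d -p 9998:9998 apache/tika`:

```python
# Hedged sketch of the converter's shape, not the merged implementation.
# The BaseConverter import path and the method name are assumptions.
from typing import List

from tika import parser
from haystack.indexing.file_converters.base import BaseConverter  # path assumed


class TikaConverter(BaseConverter):
    """Forwards files to a Tika server (e.g. the official Docker image)."""

    def __init__(self, tika_url: str = "http://localhost:9998", **kwargs):
        super().__init__(**kwargs)
        self.tika_url = tika_url

    def extract_pages(self, file_path: str) -> List[str]:
        # Delegate the actual conversion to the Tika server's REST API;
        # shown simplified here: return the whole text as a single "page".
        parsed = parser.from_file(file_path, serverEndpoint=self.tika_url)
        return [parsed["content"] or ""]
```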

We need to discuss a little how to integrate this: my plan is to revise the README with a short tutorial on using the Tika document converter. Or should I create a separate README? But then where would I put it?

P.S. While testing your PDFToTextConverter I realized that you rely on the external pdftotext utility to convert PDF to text. That's another piece of software for users to install... At least I had to manually install xpdf on my development machine.

tanaysoni commented 4 years ago

Hi @dany-nonstop, that sounds good! As for the documentation, @brandenchan is working on documentation in #272. There's a dedicated section for File Converters, so we can add a getting started guide there. Additionally, a link to the guide can be added in the main README file. What do you think?


> external pdftotext utility tool

Yes, that's how the current implementation works. Hopefully, the Tika converter will make things better :slightly_smiling_face:

dany-nonstop commented 4 years ago

Hi @tanaysoni and @tholor, I have created a pull request for the Tika converter. My pull request targets the master branch, as I forked from there. I also added a few lines of documentation, updated directly in the current README file. Maybe later we could move it to the documentation branch from #272 that you mentioned.

Tika tends to convert docs into small fragments, so I added a little code to merge lines into paragraphs in utils/tika_convert_files_to_dicts, and did a little test with the PDF and HTML versions of Alice's Adventures in Wonderland; it works like a charm. The results are better than the current PDF converter's. I believe this could help improve the accuracy of QA.
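
For illustration, the merging heuristic can be as simple as treating blank lines as paragraph boundaries (a sketch, not the actual code in utils/tika_convert_files_to_dicts):

```python
# Hedged sketch of a line-merging heuristic like the one described:
# join consecutive non-empty lines into paragraphs, treating blank
# lines as paragraph boundaries.
from typing import List

def merge_lines_into_paragraphs(text: str) -> List[str]:
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:                      # blank line ends a paragraph
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```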

On page 2, the Tika converter properly splits the text at paragraph ends:

> CHAPTER XII. Alice's Evidence CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
>
> So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth \n \nthe trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

The current PDF converter, by contrast, cannot make the right split, and the last sentence is cut off:

> CHAPTER I. Down the Rabbit-Hole\nAlice was beginning to get very tired of sitting by her sister on the bank, and of having\nnothing to do: once or twice she had peeped into the book her sister was reading, but it had\nno pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without\npictures or conversation?'\nSo she was considering in her own mind (as well as she could, for the hot day made her\nfeel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth

The same goes for the right column of the first page of a random paper's PDF I'm reading: https://arxiv.org/abs/1911.02116

The Tika converter properly splits the text into three paragraphs, but fails to remove the text line in the left margin, which creeps into the text stream at the end of the first page:

> Multilingual masked language models (MLM)\nlike mBERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019) have pushed the stateof-the-art on cross-lingual understanding tasks\nby jointly pretraining large Transformer models (Vaswani et al., 2017) on many languages.\nThese models allow for effective cross-lingual\ntransfer, as seen in a number of benchmarks including cross-lingual natural language inference\n(Bowman et al., 2015; Williams et al., 2017; Conneau et al., 2018), question answering (Rajpurkar\net al., 2016; Lewis et al., 2019), and named entity recognition (Pires et al., 2019; Wu and Dredze,\n2019). However, all of these studies pre-train on\nWikipedia, which provides a relatively limited scale\nespecially for lower resource languages.
>
> ... (second paragraph)
>
> Our best model XLM-RoBERTa (XLM-R) outperforms mBERT on cross-lingual classification by\nup to 23% accuracy on low-resource languages. It\noutperforms the previous state of the art by 5.1% average accuracy on XNLI, 2.42% average F1-score ar\nX iv\n:1 91\n1.

The current PDF converter, by contrast, merges all paragraphs in that column into one big chunk and cuts off the last sentence at the end of the page:

> 'Multilingual masked language models (MLM)\nlike mBERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019) have pushed the stateof-the-art on cross-lingual understanding tasks\nby jointly pretraining large Transformer models (Vaswani et al., 2017) on many languages.\nThese models allow for effective cross-lingual\ntransfer, as seen in a number of benchmarks including cross-lingual natural language inference\n(Bowman et al., 2015; Williams et al., 2017; Conneau et al., 2018), question answering (Rajpurkar\net al., 2016; Lewis et al., 2019), and named entity recognition (Pires et al., 2019; Wu and Dredze,\n2019). However, all of these studies pre-train on\nWikipedia, which provides a relatively limited scale\nespecially for lower resource languages.\n ...... (second paragraph) ...... Our best model XLM-RoBERTa (XLM-R) outperforms mBERT on cross-lingual classification by\nup to 23% accuracy on low-resource languages. It\noutperforms the previous state of the art by 5.1% average accuracy on XNLI, 2.42% average F1-score'

tanaysoni commented 4 years ago

Hi @dany-nonstop, thank you for working on this and providing detailed examples. The conversions look very good.

dany-nonstop commented 4 years ago

> Hi @dany-nonstop, thank you for working on this and providing detailed examples. The conversions look very good.

Just made another revision to make the return values fully compatible and pass the unit tests.

mchari commented 4 years ago

Is the TikaConverter available for general use? Thanks!

tholor commented 4 years ago

@mchari Yes, absolutely. It was merged with #314. We just forgot to close this issue here :). You can find a brief description here: https://github.com/deepset-ai/haystack#7-indexing-pdf--docx-files

The usage is very similar to the other converters. The only difference: you need to make sure a Tika server is running in the background (e.g. via Docker). If any further questions come up, let us know!
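
A rough usage sketch (the import path and names below are assumptions pieced together from this thread; see the README link above for the actual usage):

```python
# Rough usage sketch; names are assumptions, see the README link above
# for the actual API. Requires a running Tika server, e.g.:
#   docker run -d -p 9998:9998 apache/tika
from haystack.indexing.file_converters.tika import TikaConverter  # path assumed

converter = TikaConverter(tika_url="http://localhost:9998")
pages = converter.extract_pages(file_path="sample.pdf")  # method name assumed
print(pages[0][:200])
```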

mchari commented 4 years ago

thanks @tholor