deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Using haystack with documents #394

Closed: dsvrsec closed this issue 3 years ago

dsvrsec commented 4 years ago

Question: I have a few queries after checking out the Colab notebook. i) If I have a set of documents (pdf, ppt, docx, xlsx), is there any way I can use Haystack to build a query system? If so, can you please provide a reference document?

ii) Can I train these documents on a pretrained model, just to get the embeddings without using them for any downstream task?


tholor commented 4 years ago

Hey @dsvrsec

i) If I have a set of documents (pdf, ppt, docx, xlsx), is there any way I can use Haystack to build a query system? If so, can you please provide a reference document?

Yes, absolutely. We have a couple of FileConverters in Haystack. As you are working with a variety of file formats, I would recommend using the TikaConverter (based on Apache Tika). As of now, this is the sketched workflow:

  1. Start Apache Tika service in the background
    docker run -d -p 9998:9998 apache/tika
  2. Connect Haystack to Tika & convert the files to Python dicts
    >>> from pathlib import Path
    >>> from haystack.file_converters.tika import TikaConverter
    >>> tika_converter = TikaConverter(
    ...     tika_url="http://localhost:9998/tika",
    ...     remove_numeric_tables=False,
    ...     remove_whitespace=False,
    ...     remove_empty_lines=False,
    ...     remove_header_footer=False,
    ...     valid_languages=None,
    ... )
    >>> dicts = []
    >>> for path in filepaths:  # e.g. [Path("test/samples/pdf/sample_pdf_1.pdf"), ...]
    ...     cur_dict = tika_converter.convert(file_path=path)
    ...     dicts.append(cur_dict)
    >>> dicts
    [{
        "text": "everything on page one \f then page two \f ...",
        "meta": {"Content-Type": "application/pdf", "Creation-Date": "2020-06-02T12:27:28Z", ...}
    }, ...]
  3. Write all dicts to your document store (similar to the tutorial notebook)
    document_store.write_documents(dicts)
  4. Continue like in the tutorial and initialize Reader, Retriever + Finder; a minimal sketch follows below.
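
A minimal sketch of step 4, assuming the Haystack 0.x tutorial API of the time (the Finder interface was later replaced by Pipelines); the model name, question, and top-k values are illustrative:

    # Sketch of step 4 (Haystack 0.x API); `document_store` is the store
    # the dicts were written to in step 3.
    from haystack import Finder
    from haystack.reader.farm import FARMReader
    from haystack.retriever.sparse import ElasticsearchRetriever

    retriever = ElasticsearchRetriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

    finder = Finder(reader, retriever)
    prediction = finder.get_answers(
        question="What is on page two?",  # illustrative query
        top_k_retriever=10,
        top_k_reader=5,
    )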


ii) Can I train these documents on a pretrained model, just to get the embeddings without using them for any downstream task?

I am not fully understanding this. Can you please try to rephrase, or explain in more detail what you want to achieve here?

dsvrsec commented 3 years ago

Thanks. i) Is there any way to use the feature without Docker? ii) The use case I am working on is preparing QA pairs for training the bot. I want to train these documents on a pretrained model so that I can extract domain-dependent answers, and also use the embeddings for subject extraction for better QA pairs. Could you suggest an approach?

tholor commented 3 years ago

i) Is there any way to use the feature without Docker?
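
Tika itself only needs a JVM, so one option is to download the standalone tika-server jar from the Apache Tika downloads page and launch it directly (the version number below is illustrative). It listens on port 9998 by default, so the `tika_url` above stays the same:

    java -jar tika-server-1.24.1.jar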

ii) The use case I am working on is preparing QA pairs for training the bot. I want to train these documents on a pretrained model so that I can extract domain-dependent answers

If I understand it correctly, you want to fine-tune a pre-trained QA model on your own domain data. Please see our documentation for details: https://haystack.deepset.ai/en/docs/domain_adaptationmd
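
As a minimal sketch of what the fine-tuning looks like in code, assuming the FARMReader API from that documentation (the paths and file names below are placeholders for your own SQuAD-format annotations, e.g. created with the Haystack annotation tool):

    # Fine-tune a pre-trained QA model on domain-specific annotations.
    from haystack.reader.farm import FARMReader

    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
    reader.train(
        data_dir="data/my_domain",      # placeholder: folder with your annotations
        train_filename="answers.json",  # placeholder: SQuAD-format QA pairs
        n_epochs=1,
        save_dir="my_domain_model",
    )

The fine-tuned model saved in `my_domain_model` can then be loaded into a FARMReader like any other model.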

and also use the embeddings for subject extraction for better QA pairs. Could you suggest an approach?

Not sure what you mean by "subject extraction for better QA pairs". You could use a dense retriever (that uses embeddings) to "identify" candidate documents for your QA model. If you are doing your first steps in Haystack and QA, I would recommend starting with the ElasticsearchRetriever and only later moving towards dense retrievers.

dsvrsec commented 3 years ago

Thanks for the response.

I want to train the model with the domain documents and get the embeddings, so that I can extract the main keywords (by clustering) or subjects, and index the documents with the corresponding keywords.

tholor commented 3 years ago

If you want to get document embeddings:
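
A minimal sketch, assuming sentence-transformers (a model family Haystack's EmbeddingRetriever can also load); the model name and cluster count are illustrative:

    # Embed the converted documents, then cluster the embeddings so each
    # cluster id can act as a coarse "subject" label for indexing.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    texts = [d["text"] for d in dicts]  # the converted documents from above

    model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
    embeddings = model.encode(texts)

    kmeans = KMeans(n_clusters=5, random_state=0).fit(embeddings)
    for doc, label in zip(dicts, kmeans.labels_):
        doc.setdefault("meta", {})["subject_cluster"] = int(label)

From there you could pick representative keywords per cluster (e.g. by TF-IDF over each cluster's texts) and store them in the documents' meta fields before writing them to the document store.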

tholor commented 3 years ago

Did this help @dsvrsec ?

dsvrsec commented 3 years ago

Did this help @dsvrsec ?

Yeah, thank you.