Closed dsvrsec closed 4 years ago
Hey @dsvrsec
i) If I have a set of documents (pdf, ppt, docx, xlsx), is there any way I can use Haystack to build a query system? If so, can you please provide a reference document?
Yes, absolutely. We have a couple of FileConverters in Haystack. As you are working with a variety of file formats, I would recommend using the TikaConverter (based on Apache Tika). For now, the workflow looks roughly like this:
docker run -d -p 9998:9998 apache/tika
>>> from pathlib import Path
>>> from haystack.file_converters.tika import TikaConverter
>>> tika_converter = TikaConverter(
...     tika_url="http://localhost:9998/tika",
...     remove_numeric_tables=False,
...     remove_whitespace=False,
...     remove_empty_lines=False,
...     remove_header_footer=False,
...     valid_languages=None,
... )
>>> # filepaths is your own list of files, e.g. ["test/samples/pdf/sample_pdf_1.pdf", ...]
>>> dicts = []
>>> for path in filepaths:
...     # Convert each file in turn (the loop variable was previously unused)
...     cur_dict = tika_converter.convert(file_path=Path(path))
...     dicts.append(cur_dict)
...
>>> dicts
[{
    'text': 'everything on page one \f then page two \f ...',
    'meta': {'Content-Type': 'application/pdf', 'Creation-Date': '2020-06-02T12:27:28Z', ...}
}, ...]
>>> document_store.write_documents(dicts)
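The loop above assumes you already have a `filepaths` list. In case it helps, here is a minimal sketch of one way to build it with the standard library only. The helper name `collect_filepaths` and the `SUPPORTED_SUFFIXES` set are made up for this example and are not part of Haystack; the suffixes are simply the formats mentioned in the question.

```python
from pathlib import Path

# Formats mentioned in the question; extend this set for your own corpus.
SUPPORTED_SUFFIXES = {".pdf", ".ppt", ".docx", ".xlsx"}


def collect_filepaths(root):
    """Return all files below `root` whose suffix is in SUPPORTED_SUFFIXES.

    Hypothetical helper for illustration, not a Haystack function.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")  # walk the directory tree recursively
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

You could then call `filepaths = collect_filepaths("path/to/your/docs")` before running the conversion loop.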
Side notes:
ii) Can I train these documents on a pretrained model, just to get the embeddings, without using them for any downstream tasks?
I don't fully understand this. Can you please rephrase or explain in more detail what you want to achieve here?
Thanks. i) Is there any way to use the feature without Docker? ii) The use case I am working on is preparing QA pairs for training the bot. I want to train these documents on a pretrained model so that I can extract domain-dependent answers, and I can also use the embeddings for subject extraction for better QA pairs. Can you also suggest an approach?
i) Is there any way to use the feature without Docker?
ii) The use case I am working on is preparing QA pairs for training the bot. I want to train these documents on a pretrained model so that I can extract domain-dependent answers
If I understand it correctly, you want to fine-tune a pre-trained QA model on your own domain data. Please see our documentation for details: https://haystack.deepset.ai/en/docs/domain_adaptationmd
and I can also use the embeddings for subject extraction for better QA pairs. Can you also suggest an approach?
Not sure what you mean by "subject extraction for better QA pairs". You could use a dense retriever (which uses embeddings) to "identify" candidate documents for your QA model. If you are taking your first steps in Haystack and QA, I would recommend starting with the ElasticsearchRetriever and only later moving towards dense retrievers.
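To make the idea of a dense retriever "identifying" candidates a bit more concrete, here is a minimal sketch in plain Python. It is not Haystack's implementation: the vectors are made-up 3-dimensional stand-ins for real model embeddings, the function names are hypothetical, and cosine similarity is used only for illustration (real retrievers may score with a dot product instead).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid division by zero for all-zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    scored = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [idx for idx, _ in scored[:k]]


# Made-up vectors standing in for document and query embeddings.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query_vec = [1.0, 0.0, 0.0]
candidates = top_k(query_vec, doc_vecs, k=2)  # indices of candidate documents
```

The documents at the returned indices would then be handed to the QA model (the Reader) as candidates.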
Thanks for the response.
I want to train the model with the domain documents and get the embeddings so that I can extract the main keywords (by clustering) or subjects, and then index the documents with the corresponding keywords.
If you want to get document embeddings:
from haystack import Document
from haystack.retriever.dense import EmbeddingRetriever  # import path assumed for this Haystack version
docs = [Document(text="some text in your doc")]
...
retriever = EmbeddingRetriever(document_store=your_document_store, embedding_model="deepset/sentence_bert")
embedding = retriever.embed_passages(docs)
or
from haystack import Document
from haystack.retriever.dense import DensePassageRetriever  # import path assumed for this Haystack version
docs = [Document(text="some text in your doc")]
...
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base")
embedding = retriever.embed_passages(docs)
If you want to get word embeddings: This is currently not in scope for Haystack. You could of course dig a bit deeper into the Reader or Retriever models and extract embeddings manually from there, but this will require some custom coding on your side.
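On the clustering idea mentioned earlier in the thread: once you have one embedding per document, grouping them is a separate step outside Haystack. Below is a rough sketch of k-means in plain Python, just to show the mechanics. The function names are hypothetical, the 2-dimensional vectors are made up, and the initialisation (first k vectors) is chosen only to keep the example deterministic. For real data a library implementation such as scikit-learn's `KMeans` is likely the better choice.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Tiny k-means sketch: returns a cluster index for each vector.

    Centroids start as the first k vectors, which is deterministic
    but not a good initialisation for real data.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Made-up vectors standing in for document embeddings.
embeddings = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels = kmeans(embeddings, k=2)  # one cluster index per document
```

Documents that share a label could then be inspected together to pick keywords for that group; how to derive the keywords themselves is left open here.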
Did this help @dsvrsec ?
Yeah,thank you
Question: I have a few queries after checking out the Colab notebook. i) If I have a set of documents (pdf, ppt, docx, xlsx), is there any way I can use Haystack to build a query system? If so, can you please provide a reference document?
ii) Can I train these documents on a pretrained model, just to get the embeddings, without using them for any downstream tasks?