deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Using haystack with documents #394

Closed: dsvrsec closed this issue 3 years ago

dsvrsec commented 4 years ago

Question: I have a few queries after checking out the Colab notebook. i) If I have a set of documents (pdf, ppt, docx, xlsx), is there any way I can use Haystack to build a query system? If so, can you please provide a reference document?

ii) Can I train these documents on a pretrained model, just to get the embeddings without using them for any downstream task?


tholor commented 4 years ago

Hey @dsvrsec

i) If I have a set of documents (pdf, ppt, docx, xlsx), is there any way I can use Haystack to build a query system? If so, can you please provide a reference document?

Yes, absolutely. We have a couple of FileConverters in Haystack. As you are working with a variety of file formats, I would recommend using the TikaConverter (based on Apache Tika). As of now, this is the sketched workflow:

  1. Start Apache Tika service in the background
    docker run -d -p 9998:9998 apache/tika
  2. Connect Haystack to Tika & convert the files to Python dicts
    >>> from pathlib import Path
    >>> from haystack.file_converters.tika import TikaConverter
    >>> tika_converter = TikaConverter(
    ...     tika_url="http://localhost:9998/tika",
    ...     remove_numeric_tables=False,
    ...     remove_whitespace=False,
    ...     remove_empty_lines=False,
    ...     remove_header_footer=False,
    ...     valid_languages=None,
    ... )
    >>> dicts = []
    >>> for path in filepaths:  # e.g. [Path("test/samples/pdf/sample_pdf_1.pdf"), ...]
    ...     cur_dict = tika_converter.convert(file_path=path)
    ...     dicts.append(cur_dict)
    >>> dicts
    [{
        "text": "everything on page one \f then page two \f ...",
        "meta": {"Content-Type": "application/pdf", "Creation-Date": "2020-06-02T12:27:28Z", ...}
    }, ...]
  3. Write all dicts to your document store (similar to the tutorial notebook)
    document_store.write_documents(dicts)
  4. Continue like in the tutorial and initialize Reader, Retriever + Finder; a minimal sketch follows below.
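
A minimal sketch of step 4, assuming the Haystack 0.x tutorial API of the time (the Finder interface was later replaced by Pipelines); the model name, question, and top-k values are illustrative:

    # Sketch of step 4 (Haystack 0.x API); `document_store` is the store
    # the dicts were written to in step 3.
    from haystack import Finder
    from haystack.reader.farm import FARMReader
    from haystack.retriever.sparse import ElasticsearchRetriever

    retriever = ElasticsearchRetriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

    finder = Finder(reader, retriever)
    prediction = finder.get_answers(
        question="What is on page two?",  # illustrative query
        top_k_retriever=10,
        top_k_reader=5,
    )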


ii) Can I train these documents on a pretrained model, just to get the embeddings without using them for any downstream task?

I am not fully understanding this. Can you please try to rephrase, or explain in more detail what you want to achieve here?

dsvrsec commented 3 years ago

Thanks. i) Is there any way to use the feature without Docker? ii) The use case I am working on is preparing QA pairs for training the bot. I want to train these documents on a pretrained model so that I can extract domain-dependent answers, and also use the embeddings for subject extraction for better QA pairs. Could you suggest an approach?

tholor commented 3 years ago

i) Is there any way to use the feature without Docker?
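
Tika itself only needs a JVM, so one option is to download the standalone tika-server jar from the Apache Tika downloads page and launch it directly (the version number below is illustrative). It listens on port 9998 by default, so the `tika_url` above stays the same:

    java -jar tika-server-1.24.1.jar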

ii) The use case I am working on is preparing QA pairs for training the bot. I want to train these documents on a pretrained model so that I can extract domain-dependent answers

If I understand it correctly, you want to fine-tune a pre-trained QA model on your own domain data. Please see our documentation for details: https://haystack.deepset.ai/en/docs/domain_adaptationmd
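
As a minimal sketch of what the fine-tuning looks like in code, assuming the FARMReader API from that documentation (the paths and file names below are placeholders for your own SQuAD-format annotations, e.g. created with the Haystack annotation tool):

    # Fine-tune a pre-trained QA model on domain-specific annotations.
    from haystack.reader.farm import FARMReader

    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
    reader.train(
        data_dir="data/my_domain",      # placeholder: folder with your annotations
        train_filename="answers.json",  # placeholder: SQuAD-format QA pairs
        n_epochs=1,
        save_dir="my_domain_model",
    )

The fine-tuned model saved in `my_domain_model` can then be loaded into a FARMReader like any other model.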

and also use the embeddings for subject extraction for better QA pairs. Could you suggest an approach?

Not sure what you mean by "subject extraction for better QA pairs". You could use a dense retriever (that uses embeddings) to "identify" candidate documents for your QA model. If you are doing your first steps in Haystack and QA, I would recommend starting with the ElasticsearchRetriever and only later moving towards dense retrievers.

dsvrsec commented 3 years ago

Thanks for the response.

I want to train the model with the domain documents and get the embeddings, so that I can extract the main keywords (by clustering) or subjects, and index the documents with the corresponding keywords.

tholor commented 3 years ago

If you want to get document embeddings:
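
A minimal sketch, assuming sentence-transformers (a model family Haystack's EmbeddingRetriever can also load); the model name and cluster count are illustrative:

    # Embed the converted documents, then cluster the embeddings so each
    # cluster id can act as a coarse "subject" label for indexing.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    texts = [d["text"] for d in dicts]  # the converted documents from above

    model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
    embeddings = model.encode(texts)

    kmeans = KMeans(n_clusters=5, random_state=0).fit(embeddings)
    for doc, label in zip(dicts, kmeans.labels_):
        doc.setdefault("meta", {})["subject_cluster"] = int(label)

From there you could pick representative keywords per cluster (e.g. by TF-IDF over each cluster's texts) and store them in the documents' meta fields before writing them to the document store.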

tholor commented 3 years ago

Did this help @dsvrsec ?

dsvrsec commented 3 years ago

Did this help @dsvrsec ?

Yeah, thank you.