Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

[Question] working with other file formats and file sources #633

Open mratanusarkar opened 1 year ago

mratanusarkar commented 1 year ago

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Question

  1. Is it possible to work with files other than PDF, such as text files, Word documents, Excel spreadsheets, and even code files like .py, .c, .cpp, .java, and other programming files, or any file from any data source? It would be even better if I could point to a data source (e.g. a drive, a cloud resource link, or a GitHub repo).

  2. Instead of running azd init -t azure-search-openai-demo, is it possible to run the init from a fork of this codebase that has its own custom files for a specific use case in the data folder? That way, when the app gets deployed, the blob storage will contain different PDF files and the system will be tuned to the data for that custom use case.

  3. With reference to this answer from the FAQ:

    How can we upload additional PDFs without redeploying everything?

I see a data folder inside the deployed web app (via the App Service Kudu terminal) and also in the storage blob container named "content". I don't have access to ./scripts/prepdocs.sh or ./scripts/prepdocs.ps1, as the deployment is currently done by other team members. How do I change the system so that the content comes from my own files and not the existing ones from the demo? (I believe that dumping my files into the blob container or into the App Service data folder won't help, as the files are indexed and converted into some vector DB so that the model gets the reference.)

OS and Version?

  • Windows 10 Enterprise
  • Azure Cloud Shell --> Bash

azd version?

run azd version and copy paste here.

Versions

azd version 1.2.0

Pitrak commented 1 year ago

I think that is something I would also need. I would like this model to use my C++, C#, Java, etc. code as context for answering questions.

pamelafox commented 1 year ago

The data ingestion script in this repo (prepdocs.py) only supports PDFs. There's support for a few other file formats in another repo (https://github.com/microsoft/sample-app-aoai-chatGPT/blob/main/scripts/data_utils.py). You can also check out open-source tools like LlamaIndex (https://www.llamaindex.ai/) to see if they can be used with your data format. The basic idea is to split your text into "chunks" of a reasonable size and store them in the search index.

I'm afraid I don't understand the other questions. Can you rephrase them? You can use your own documents by replacing what's in the data folder. You can remove existing documents by passing --removeall to prepdocs.py in prepdocs.sh/prepdocs.ps1.
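
For plain-text formats such as source code, the "split into chunks" idea mentioned above could be sketched roughly as follows. This is only an illustration, not how prepdocs.py works: the chunk size, the overlap, the list of extensions, and the id/content/sourcefile field names are all assumptions made up for the example, and uploading the resulting documents to the search index is not shown.

```python
from pathlib import Path
from typing import Iterator

# Hypothetical values chosen for illustration; they are not taken from prepdocs.py.
MAX_CHUNK_CHARS = 1000
OVERLAP_CHARS = 100
TEXT_EXTENSIONS = {".txt", ".md", ".py", ".c", ".cpp", ".java", ".cs"}


def split_text(text: str, max_chars: int = MAX_CHUNK_CHARS,
               overlap: int = OVERLAP_CHARS) -> Iterator[str]:
    """Yield overlapping chunks of at most max_chars characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to end the chunk on a line break, if one lies past the overlap.
            newline = text.rfind("\n", start + overlap + 1, end)
            if newline != -1:
                end = newline + 1
        yield text[start:end]
        if end == len(text):
            break
        # Step back so that neighbouring chunks share some context.
        start = end - overlap


def chunk_file(path: Path) -> list[dict]:
    """Turn one text file into a list of documents (field names are hypothetical)."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return [
        {"id": f"{path.stem}-{i}", "content": chunk, "sourcefile": path.name}
        for i, chunk in enumerate(split_text(text))
    ]


def chunk_folder(folder: Path) -> list[dict]:
    """Chunk every file under folder whose extension looks like plain text."""
    docs: list[dict] = []
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.suffix.lower() in TEXT_EXTENSIONS:
            docs.extend(chunk_file(path))
    return docs
```

Calling chunk_folder(Path("data")) would return a list of dictionaries, one per chunk. Binary formats such as Word or Excel files would first need a text-extraction step, which this sketch does not attempt.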

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.