Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

MD5 at DataLake is not available for ACL Solution #1742

Open singloudly90 opened 5 months ago

singloudly90 commented 5 months ago

Please provide us with the following information:

I understand that checks were added to see what has been uploaded before. The prepdocs script now writes an .md5 file with an MD5 hash of each file that gets uploaded. Whenever the prepdocs script is re-run, that hash is checked against the current hash, and the file is skipped if it hasn't changed.
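To make the behavior described above concrete, here is a minimal sketch of a sidecar-hash check. It is not the actual prepdocs code: the function names `file_md5` and `should_skip` and the `<filename>.md5` sidecar naming are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_skip(path: Path) -> bool:
    """Hypothetical helper: True if the file matches its stored hash.

    The sidecar name (<filename>.md5 next to the file) is an assumption.
    When the hash is missing or different, the sidecar is (re)written
    and False is returned so the caller uploads the file.
    """
    hash_path = path.with_name(path.name + ".md5")
    current = file_md5(path)
    if hash_path.exists():
        stored = hash_path.read_text(encoding="utf-8").strip()
        if stored == current:
            return True
    hash_path.write_text(current, encoding="utf-8")
    return False
```

A caller would loop over the local files and upload only those for which `should_skip` returns `False`. Note that this only works when the sidecar file lives somewhere the script can read on the next run, which is the gap this issue is about.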

However, when I tried the ACL solution, I noticed that the MD5 file was not created as it is in the solution without ACL. Correct me if I am wrong:

- Without ACL: files are uploaded from a local folder, the MD5 is generated in the local folder, and the files are uploaded to Blob storage and to the AI Search index.
- With ACL: files are uploaded from a local folder to the data lake, and then from the data lake to AI Search.

These two solutions differ in how files are processed...

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Any log messages given by the failure

Expected/desired behavior

With the ACL solution: upload files from a local folder to the data lake, generate the MD5 in the data lake, then go from the data lake to Blob storage and to AI Search.

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

pamelafox commented 5 months ago

cc @mattgotteiner

So is your goal to be able to repeatedly re-run prepdocs to pick up new files in ADLS2, without having to re-index existing files? I think we'd probably want to implement https://github.com/Azure-Samples/azure-search-openai-demo/pull/942 for both normal Blob storage and ADLS2, which would mean the MD5 would be stored in the blob itself, and we'd check against that.
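As a rough sketch of what "check against the MD5 stored in the blob" could look like, the comparison itself needs only the local bytes and the hash stored on the remote object. The function name `needs_upload` is hypothetical, and how `stored_md5` is fetched is left out. With the `azure-storage-blob` package it would presumably come from the blob's content settings, but that is an assumption here, not something the linked PR is confirmed to do.

```python
import hashlib
from typing import Optional, Union


def needs_upload(
    local_bytes: bytes,
    stored_md5: Optional[Union[bytes, bytearray]],
) -> bool:
    """Hypothetical helper: decide whether a file must be (re)uploaded.

    stored_md5 is the raw 16-byte MD5 recorded on the remote blob, or
    None if the blob does not exist or has no hash recorded. Fetching
    it from storage is outside the scope of this sketch.
    """
    if stored_md5 is None:
        return True
    return hashlib.md5(local_bytes).digest() != bytes(stored_md5)
```

Because the hash travels with the blob, this check would work the same way for normal Blob storage and for ADLS2, and it does not depend on a local .md5 file.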

RCGEnableBigDataDeveloper commented 4 months ago

@pamelafox this could be a great feature, since in production the docs sit somewhere on the lake, where other systems may be able to drop files.