A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
For local files, MD5 checks are conducted using pre-generated .md5 files. However, when working with the data lake, no such checks are implemented, resulting in each file being reprocessed even if it remains unchanged.
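For reference, the local-file check amounts to something like the sketch below (names are illustrative, not the actual prepdocs implementation); no equivalent step currently runs for data lake files:

```python
import hashlib
from pathlib import Path

def local_file_is_unchanged(path: Path) -> bool:
    """Hypothetical sketch: compare the file's current MD5 with the value
    stored in a pre-generated <name>.md5 file next to it."""
    md5_file = path.with_suffix(path.suffix + ".md5")
    if not md5_file.exists():
        return False  # no stored hash, treat the file as new or changed
    stored_md5 = md5_file.read_text().strip()
    current_md5 = hashlib.md5(path.read_bytes()).hexdigest()
    return stored_md5 == current_md5
```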
The data lake file strategy should skip files whose MD5 hash already exists in both the data lake and the content storage/index. To support this, the data lake and the blob content storage (when the --skipblobs option is not used) should store an MD5 value as blob metadata. Before adding or downloading a file, that metadata can be checked by retrieving the blob's MD5 and running an index query against it. Ideally, all chunks of a file would carry the same MD5 in the index, so a single match is enough to confirm that the file is already known. Alternatively, the MD5 can be stored as metadata on the "copied" blob.
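A rough sketch of that flow (assuming a filterable `md5` field in the index and an `md5` metadata key on the blobs, neither of which exists in the current schema) could run before a file is downloaded or chunked:

```python
import hashlib
from azure.search.documents import SearchClient
from azure.storage.blob import BlobClient

def upload_with_md5(blob: BlobClient, data: bytes) -> None:
    # Store the MD5 alongside the blob so later runs can compare cheaply.
    md5 = hashlib.md5(data).hexdigest()
    blob.upload_blob(data, overwrite=True, metadata={"md5": md5})

def already_ingested(search_client: SearchClient, blob: BlobClient) -> bool:
    # Read the MD5 from blob metadata, then ask the index whether any chunk
    # already carries that hash; a single hit is enough to skip the file.
    metadata = blob.get_blob_properties().metadata or {}
    md5 = metadata.get("md5")
    if not md5:
        return False
    results = search_client.search(search_text="*", filter=f"md5 eq '{md5}'", top=1)
    return any(True for _ in results)
```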
Additionally, an optional prepdocs parameter that processes only files whose source update (touch) date is newer than a given threshold would be useful for testing. This would allow the data lake to be queried based on the timestamp persisted from the last import or job run.
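A minimal sketch of that filter (the flag name and helper are hypothetical, not existing prepdocs options) might look like:

```python
from datetime import datetime, timezone
from azure.storage.filedatalake import FileSystemClient

def paths_updated_since(filesystem: FileSystemClient, since: datetime):
    """Yield data lake file paths modified after the given threshold.

    Sketch for a hypothetical option such as --modified-since; the name
    is illustrative and not an existing prepdocs parameter.
    """
    for path in filesystem.get_paths(recursive=True):
        if path.is_directory:
            continue
        if path.last_modified and path.last_modified > since:
            yield path.name

# Example: only process files touched since the last recorded job run.
# last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
# for name in paths_updated_since(fs_client, last_run):
#     print(name)
```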