This pull request reworks the chunking process by introducing the chunking module, along with configuration files, to make the project more flexible. These changes make it easier to extend existing chunkers and add new ones. It also includes minor improvements: adding Visual Studio Code configuration files and cleaning up the project by removing unused code, including several utility functions and classes.
Chunking updates
The function app now includes several specialized chunkers to process different file types.
These chunkers include:
TranscriptionChunker: Designed for .vtt files, this chunker processes transcription data to generate meaningful chunks suitable for further analysis.
SpreadsheetChunker: Handles .xlsx files by segmenting spreadsheet data into manageable portions for processing.
DocAnalysisChunker: Chunks documents after running Azure Document Intelligence analysis. Applies to .pdf, .png, .jpeg, .jpg, .bmp, and .tiff files, as well as .docx and .pptx files (when using the Document Intelligence API 4.0). With the Document Intelligence 4.0 API, it also uses the Markdown output to improve chunk context.
LangChainChunker: A fallback chunker that is utilized when the file type doesn't match the specific chunkers mentioned above.
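The selection logic described above can be sketched as a lookup on the file extension. This is a minimal illustration, not the module's actual code: the function name `select_chunker_name`, the `docint_40_api` flag, and the extension sets as constants are assumptions made for the example; only the chunker names and the extensions come from this description.

```python
import os

# Extension groups taken from the chunker list above. Organizing them as
# module-level sets is an assumption made for this sketch.
TRANSCRIPTION_EXTENSIONS = {".vtt"}
SPREADSHEET_EXTENSIONS = {".xlsx"}
DOC_ANALYSIS_EXTENSIONS = {".pdf", ".png", ".jpeg", ".jpg", ".bmp", ".tiff"}
# Handled by DocAnalysisChunker only with the Document Intelligence API 4.0.
DOC_ANALYSIS_V4_EXTENSIONS = {".docx", ".pptx"}


def select_chunker_name(filename: str, docint_40_api: bool = False) -> str:
    """Return the name of the chunker that would handle `filename`.

    Hypothetical helper: `docint_40_api` stands in for whatever setting
    tells the app that the Document Intelligence 4.0 API is in use.
    """
    ext = os.path.splitext(filename)[1].lower()
    if ext in TRANSCRIPTION_EXTENSIONS:
        return "TranscriptionChunker"
    if ext in SPREADSHEET_EXTENSIONS:
        return "SpreadsheetChunker"
    if ext in DOC_ANALYSIS_EXTENSIONS:
        return "DocAnalysisChunker"
    if docint_40_api and ext in DOC_ANALYSIS_V4_EXTENSIONS:
        return "DocAnalysisChunker"
    # Anything else falls back to the generic chunker.
    return "LangChainChunker"


if __name__ == "__main__":
    for name in ("meeting.vtt", "budget.xlsx", "scan.PDF", "slides.pptx", "notes.txt"):
        print(name, "->", select_chunker_name(name, docint_40_api=True))
```

Keeping the mapping in data rather than in nested conditionals is one way to make adding a new chunker a small change, which matches the stated goal of the PR.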
Index updates
Added new fields to the index (relatedFiles, relatedImages, and summary) to support more advanced retrieval scenarios.
Set security_id as a fixed field in the index, even if it is not used initially.
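As a rough illustration of what a fixed security_id field means for indexed chunks, the sketch below builds a document that always carries the field, even when it is empty. The function name `build_index_document`, the `id` and `content` keys, and the field types (strings and lists of strings) are assumptions for the example; only relatedFiles, relatedImages, summary, and security_id are named in this description.

```python
from typing import Any, Dict, List, Optional


def build_index_document(
    chunk_id: str,
    content: str,
    summary: str = "",
    related_files: Optional[List[str]] = None,
    related_images: Optional[List[str]] = None,
    security_id: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build one index document for a chunk (hypothetical helper).

    Field types are assumed. security_id is always emitted so the
    document shape stays stable whether or not the field is used.
    """
    return {
        "id": chunk_id,          # assumed key name
        "content": content,      # assumed key name
        "summary": summary,
        "relatedFiles": related_files or [],
        "relatedImages": related_images or [],
        "security_id": security_id or [],
    }


if __name__ == "__main__":
    print(build_index_document("chunk-1", "example text"))
```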
[!IMPORTANT]
This pull request should be merged in sync with https://github.com/Azure/GPT-RAG/pull/183 because of the role assignments needed by the Data Ingestion Function App.