Azure / gpt-rag-ingestion

Ingestion pipeline #81

Closed placerda closed 2 months ago

placerda commented 2 months ago

This pull request updates the chunking process by introducing a dedicated chunking module, along with configuration files, to make the project more flexible. These changes make it easier to extend existing chunkers and add new ones. It also includes minor improvements: adding Visual Studio Code configuration files and cleaning up the project by removing unused code, including several utility functions and classes.

Chunking updates

The function app now includes several specialized chunkers to process different file types.

These chunkers include:

  1. TranscriptionChunker: Designed for .vtt files, this chunker processes transcription data to generate meaningful chunks suitable for further analysis.
  2. SpreadsheetChunker: Handles .xlsx files by segmenting spreadsheet data into manageable portions for processing.
  3. DocAnalysisChunker: Chunks content after running Azure Document Intelligence analysis. It applies to .pdf files and to image formats such as .png, .jpeg, .jpg, .bmp, and .tiff, as well as to .docx and .pptx files (when using the Document Intelligence API 4.0). With the Document Intelligence 4.0 API, it also uses the markdown output to improve the context of each chunk.
  4. LangChainChunker: A fallback chunker used when the file type doesn't match any of the specific chunkers above.
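
The selection rules in the list above can be sketched as a lookup on the file extension. This is only an illustration under stated assumptions, not the code in this pull request: the function name `select_chunker`, the `docint_40_api` flag, and the constant names are hypothetical, and the chunkers are represented by their names rather than by real classes.

```python
import os

# Hypothetical tables built from the list above; the real module may
# organize this differently.
SPECIALIZED_CHUNKERS = {
    "vtt": "TranscriptionChunker",
    "xlsx": "SpreadsheetChunker",
}
DOC_ANALYSIS_EXTENSIONS = {"pdf", "png", "jpeg", "jpg", "bmp", "tiff"}
# Formats that the list says require the Document Intelligence API 4.0.
DOC_ANALYSIS_V4_EXTENSIONS = {"docx", "pptx"}


def select_chunker(filename: str, docint_40_api: bool = False) -> str:
    """Return the name of the chunker that would handle `filename`.

    `docint_40_api` is an assumed flag indicating whether the
    Document Intelligence API 4.0 is in use.
    """
    # Normalize the extension: ".PDF" -> "pdf", no extension -> "".
    extension = os.path.splitext(filename)[1].lower().lstrip(".")

    if extension in SPECIALIZED_CHUNKERS:
        return SPECIALIZED_CHUNKERS[extension]
    if extension in DOC_ANALYSIS_EXTENSIONS:
        return "DocAnalysisChunker"
    if docint_40_api and extension in DOC_ANALYSIS_V4_EXTENSIONS:
        return "DocAnalysisChunker"
    # Anything else falls through to the fallback chunker.
    return "LangChainChunker"
```

Under these assumptions, `select_chunker("slides.pptx")` falls back to `LangChainChunker` unless `docint_40_api=True` is passed, which mirrors the 4.0-only condition noted for .docx and .pptx files.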

Index updates