This pull request fixes bugs introduced by the latest updates, which added optimized xlsx and transcription chunking and introduced the ChunkerFactory approach.
It also improves the handling of different file formats, optimizes the chunking process, and enhances logging for easier debugging. The most important changes are removing support for certain file formats from the LangChainChunker, modifying the chunking logic, and adding a new environment variable.
File Format Handling:
Removed support for epub, rtf, docx, doc, pptx, ppt, msg, and pdf file formats from the LangChainChunker. (chunking/chunkers/langchain_chunker.py, README.md) [1][2][3]
Chunking Logic:
Updated SpreadsheetChunker to use a new environment variable SPREADSHEET_NUM_TOKENS for the maximum chunk size and added detailed logging for chunk creation. (chunking/chunkers/spreadsheet_chunker.py) [1][2][3]
Modified LangChainChunker to download and decode blob data before chunking. (chunking/chunkers/langchain_chunker.py) [1][2][3]
Adjusted TranscriptionChunker to process VTT files directly from blob data and updated chunking logic. (chunking/chunkers/transcription_chunker.py) [1][2][3][4]
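To illustrate the "decode blob data, then chunk" flow described for LangChainChunker and TranscriptionChunker, here is a minimal sketch. It is not the code from this PR: the helper names `decode_blob_bytes` and `vtt_to_text`, the encoding fallback order, and the simplified VTT handling are all assumptions, and the bytes are assumed to have already been downloaded from blob storage.

```python
import re

# Matches a WebVTT cue timing line such as "00:00:01.000 --> 00:00:04.000".
CUE_TIMING = re.compile(r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+")


def decode_blob_bytes(data: bytes, encodings=("utf-8-sig", "cp1252")) -> str:
    """Hypothetical helper: decode raw blob bytes, trying each encoding in order."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of failing.
    return data.decode("utf-8", errors="replace")


def vtt_to_text(vtt: str) -> str:
    """Hypothetical helper: keep only the spoken text of a VTT transcript.

    Simplification: purely numeric lines are treated as cue identifiers,
    and NOTE/STYLE blocks are not handled.
    """
    kept = []
    for raw_line in vtt.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("WEBVTT"):
            continue
        if line.isdigit() or CUE_TIMING.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    blob = (
        "WEBVTT\n\n"
        "1\n00:00:01.000 --> 00:00:04.000\nHello there.\n\n"
        "2\n00:00:04.500 --> 00:00:06.000\nGeneral Kenobi.\n"
    ).encode("utf-8")
    print(vtt_to_text(decode_blob_bytes(blob)))
```

Decoding once, up front, means the chunker works on plain text and never needs a temporary file on disk.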
Environment Variables:
Introduced SPREADSHEET_NUM_TOKENS in local.settings.json.template to control the maximum chunk size for spreadsheet chunking. (local.settings.json.template)
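As a rough sketch of how a chunker might consume this setting, the snippet below reads SPREADSHEET_NUM_TOKENS from the environment and falls back to a default when the value is missing or invalid. The function name `get_spreadsheet_max_tokens` and the default of 2048 are placeholders chosen for illustration, not values taken from this PR.

```python
import os

# Placeholder default for illustration only; the real default is not stated in this PR.
DEFAULT_SPREADSHEET_NUM_TOKENS = 2048


def get_spreadsheet_max_tokens(env=None) -> int:
    """Hypothetical helper: read SPREADSHEET_NUM_TOKENS as a positive integer."""
    env = os.environ if env is None else env
    raw = env.get("SPREADSHEET_NUM_TOKENS", "")
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # Missing or non-numeric setting: use the default.
        return DEFAULT_SPREADSHEET_NUM_TOKENS
    # Zero or negative limits make no sense for a chunk size.
    return value if value > 0 else DEFAULT_SPREADSHEET_NUM_TOKENS


if __name__ == "__main__":
    print(get_spreadsheet_max_tokens({"SPREADSHEET_NUM_TOKENS": "512"}))
```

Passing the environment as an optional mapping keeps the helper easy to test without modifying `os.environ`.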
Logging and Error Handling:
Enhanced logging for error messages and chunking processes to provide more detailed information. (chunking/document_chunking.py, function_app.py) [1][2][3]
Retry Mechanism:
Increased the maximum number of retries for rate limit errors in AzureOpenAIClient from 3 to 10. (tools/aoai.py) [1][2]
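For context on what a higher retry limit means in practice, here is a generic sketch of retrying on rate limit errors with exponential backoff. It does not reproduce the logic in tools/aoai.py: `RateLimitError` is a stand-in exception, and `call_with_retries`, the delay values, and the jitter are assumptions.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the rate limit exception raised by the client (hypothetical)."""


def call_with_retries(func, max_retries=10, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying up to max_retries times on RateLimitError."""
    attempt = 0
    while True:
        try:
            return func()
        except RateLimitError:
            if attempt >= max_retries:
                # Retries exhausted: surface the error to the caller.
                raise
            # Exponential backoff capped at max_delay, plus up to 10% jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= 3:
            raise RateLimitError("429")
        return "ok"

    # Use a no-op sleep so the demo finishes immediately.
    print(call_with_retries(flaky, sleep=lambda seconds: None), state["calls"])
```

With `max_retries=10` the function makes at most 11 calls in total. Capping the delay keeps the worst-case wait bounded even with the larger retry budget.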