This pull request includes several changes to improve and fix the document chunking function app, focusing on refining the chunking process for various file types and enhancing logging and configuration settings. The most important changes include updates to the chunker classes, adjustments to the configuration settings, and improvements to logging.
Chunker Class Updates:
chunking/chunkers/langchain_chunker.py: Modified methods to download blob data and decode it into text before chunking. Added a text parameter to the _chunk_content method. (F5d12d29L70R80, F5d12d29L96R100, F5d12d29L133R137)
chunking/chunkers/spreadsheet_chunker.py: Added environment variable support for max_chunk_size and improved logging for chunk processing. (F7853bb3L55R54, F7853bb3L71R69, F7853bb3L96R89)
chunking/chunkers/transcription_chunker.py: Updated methods to pass the text parameter for chunking and improved logging. (chunking/chunkers/transcription_chunker.pyL63-R68, F420b885L78R79, F420b885L98R99, F420b885L107R108)
Configuration and Logging Enhancements:
local.settings.json.template: Added SPREADSHEET_NUM_TOKENS configuration setting.
tools/aoai.py: Increased MAX_RETRIES for rate limit errors from 3 to 10.
function_app.py: Removed documentContent from logging and request schema. [1] [2]
Documentation Updates:
README.md: Updated supported formats and removed outdated information about Document Intelligence API 4.0. ([1] [2] F0524c14L89R90)
Minor Changes:
setup.py: Changed dataToExtract configuration to allMetadata. (Fb1524a3L663R667)
chunking/document_chunking.py: Corrected logging message format in _error_message method.
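
For reviewers, the chunker changes listed under Chunker Class Updates (decode the downloaded blob bytes to text first, then pass that text to the chunking step, with the size limit read from configuration) can be pictured with the minimal sketch below. It is not the code from this pull request. The helper names decode_blob_bytes, max_chunk_size_from_env, and chunk_content are hypothetical, the default of 2048 is illustrative, the latin-1 fallback is an assumption, and treating SPREADSHEET_NUM_TOKENS as the source of max_chunk_size is a guess based on the setting name. The sketch also splits by characters rather than tokens to stay dependency-free.

```python
import os


def decode_blob_bytes(data: bytes) -> str:
    """Decode raw blob bytes to text (hypothetical helper, not the PR's code)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: latin-1 maps every byte value, so this cannot raise.
        return data.decode("latin-1")


def max_chunk_size_from_env(default: int = 2048) -> int:
    """Read the size limit from the environment; the default is illustrative."""
    return int(os.getenv("SPREADSHEET_NUM_TOKENS", str(default)))


def chunk_content(text: str, max_chunk_size: int) -> list[str]:
    """Split already-decoded text into pieces of at most max_chunk_size characters."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]


# Usage: bytes stand in for a downloaded blob; the text is decoded before chunking.
blob_bytes = "first row\nsecond row\n".encode("utf-8")
text = decode_blob_bytes(blob_bytes)
chunks = chunk_content(text, max_chunk_size_from_env())
```

Keeping decoding separate from chunking means the chunking step only ever receives a str, which mirrors the added text parameter described above.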