Iodine98 / dora-back

A Python backend for Document Retrieval and Analysis (DoRA).
MIT License

Replace TokenTextSplitter with RecursiveCharacterTextSplitter and add CHUNK_OVERLAP environment variable #28

Closed by Iodine98 5 months ago

Iodine98 commented 5 months ago

This pull request replaces TokenTextSplitter with RecursiveCharacterTextSplitter, both from the langchain.text_splitter module. RecursiveCharacterTextSplitter is better suited to text-heavy documents such as PDFs: it tries paragraph and line breaks first and only falls back to finer-grained separators when a piece is still too large, so paragraphs and sentences tend to stay together within a chunk. The pull request also adds a CHUNK_OVERLAP environment variable, which lets users set the overlap between consecutive chunks.
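As a rough illustration of the change described above, the splitter could be configured from the environment along these lines. This is only a sketch, not the code from this pull request: the helper names `read_splitter_settings` and `build_splitter`, the `CHUNK_SIZE` variable, and both default values are assumptions made for the example. Only `CHUNK_OVERLAP` and `RecursiveCharacterTextSplitter` come from the description.

```python
import os

# Assumed defaults for illustration; the PR does not state the real ones.
DEFAULT_CHUNK_SIZE = 1000
DEFAULT_CHUNK_OVERLAP = 200


def read_splitter_settings(env=None):
    """Return (chunk_size, chunk_overlap) parsed from environment variables.

    CHUNK_OVERLAP is the variable named in this PR; CHUNK_SIZE is a
    hypothetical companion variable used only in this sketch.
    """
    env = os.environ if env is None else env
    chunk_size = int(env.get("CHUNK_SIZE", DEFAULT_CHUNK_SIZE))
    chunk_overlap = int(env.get("CHUNK_OVERLAP", DEFAULT_CHUNK_OVERLAP))
    if chunk_size <= 0:
        raise ValueError("CHUNK_SIZE must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    return chunk_size, chunk_overlap


def build_splitter(env=None):
    """Create a RecursiveCharacterTextSplitter from the parsed settings."""
    # Imported lazily so the settings helper can be used without langchain.
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    chunk_size, chunk_overlap = read_splitter_settings(env)
    # Sizes are counted in characters by default, not tokens.
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
```

With langchain installed, `build_splitter().split_text(text)` would return the list of chunks. Since TokenTextSplitter counts tokens and RecursiveCharacterTextSplitter counts characters by default, existing size and overlap values may need revisiting after the switch.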