ERROR IN split_and_upload(): Traceback: [<FrameSummary file /app/ai_ta_backend/vector_database.py, line 780 in split_and_upload>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/langchain/text_splitter.py, line 136 in create_documents>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/langchain/text_splitter.py, line 687 in split_text>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/langchain/text_splitter.py, line 669 in _split_text>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/langchain/text_splitter.py, line 250 in _tiktoken_encoder>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/tiktoken/core.py, line 117 in encode>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/tiktoken/core.py, line 351 in raise_disallowed_special_token>]
❌❌ Error in split_and_upload:Encountered text corresponding to disallowed special token '<|endoftext|>'.
This is a particular problem when scraping GitHub repos related to AI, since their files can contain the literal text '<|endoftext|>'. For now we need a try/except around the split; a better fix later would be to sanitize special tokens before encoding. That's the new SQL injection.
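
One possible shape for the "sanitize special tokens" idea is sketched below. It is only a sketch: the `SPECIAL_TOKENS` list and the `sanitize_special_tokens` helper are hypothetical names, not part of `vector_database.py`, and the list of token strings is an assumption that would need to match whichever encoding is actually in use. The helper rewrites each special-token string so that the tokenizer sees ordinary text instead.

```python
import re

# Assumed set of special-token strings to neutralize; adjust to the
# encoding actually used by the text splitter.
SPECIAL_TOKENS = (
    "<|endoftext|>",
    "<|endofprompt|>",
    "<|fim_prefix|>",
    "<|fim_middle|>",
    "<|fim_suffix|>",
)

# One regex that matches any of the literal token strings above.
_SPECIAL_TOKEN_RE = re.compile("|".join(re.escape(t) for t in SPECIAL_TOKENS))


def _defuse(match: "re.Match[str]") -> str:
    # Insert spaces inside the delimiters so the string no longer equals
    # a special token, e.g. "<|endoftext|>" becomes "<| endoftext |>".
    return match.group(0).replace("<|", "<| ").replace("|>", " |>")


def sanitize_special_tokens(text: str) -> str:
    """Return text with known special-token strings rewritten as plain text."""
    return _SPECIAL_TOKEN_RE.sub(_defuse, text)


if __name__ == "__main__":
    scraped = "prompt = 'Hello<|endoftext|>'"
    print(sanitize_special_tokens(scraped))
```

Calling this on each scraped document before it reaches the splitter would avoid the exception, at the cost of slightly altering the stored text. An alternative that leaves the text untouched is to pass `disallowed_special=()` to tiktoken's `encode`, which encodes these strings as normal text; whether that option can be threaded through the splitter used here has not been checked.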