UIUC-Chatbot / ai-ta-backend

API backend for UIUC AI Teaching Assistant.
https://docs.uiuc.chat/
MIT License
10 stars, 7 forks

❌❌ Error in split_and_upload:Encountered text corresponding to disallowed special token '<|endoftext|>'. #47

Open KastanDay opened 1 year ago

KastanDay commented 1 year ago
ERROR IN split_and_upload(): Traceback: [<FrameSummary file /app/ai_ta_backend/vector_database.py, line 780 in split_and_upload>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/langchain/text_splitter.py, line 136 in create_documents>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/langchain/text_splitter.py, line 687 in split_text>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/langchain/text_splitter.py, line 669 in _split_text>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/langchain/text_splitter.py, line 250 in _tiktoken_encoder>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/tiktoken/core.py, line 117 in encode>, <FrameSummary file /opt/venv/lib/python3.8/site-packages/tiktoken/core.py, line 351 in raise_disallowed_special_token>]

❌❌ Error in split_and_upload:Encountered text corresponding to disallowed special token '<|endoftext|>'.
KastanDay commented 1 year ago

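This is a particular problem when scraping GitHub repos related to AI, presumably because their files contain literal special-token strings like '<|endoftext|>'. We need a try/except for now and a better fix later: somehow sanitize special tokens before they reach the tokenizer. That's the new SQL injection.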
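A minimal sketch of the "try/except now, sanitize later" idea, using only the standard library. The names `sanitize_special_tokens` and `split_safely` are hypothetical and do not come from this repo, and `split_fn` stands in for whatever splitter `split_and_upload` actually calls. It relies on the fact that tiktoken raises `ValueError` for disallowed special tokens. Assume the regex is only a rough approximation of what a special token looks like.

```python
import re
from typing import Callable, List

# Rough approximation of tiktoken-style special tokens such as
# <|endoftext|> or <|fim_prefix|>. This pattern is an assumption,
# not an official list of special tokens.
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|([A-Za-z0-9_]+)\|>")


def sanitize_special_tokens(text: str) -> str:
    """Hypothetical helper: break up special-token look-alikes.

    "<|endoftext|>" becomes "< |endoftext| >", which the tokenizer
    treats as ordinary text while staying readable to a human.
    """
    return SPECIAL_TOKEN_PATTERN.sub(r"< |\1| >", text)


def split_safely(text: str, split_fn: Callable[[str], List[str]]) -> List[str]:
    """Hypothetical wrapper: try the splitter, sanitize and retry on failure.

    tiktoken signals a disallowed special token with ValueError, so that
    is the only exception caught here.
    """
    try:
        return split_fn(text)
    except ValueError:
        # Short-term fix: neutralize the offending strings and retry once.
        return split_fn(sanitize_special_tokens(text))
```

An alternative worth considering is the one the tiktoken error message itself suggests: passing `disallowed_special=()` to `encode` so that such strings are encoded as normal text. Whether that option can be threaded through the LangChain splitter used here would need to be checked against the installed version.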