AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0
2.98k stars, 204 forks

llama_index_sentence_splitter issues #66

Open deter3 opened 9 months ago

deter3 commented 9 months ago

ragatouille 0.0.4b2, Ubuntu 22.04

When I run the sample code, I get the error below. documents is just a list of strings.

Traceback (most recent call last):
  File "/workspace/three_methods_ranking2.py", line 160, in <module>
    my_documents = processor.process_corpus(documents)
  File "/usr/local/lib/python3.10/dist-packages/ragatouille/data/corpus_processor.py", line 22, in process_corpus
    documents = self.document_splitter_fn(documents, **splitter_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ragatouille/data/preprocessors.py", line 9, in llama_index_sentence_splitter
    docs = [[Document(text=doc)] for doc in documents]
TypeError: 'NoneType' object is not iterable

bclavie commented 8 months ago

Hey, could you provide more of your code? It looks like the issue here is that the documents argument that makes it to the preprocessor is None, so it would be helpful to figure out how that happened!

manisnesan commented 7 months ago

@bclavie - I faced a similar issue when I ran the notebook 06-index_free_use.ipynb from examples.

I tried to create a reproducer in Colab and hit:

ValidationError: 1 validation error for Document
text
  none is not an allowed value (type=type_error.none.not_allowed)

The root cause is that one of the pages is empty, so ragatouille raises a ValidationError, which is the right behavior.

The user needs to make sure that only valid docs are passed to the CorpusProcessor.process_corpus method. This issue is not a bug and can be closed.
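For anyone landing here with the same errors, this is a minimal sketch of that pre-filtering step. The helper name drop_invalid_documents and the sample pages list are made up for illustration and are not part of ragatouille. It assumes the corpus is a plain list of strings in which some entries may be None or empty (for example, blank pages).

```python
from typing import Iterable, List, Optional


def drop_invalid_documents(documents: Optional[Iterable[Optional[str]]]) -> List[str]:
    """Hypothetical helper (not a ragatouille API): keep only non-empty strings."""
    if documents is None:
        # Covers the first traceback in this thread, where the whole
        # collection was None and could not be iterated.
        raise ValueError("documents is None; expected a list of strings")
    # Skip None entries (e.g. empty pages) and whitespace-only strings,
    # which would otherwise fail validation when wrapped in a Document.
    return [doc for doc in documents if isinstance(doc, str) and doc.strip()]


# Illustrative input: two real pages mixed with empty ones.
pages = ["First page text.", None, "", "   ", "Last page text."]
clean_pages = drop_invalid_documents(pages)
print(clean_pages)
```

The cleaned list can then be handed to processor.process_corpus(clean_pages) in place of the raw documents. Raising early when the whole collection is None is a design choice here: it surfaces the upstream loading problem instead of hiding it behind an empty corpus.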