AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0

error loading corpus and train data #242


MohammedAlhajji commented 2 months ago

I'm passing triplets to the trainer. After preparing the data, training fails while reading the files generated by trainer.prepare_training_data. My corpus and queries files look good (one text per line), and my triplets file looks good (a list of three numbers per line). But when I try to train, I get the following error.

My code:

trainer.prepare_training_data(train_examples, mine_hard_negatives=False)
trainer.train(batch_size=32)
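For context, this is the shape of train_examples I'm assuming here (string triplets; the example texts are placeholders from my setup, not from the library docs):

```python
# Assumed input: a list of (query, positive_passage, negative_passage)
# string triplets, one tuple per training example.
train_examples = [
    ("what is colbert?",
     "ColBERT is a late-interaction retrieval model.",
     "The weather was sunny yesterday."),
    ("who maintains ragatouille?",
     "RAGatouille is maintained by AnswerDotAI.",
     "Bananas are rich in potassium."),
]

# Every example should be exactly three strings.
assert all(len(t) == 3 and all(isinstance(x, str) for x in t)
           for t in train_examples)
```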

Error:

[Aug 19, 09:49:51] #> Loading the queries from data/queries.train.colbert.tsv ...
Process Process-9:
Traceback (most recent call last):
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/training/training.py", line 43, in train
    reader = LazyBatcher(config, triples, queries, collection, (0 if config.rank == -1 else config.rank), config.nranks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/training/lazy_batcher.py", line 28, in __init__
    self.queries = Queries.cast(queries)
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/data/queries.py", line 113, in cast
    return cls(path=obj)
           ^^^^^^^^^^^^^
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/data/queries.py", line 17, in __init__
    self._load_data(data) or self._load_file(path)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/data/queries.py", line 52, in _load_file
    self.data = load_queries(path)
                ^^^^^^^^^^^^^^^^^^
  File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/evaluation/loaders.py", line 22, in load_queries
    qid, query, *_ = line.strip().split('\t')
    ^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected at least 2, got 1)

An odd observation: when I pass only train_examples[:1000] instead of the whole set, it seems to work and training starts. train_examples is a list of triplet tuples.
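Since the failing line is `qid, query, *_ = line.strip().split('\t')`, one plausible cause (just a guess) is that some query beyond the first 1000 examples contains an embedded newline or tab, which corrupts the generated TSV. A small diagnostic I'd run over data/queries.train.colbert.tsv to find lines that would fail that unpack (find_malformed_tsv_lines is a hypothetical helper, not part of either library):

```python
from pathlib import Path

def find_malformed_tsv_lines(path):
    """Return (line_number, line) pairs that have fewer than two
    tab-separated fields, i.e. lines that would raise the same
    'not enough values to unpack' ValueError during loading."""
    bad = []
    for i, line in enumerate(
        Path(path).read_text(encoding="utf-8").splitlines(), start=1
    ):
        if len(line.strip().split("\t")) < 2:
            bad.append((i, line))
    return bad
```

If this reports any lines, stripping tabs and newlines from the raw query/passage texts (e.g. `" ".join(text.split())`) before calling prepare_training_data might avoid the crash.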