Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
I'm passing triplets to the trainer. After preparing the data, training fails while reading the files generated by trainer.prepare_training_data. My corpus and queries files look good, one text per line. My triplets file looks good, a list of three numbers per line. All looks good so far. When I try to train, I get the following error:
[Aug 19, 09:49:51] #> Loading the queries from data/queries.train.colbert.tsv ...
Process Process-9:
Traceback (most recent call last):
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process
return_val = callee(config, *args)
^^^^^^^^^^^^^^^^^^^^^
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/training/training.py", line 43, in train
reader = LazyBatcher(config, triples, queries, collection, (0 if config.rank == -1 else config.rank), config.nranks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/training/lazy_batcher.py", line 28, in __init__
self.queries = Queries.cast(queries)
^^^^^^^^^^^^^^^^^^^^^
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/data/queries.py", line 113, in cast
return cls(path=obj)
^^^^^^^^^^^^^
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/data/queries.py", line 17, in __init__
self._load_data(data) or self._load_file(path)
^^^^^^^^^^^^^^^^^^^^^
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/data/queries.py", line 52, in _load_file
self.data = load_queries(path)
^^^^^^^^^^^^^^^^^^
File "/home/usr/miniconda/envs/colbert2/lib/python3.11/site-packages/colbert/evaluation/loaders.py", line 22, in load_queries
qid, query, *_ = line.strip().split('\t')
^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected at least 2, got 1)
An odd observation is that when I pass only train_examples[:1000] instead of the whole set, it seems to work and training starts. train_examples is a list of triplet tuples.
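Since the first 1000 examples train fine, one plausible explanation is that a few rows later in the queries file are malformed, for example a query with empty text or a query containing an embedded newline that spills onto its own line. The sketch below is a diagnostic only, not part of RAGatouille or ColBERT: the helper name `find_malformed_lines` is hypothetical, and it simply repeats the `line.strip().split('\t')` step shown in the traceback to list the lines that would fail to unpack.

```python
def find_malformed_lines(path, min_fields=2, limit=20):
    """Return (line_number, raw_line) pairs that would break the loader.

    Hypothetical helper: it applies the same strip-and-split step that
    appears in the traceback (line.strip().split('\t')) and reports lines
    yielding fewer than `min_fields` tab-separated fields.
    """
    bad = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            # strip() also removes a trailing tab, so "qid<TAB>" followed by
            # an empty query collapses to a single field.
            fields = line.strip().split("\t")
            if len(fields) < min_fields:
                bad.append((line_no, line))
                if len(bad) >= limit:
                    break
    return bad
```

Calling `find_malformed_lines("data/queries.train.colbert.tsv")` on the file named in the log and printing each pair with `repr()` should show whether the offending rows are empty queries or fragments of a query that was split by a newline. If that is the cause, cleaning the query strings (dropping empty ones, replacing newlines and tabs with spaces) before calling trainer.prepare_training_data may resolve the error.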