Closed: shubham526 closed this issue 6 months ago
@bclavie Any comments on how to resolve this?
Ok, fixed this. It was an issue with the gcc and gxx versions. I looked at the conda yml file in the official ColBERT repository and created a new environment with exactly those package versions.
Hi @shubham526! First of all, I wish you good work and success. Could you please share the code that loads the fine-tuned model, as well as the fine-tuning code you used? Thank you for your interest.
I just used the code given in this repository. See the examples here: https://github.com/bclavie/RAGatouille/tree/main/examples
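In case it helps, here is a minimal sketch of the inference flow those examples demonstrate. The documents, index name, and query below are illustrative placeholders, not taken from the linked examples:

```python
from ragatouille import RAGPretrainedModel

# Load a pretrained (or fine-tuned) ColBERT checkpoint
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build an index over a small, illustrative document collection
RAG.index(
    collection=[
        "ColBERT is a late-interaction retrieval model.",
        "RAGatouille makes training and using ColBERT models easy.",
    ],
    index_name="my_index",
)

# Retrieve the top-k passages for a query
results = RAG.search(query="What is ColBERT?", k=2)
print(results)
```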
Thank you so much for your reply @shubham526! I have a few more questions; I would be happy if you could answer them when you are available. Here is my fine-tuning code:
```python
from ragatouille import RAGTrainer
from ragatouille.data import CorpusProcessor, llama_index_sentence_splitter
import os
import glob
import random


def main():
    trainer = RAGTrainer(
        model_name="ColBERT_1.0",  # name for the newly trained model ("ColBERT_1" for the first sample)
        # pretrained_model_name="colbert-ir/colbertv2.0",
        pretrained_model_name="intfloat/e5-base",  # base model to train from
        language_code="tr",
    )

    # Path to the directory containing all the `.txt` files for indexing
    folder_path = "/text"  # the text folder contains several .txt files

    # Read all `.txt` files in the folder; use each file name
    # (without extension) as the document ID
    all_texts = []
    document_ids = []
    for file_path in glob.glob(os.path.join(folder_path, "*.txt")):
        with open(file_path, "r", encoding="utf-8") as file:
            all_texts.append(file.read())
        document_ids.append(os.path.splitext(os.path.basename(file_path))[0])

    # Chunking
    corpus_processor = CorpusProcessor(document_splitter_fn=llama_index_sentence_splitter)
    documents = corpus_processor.process_corpus(
        documents=all_texts,
        document_ids=document_ids,
        chunk_size=256,  # overlap=0.1 chosen
    )

    # To train retrieval models like ColBERT, we need training triplets:
    # queries, positive passages, and negative passages for each query.
    # Here we build fake query-relevant passage pairs.
    queries = [
        "document relevant query-1",
        "document relevant query-2",
        "document relevant query-3",
        "document relevant query-4",
        "document relevant query-5",
        "document relevant query-6",
    ] * 3

    pairs = []
    for query in queries:
        fake_relevant_docs = random.sample(documents, 10)
        for doc in fake_relevant_docs:
            pairs.append((query, doc))

    # Prepare the training data (hard negatives are mined automatically)
    trainer.prepare_training_data(
        raw_data=pairs,
        data_out_path="./data_out_path",
        all_documents=all_texts,
        num_new_negatives=10,
        mine_hard_negatives=True,
    )

    trainer.train(
        batch_size=32,
        nbits=4,  # how many bits the trained model will use
        maxsteps=500000,
        use_ib_negatives=True,  # use in-batch negatives when computing the loss
        dim=128,  # each embedding will have 128 dimensions
        learning_rate=5e-6,  # small values in [3e-6, 3e-5] work best for BERT-like base models; 5e-6 is often the sweet spot
        doc_maxlen=256,  # maximum document length
        use_relu=False,  # disable ReLU
        warmup_steps="auto",  # defaults to 10% of total steps
    )


if __name__ == "__main__":
    main()
```
When I use my code, a model with the structure shown below is saved to the checkpoints directory: [screenshot of the resulting `colbert` checkpoint folder]
I need to fine-tune the intfloat/e5-base or intfloat/multilingual-e5-base model with my own data and ColBERT. Do you know of any changes I need to make to my code, or to the library's internal code?
Also, how can I try the fine-tuned model with the structure I shared above? Do you have code we can use to load and try it?
Thanks again for your interest.
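For what it's worth, a minimal loading sketch, under the assumption that `RAGPretrainedModel.from_pretrained` is pointed at the local checkpoint directory produced by training. The checkpoint path below is a hypothetical placeholder, not a path from this thread:

```python
from ragatouille import RAGPretrainedModel

# Hypothetical path to the checkpoint directory produced by trainer.train()
checkpoint_path = "path/to/checkpoints/colbert"

RAG = RAGPretrainedModel.from_pretrained(checkpoint_path)

# Rerank a handful of candidate passages without building an index
results = RAG.rerank(
    query="document relevant query-1",
    documents=["first candidate passage", "second candidate passage"],
    k=2,
)
print(results)
```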
@bclavie Hi. So I was finally able to fine-tune ColBERT using your library. But how do I load this model for inference? I assumed I needed to use `RAGPretrainedModel`, but apparently not; it gives me an error. What am I missing? Below is the stack trace: