AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0

How to load a fine-tuned model? #211

Closed. shubham526 closed this 6 months ago

shubham526 commented 6 months ago

@bclavie Hi. I was finally able to fine-tune ColBERT using your library, but how do I load this model for inference? I assumed I needed to use RAGPretrainedModel, but apparently not: it gives me an error. What am I missing?

Below is the stack trace:

>>> RAG = RAGPretrainedModel.from_pretrained("/home/schatte4/.ragatouille/colbert/none/2024-05/03/23.46.26/checkpoints/colbert")
/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[May 04, 00:02:07] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Traceback (most recent call last):
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
    subprocess.run(
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/ragatouille/RAGPretrainedModel.py", line 71, in from_pretrained
    instance.model = ColBERT(
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/ragatouille/models/colbert.py", line 84, in __init__
    self.inference_ckpt = Checkpoint(
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/colbert/modeling/checkpoint.py", line 19, in __init__
    super().__init__(name, colbert_config)
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/colbert/modeling/colbert.py", line 24, in __init__
    ColBERT.try_load_torch_extensions(self.use_gpu)
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/colbert/modeling/colbert.py", line 39, in try_load_torch_extensions
    segmented_maxsim_cpp = load(
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1309, in load
    return _jit_compile(
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1832, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2123, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'segmented_maxsim_cpp': [1/2] /home/schatte4/anaconda3/envs/rag-env/bin/x86_64-conda-linux-gnu-g++ -MMD -MF segmented_maxsim.o.d -DTORCH_EXTENSION_NAME=segmented_maxsim_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include -isystem /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/TH -isystem /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/THC -isystem /home/schatte4/anaconda3/envs/rag-env/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -c /disk/nfs/ostrom/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/colbert/modeling/segmented_maxsim.cpp -o segmented_maxsim.o 
FAILED: segmented_maxsim.o 
/home/schatte4/anaconda3/envs/rag-env/bin/x86_64-conda-linux-gnu-g++ -MMD -MF segmented_maxsim.o.d -DTORCH_EXTENSION_NAME=segmented_maxsim_cpp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include -isystem /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/TH -isystem /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/THC -isystem /home/schatte4/anaconda3/envs/rag-env/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -c /disk/nfs/ostrom/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/colbert/modeling/segmented_maxsim.cpp -o segmented_maxsim.o 
In file included from /disk/nfs/ostrom/schatte4/anaconda3/envs/rag-env/x86_64-conda-linux-gnu/include/c++/11.2.0/chrono:42,
                 from /disk/nfs/ostrom/schatte4/anaconda3/envs/rag-env/x86_64-conda-linux-gnu/include/c++/11.2.0/mutex:39,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/c10/util/typeid.h:8,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/c10/core/ScalarTypeToTypeMeta.h:5,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:18,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/ATen/core/Tensor.h:3,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/ATen/Tensor.h:3,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/function_hook.h:3,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/cpp_hook.h:2,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/variable.h:6,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/csrc/autograd/autograd.h:3,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/autograd.h:3,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/all.h:7,
                 from /home/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/torch/include/torch/extension.h:5,
                 from /disk/nfs/ostrom/schatte4/anaconda3/envs/rag-env/lib/python3.10/site-packages/colbert/modeling/segmented_maxsim.cpp:2:
/disk/nfs/ostrom/schatte4/anaconda3/envs/rag-env/x86_64-conda-linux-gnu/include/c++/11.2.0/ctime:80:11: error: 'timespec_get' has not been declared in '::'
   80 |   using ::timespec_get;
      |           ^~~~~~~~~~~~
ninja: build stopped: subcommand failed.

shubham526 commented 6 months ago

@bclavie Any comments on how to resolve this?

shubham526 commented 6 months ago

OK, fixed this. It was an issue with the gcc and gxx versions. I looked at the conda yml file in the official ColBERT repository and created a new environment with exactly the package versions listed there.
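
For anyone hitting the same build failure, a small diagnostic can show which compiler the extension build is likely to pick up before recreating the environment. This is a minimal sketch, not part of RAGatouille or ColBERT: `describe_compilers` is a hypothetical helper, the list of tool names is an assumption based on the trace above, and it relies on my understanding that PyTorch's extension builder reads the `CXX` environment variable when it is set. The `COLBERT_LOAD_TORCH_EXTENSION_VERBOSE` variable is the one named in the log message above.

```python
import os
import shutil
import subprocess


def describe_compilers(names=("g++", "gcc", "x86_64-conda-linux-gnu-g++", "ninja")):
    """Map each tool name to (path, first line of `--version`), or None if not on PATH."""
    candidates = list(names)
    # Assumption: the JIT build uses $CXX when it is set, so report it first.
    cxx = os.environ.get("CXX")
    if cxx and cxx not in candidates:
        candidates.insert(0, cxx)

    report = {}
    for name in candidates:
        path = shutil.which(name)
        if path is None:
            report[name] = None
            continue
        try:
            out = subprocess.run(
                [path, "--version"], capture_output=True, text=True, timeout=10
            )
            lines = (out.stdout or out.stderr).strip().splitlines()
            first_line = lines[0] if lines else ""
        except (OSError, subprocess.SubprocessError):
            first_line = ""
        report[name] = (path, first_line)
    return report


if __name__ == "__main__":
    # Ask ColBERT for the full build log on the next load (see the log line above).
    os.environ["COLBERT_LOAD_TORCH_EXTENSION_VERBOSE"] = "True"
    for tool, info in describe_compilers().items():
        print(tool, "->", info if info else "not found on PATH")
```

Comparing the reported versions against the ones pinned in the ColBERT conda yml file is then a manual step; the sketch only reports what is installed.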

4entertainment commented 6 months ago

Hi @shubham526! First of all, I wish you good work and success. Could you please share the fine-tuning code you used and the code that loads the fine-tuned model? Thank you for your interest.

shubham526 commented 6 months ago

I just used the code given in this repository. Look at examples here: https://github.com/bclavie/RAGatouille/tree/main/examples

4entertainment commented 6 months ago

Thank you so much for your reply, @shubham526! I have a few more questions, and I would be happy if you could answer when you are available. Here is my fine-tuning code:

from ragatouille import RAGTrainer
from ragatouille.data import CorpusProcessor, llama_index_sentence_splitter
import os
import glob
import random

def main():
    trainer = RAGTrainer(model_name="ColBERT_1.0",  # ColBERT_1 for first sample
                         # pretrained_model_name="colbert-ir/colbertv2.0",
                         pretrained_model_name="intfloat/e5-base",
                         language_code="tr"
                         )
    # pretrained_model_name: base model to fine-tune
    # model_name: name given to the newly trained model

    # Path to the directory containing all the `.txt` files for indexing
    folder_path = "/text" # text folder contains several txt files.
    # Initialize lists to store the texts and their corresponding file names
    all_texts = []
    document_ids = []
    # Read all `.txt` files in the specified folder and extract file names
    for file_path in glob.glob(os.path.join(folder_path, "*.txt")):
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()
            all_texts.append(content)
            document_ids.append(os.path.splitext(os.path.basename(file_path))[0])  # Extract file name without extension

    # chunking
    corpus_processor = CorpusProcessor(document_splitter_fn=llama_index_sentence_splitter)
    documents = corpus_processor.process_corpus(documents=all_texts, document_ids=document_ids, chunk_size=256) # overlap=0.1 chosen

    # To train retrieval models like ColBERT, we need training triplets: queries, positive passages, and negative passages for each query.
    # Fake (query, relevant passage) pairs, used here as placeholders
    queries = ["document relevant query-1",
               "document relevant query-2",
               "document relevant query-3",
               "document relevant query-4",
               "document relevant query-5",
               "document relevant query-6"
    ] * 3
    pairs = []
    for query in queries:
        fake_relevant_docs = random.sample(documents, 10)
        for doc in fake_relevant_docs:
            pairs.append((query, doc))

    # prepare training data
    trainer.prepare_training_data(raw_data=pairs,
                                  data_out_path="./data_out_path",
                                  all_documents=all_texts,
                                  num_new_negatives=10,
                                  mine_hard_negatives=True
                                  )
    trainer.train(batch_size=32,
                  nbits=4,  # how many bits the trained model will use when compressing indexes
                  maxsteps=500000,  # hard cap on the number of training steps
                  use_ib_negatives=True,  # use in-batch negatives when computing the loss
                  dim=128,  # each embedding will have 128 dimensions
                  learning_rate=5e-6,  # small values ([3e-6, 3e-5]) work best if the base model is BERT-like; 5e-6 is often the sweet spot
                  doc_maxlen=256,  # maximum document length
                  use_relu=False,  # disable ReLU
                  warmup_steps="auto",  # defaults to 10%
                  )


if __name__ == "__main__":
    main()

When I run my code, the model is saved under checkpoints with the structure shown in this image: [image: colbert]

I need to fine-tune the intfloat/e5-base or intfloat/multilingual-e5-base model with my own data and ColBERT. Do you know of any changes I need to make to my code or to the library's internal code?

Also, how can I try out the model I fine-tuned with my code, which has the structure I shared above? Do you have code that we can use to load it and try it?

Thanks again for your interest.
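
For readers with the same question: the first post in this thread passes the checkpoint directory straight to `RAGPretrainedModel.from_pretrained`, and the failure there turned out to be a compiler problem rather than a wrong API call. A minimal sketch along those lines follows. `find_latest_checkpoint` and `load_and_search` are hypothetical helpers, not part of RAGatouille; the `~/.ragatouille/colbert/.../checkpoints/colbert` layout is assumed from the path in the first post and may differ between versions; the index name `finetuned_demo` is a placeholder.

```python
from pathlib import Path


def find_latest_checkpoint(root=Path.home() / ".ragatouille" / "colbert"):
    """Return the most recently modified `checkpoints/colbert` directory under root, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    candidates = [p for p in root.glob("**/checkpoints/colbert") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def load_and_search(checkpoint_dir, passages, query, k=3):
    """Load a fine-tuned checkpoint, index a few passages, and run one query."""
    # Imported lazily so find_latest_checkpoint works without ragatouille installed.
    from ragatouille import RAGPretrainedModel

    rag = RAGPretrainedModel.from_pretrained(str(checkpoint_dir))
    rag.index(collection=passages, index_name="finetuned_demo")  # placeholder name
    return rag.search(query, k=k)


if __name__ == "__main__":
    checkpoint = find_latest_checkpoint()
    if checkpoint is None:
        print("No checkpoint found under ~/.ragatouille/colbert")
    else:
        print("Using checkpoint:", checkpoint)
```

Calling `load_and_search(checkpoint, my_passages, "some query")` with your own list of passage strings would then exercise the fine-tuned model; the quality of the results depends on the training data, and the fake query/passage pairs in the script above are only placeholders.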