McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

How do I run MNTP training locally #93

Open Georgepitt opened 3 weeks ago

Georgepitt commented 3 weeks ago

Hello, I'm here to raise this issue again. As you know, the computing cluster provided by my lab has to run offline, so I want to run MNTP training locally.

I noticed the dataset download code snippet in run_mntp.py (shown below). I tried downloading the datasets locally and modifying the config JSON to load them, but it still failed. Can you give me some advice? Thanks!

The dataset download snippet from run_mntp.py:

    if data_args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset(
            data_args.dataset_name,
            data_args.dataset_config_name,
            cache_dir=model_args.cache_dir,
            token=model_args.token,
            streaming=data_args.streaming,
        )
        if "validation" not in raw_datasets.keys():
            raw_datasets["validation"] = load_dataset(
                data_args.dataset_name,
                data_args.dataset_config_name,
                split=f"train[:{data_args.validation_split_percentage}%]",
                cache_dir=model_args.cache_dir,
                token=model_args.token,
                streaming=data_args.streaming,
            )
            raw_datasets["train"] = load_dataset(
                data_args.dataset_name,
                data_args.dataset_config_name,
                split=f"train[{data_args.validation_split_percentage}%:]",
                cache_dir=model_args.cache_dir,
                token=model_args.token,
                streaming=data_args.streaming,
            )
    else:
        data_files = {}
        if data_args.train_file is not None:
            data_files["train"] = data_args.train_file
            extension = data_args.train_file.split(".")[-1]
        if data_args.validation_file is not None:
            data_files["validation"] = data_args.validation_file
            extension = data_args.validation_file.split(".")[-1]
        if extension == "txt":
            extension = "text"
        raw_datasets = load_dataset(
            extension,
            data_files=data_files,
            cache_dir=model_args.cache_dir,
            token=model_args.token,
        )

Downloading the datasets locally:

from datasets import load_dataset
import json
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

def save_to_json(dataset, split_name, file_path):
    # to_dict() returns a column-oriented mapping, e.g. {"text": [...]}
    data = dataset[split_name].to_dict()
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# Raw strings avoid invalid escape sequences in Windows paths.
save_to_json(dataset, 'train', r'E:\datasets_path\wikitext_train.json')
save_to_json(dataset, 'validation', r'E:\datasets_path\wikitext_validation.json')
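
As the error below confirms, to_dict() produces a single column-oriented object ({"text": [...]}), which the json builder can only read with a field='text' hint. A sketch of an alternative with the same splits and illustrative paths: Dataset.to_json writes JSON Lines, which the builder reads directly.

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

# to_json writes JSON Lines by default (one {"text": ...} object per line),
# a layout that load_dataset("json", ...) parses without a field= hint.
dataset["train"].to_json(r"E:\datasets_path\wikitext_train.jsonl")
dataset["validation"].to_json(r"E:\datasets_path\wikitext_validation.jsonl")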

mntp_gemma.json

"dataset_name": null,
"dataset_config_name": "wikitext-103-raw-v1",
"validation_file" : "/share/llm2vec_mntp/wikitext/wikitext_validation.json",
"train_file" : "/share/llm2vec_mntp/wikitext/wikitext_train.json",
Georgepitt commented 3 weeks ago

The error log is:

Downloading took 0.0 min
06/06/2024 13:12:06 - INFO - datasets.download.download_manager - Downloading took 0.0 min
Checksum Computation took 0.0 min
06/06/2024 13:12:06 - INFO - datasets.download.download_manager - Checksum Computation took 0.0 min
Generating train split
06/06/2024 13:12:06 - INFO - datasets.builder - Generating train split

ValueError: Not able to read records in the JSON file at /share/home/Research_CodeSearch/llm2v/llm2vec_mntp/wikitext/wikitext_train.json. You should probably indicate the field of the JSON file containing your records. This JSON file contain the following fields: ['text']. Select the correct one and provide it as `field='XXX'` to the dataset loading method. 

  File "/share/home/llm2vec_mntp/run_mntp.py", line 605, in main
    raw_datasets = load_dataset(
File "/share/home/chenyuxuan/.conda/envs//.conda/envs/LLM2Vec/lib/python3.8/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "/share/home/.conda/envs/LLM2Vec/lib/python3.8/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/share/home/.conda/envs/LLM2Vec/lib/python3.8/site-packages/datasets/builder.py", line 1882, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/share/home/.conda/envs/LLM2Vec/lib/python3.8/site-packages/datasets/builder.py", line 2038, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Georgepitt commented 3 weeks ago

Okay, I solved the problem above. The main fix was to save the data as txt files and set offline mode.
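
Concretely, a minimal sketch of that fix (WikiText splits, illustrative file names): write each split as plain text, which run_mntp.py routes to the "text" builder, then launch training on the cluster with the Hugging Face offline flags set (HF_DATASETS_OFFLINE=1 and TRANSFORMERS_OFFLINE=1).

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

def save_to_txt(split, file_path):
    # One record per line; run_mntp.py loads .txt files with the "text" builder.
    with open(file_path, "w", encoding="utf-8") as f:
        for record in split:
            text = record["text"]
            f.write(text if text.endswith("\n") else text + "\n")

save_to_txt(dataset["train"], "wikitext_train.txt")
save_to_txt(dataset["validation"], "wikitext_validation.txt")

The config then points train_file and validation_file at the .txt outputs.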

If I want to do Bi+MNTP training on another model, how do I modify the parameters and set up the bidirectional model?

# Loading the bidirectional model using the LLM2Vec package
model_class = get_model_class(config)

def get_model_class(config):
    config_class_name = config.__class__.__name__
    if config_class_name == "MistralConfig":
        return MistralBiForMNTP
    elif config_class_name == "LlamaConfig":
        return LlamaBiForMNTP
    else:
        raise ValueError(f"Model class {config_class_name} not supported.")
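
For reference, a sketch of the one-line dispatch change such support would need. GemmaBiForMNTP is hypothetical here, until the PR mentioned below lands, and is assumed to be implemented analogously to the Mistral and Llama classes.

from llm2vec.models import MistralBiForMNTP, LlamaBiForMNTP

def get_model_class(config):
    config_class_name = config.__class__.__name__
    if config_class_name == "MistralConfig":
        return MistralBiForMNTP
    elif config_class_name == "LlamaConfig":
        return LlamaBiForMNTP
    elif config_class_name == "GemmaConfig":
        return GemmaBiForMNTP  # hypothetical new class
    raise ValueError(f"Model class {config_class_name} not supported.")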
Georgepitt commented 3 weeks ago

I found the location corresponding to this error in the code:

from llm2vec.models import MistralBiForMNTP, LlamaBiForMNTP 

This means that if I need to support another model like Gemma, I have to write a new file. But the local .conda/envs/LLM2Vec/lib/python3.8/site-packages/llm2vec/models/bidirectional_llama.py file has no comments, so I don't quite understand how to modify it. Do you have any suggestions?

There is also a piece of code that needs network access. I tried to load it locally by saving it as a .pkl file, but that failed.

metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir)

Modified code snippet:

        accuracy_pkl_path = "/share/home/chenyuxuan/Research_CodeSearch/llm2v/llm2vec_mntp/accuracy_metric.pkl"
        if os.path.exists(accuracy_pkl_path):
            try:
                with open(accuracy_pkl_path, 'rb') as f:
                    metric = pickle.load(f)
                print("Loaded metric from accuracy_metric.pkl:", metric)
            except Exception as e:
                print(f"Error loading metric from .pkl file: {e}")
                # Fall back so `metric` is always defined even if unpickling fails.
                metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir)
        else:
            metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir)

error logs

Error loading metric from .pkl file: No module named 'evaluate_modules'
06/06/2024 16:21:42 - WARNING - accelerate.utils.other - Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
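
The pickle route fails because evaluate.load builds a dynamic evaluate_modules package at load time, and the pickled metric object references that module, which does not exist in a fresh process. A sketch of the usual offline route instead, with an illustrative path: copy the metrics/accuracy folder from a clone of https://github.com/huggingface/evaluate onto the cluster and load the metric from the local directory.

import evaluate

# Illustrative local path; the folder comes from cloning
# https://github.com/huggingface/evaluate and copying metrics/accuracy.
metric = evaluate.load("/share/local_metrics/accuracy")
print(metric.compute(predictions=[0, 1, 1], references=[0, 1, 1]))  # {'accuracy': 1.0}

If the metric files are already in the local cache, setting HF_EVALUATE_OFFLINE=1 is another documented option.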
vaibhavad commented 3 weeks ago

Hi @Georgepitt,

So as I understand it, there are now two remaining issues: supporting Gemma, and making sure the accuracy metric code runs locally. Is that correct?

For Gemma, we had actually implemented a bidirectional version, but because we ended up not using Gemma for the paper, I did not include it in the code. I can start a PR for that.

Georgepitt commented 3 weeks ago

Thank you for your reply @vaibhavad. Yes, you are right! I have now solved the problem of running the accuracy metric code locally. I'd appreciate it if you could start a PR for Gemma. Actually, I'm more interested in how to make other models bidirectional, because I want to do comparisons across multiple models. Can you give me some advice? From what perspective should I explore this: a model's Hugging Face modeling file, a paper, or a direct modification of bidirectional_mistral.py?

vaibhavad commented 2 weeks ago

@Georgepitt - I have added Gemma.

If you are using Flash attention, then it is much more straightforward to make any model bidirectional. We discuss that in our tutorial.

For other attention mechanisms, unfortunately, the only way right now is to read the modeling_{model}.py file in the transformers library and edit the part where the causal mask is formed. This is because models differ slightly in how they implement the causal mask, and it is not standardized yet.
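
To make the flash-attention route above concrete, here is a minimal sketch (not the exact llm2vec code), assuming a mid-2024 transformers where modeling_gemma exposes GemmaFlashAttention2 and a GEMMA_ATTENTION_CLASSES registry. With flash attention 2, causality comes down to a per-module is_causal flag, so disabling it makes attention bidirectional.

from transformers.models.gemma import modeling_gemma

class ModifiedGemmaFlashAttention2(modeling_gemma.GemmaFlashAttention2):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.is_causal = False  # attend to future tokens as well as past ones

# Swap the class into the attention registry so newly built models use it.
modeling_gemma.GEMMA_ATTENTION_CLASSES["flash_attention_2"] = (
    ModifiedGemmaFlashAttention2
)

For eager or SDPA attention, the equivalent change is editing where modeling_gemma.py builds the causal mask, as described above.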

Georgepitt commented 2 days ago

Thank you for your help, it has been useful. But when I try supervised contrastive training on Gemma after finishing the MNTP training, I get an error saying GemmaConfig is not supported yet with bidirectional models. How can I solve this problem? Thank you very much for your help. The corresponding error log is below.

Error log

2024-06-28 13:58:11 - llm2vec.dataset.CSNData - INFO - Skip 1 batch for dataset ruby.
2024-06-28 13:58:11 - llm2vec.dataset.CSNData - INFO - Loaded 167424 samples.

Loading train examples...: 100%|██████████| 167424/167424 [00:00<00:00, 175089.47it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/share/home/chenyuxuan/Research_CodeSearch/llm2v/llm2vec_Supervised_contrastive_training/run_test/run_supervised.py", line 511, in <module>
[rank0]:     main()
[rank0]:   File "/share/home/chenyuxuan/Research_CodeSearch/llm2v/llm2vec_Supervised_contrastive_training/run_test/run_supervised.py", line 448, in main
[rank0]:     model = LLM2Vec.from_pretrained(
[rank0]:   File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/llm2vec/llm2vec.py", line 93, in from_pretrained
[rank0]:     model_class = cls._get_model_class(
[rank0]:   File "/share/home/chenyuxuan/.conda/envs/LLM2Vec/lib/python3.8/site-packages/llm2vec/llm2vec.py", line 67, in _get_model_class
[rank0]:     raise ValueError(
[rank0]: ValueError: GemmaConfig is not supported yet with bidirectional models.
E0628 13:58:19.378987 140501249107776 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3053) of binary: /share/home/chenyuxuan/.conda/envs/LLM2Vec/bin/python
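
The traceback points at the pip-installed copy of the package (site-packages/llm2vec/llm2vec.py), so one plausible cause is that the installed release predates the Gemma addition. A quick check, with GemmaBiModel assumed to follow the naming pattern of the Mistral/Llama classes:

import llm2vec.models

# If this prints False, the installed llm2vec predates Gemma support;
# reinstalling from the current source tree should pick it up.
print(hasattr(llm2vec.models, "GemmaBiModel"))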