Jyonn / Legommenders

A modular recommendation system that lets different components be selected and combined into a new recommender.
MIT License

ValueError: vocab nid config conflict #7

Closed anesaibr closed 2 months ago

anesaibr commented 2 months ago

Hi,

I have been referring to the ONCE project to test and run it on a news dataset, namely the Ekstra Bladet News Recommendation Dataset (EB-NeRD). However, my team and I keep getting the error attached below when trying to run the training preparation for tuning LLaMA, which makes use of the configuration files of this repository.

This is the error that gets printed when running the [main file (worker.py)](https://github.com/Jyonn/Legommenders/blob/4e8878e8188dc1fc880e9db055a800b5736d19df/worker.py):

```
  File "/gpfs/home3/scur1569/Legommenders/worker.py", line 488, in <module>
    worker = Worker(config=configuration)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home3/scur1569/Legommenders/worker.py", line 58, in __init__
    self.controller = Controller(
                      ^^^^^^^^^^^
  File "/gpfs/home3/scur1569/Legommenders/loader/controller.py", line 47, in __init__
    self.depots = Depots(user_data=self.data.user, modes=self.modes, column_map=self.column_map)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home3/scur1569/Legommenders/loader/depots.py", line 37, in __init__
    depot.union(*[DepotHub.get(d) for d in user_data.union])
  File "/home/scur1569/.conda/envs/dire_tokenize/lib/python3.11/site-packages/UniTok/unidep.py", line 221, in union
    self.vocs = self._merge_vocs(self.vocs, depot.vocs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scur1569/.conda/envs/dire_tokenize/lib/python3.11/site-packages/UniTok/unidep.py", line 189, in _merge_vocs
    raise ValueError(f'vocab {name} config conflict')
ValueError: vocab nid config conflict
```

Below is the generated configuration dict, which is printed at the beginning of the run:


```
[00:00:00] |Worker| {
    "embed": {
        "name": "llama-token",
        "embeddings": [
            {
                "vocab_name": "llama",
                "vocab_type": "numpy",
                "path": "data/llama-token.npy",
                "frozen": true
            }
        ]
    },
    "model": {
        "name": "LLAMA-NAML.D64.L0.Lora0",
        "meta": {
            "item": "Llama",
            "user": "Ada",
            "predictor": "Dot"
        },
        "config": {
            "use_neg_sampling": true,
            "use_item_content": true,
            "max_item_content_batch_size": 0,
            "same_dim_transform": false,
            "embed_hidden_size": 4096,
            "hidden_size": 64,
            "neg_count": 4,
            "item_config": {
                "llm_dir": "/home/data1/qijiong/llama-7b",
                "layer_split": 0,
                "lora": 0,
                "weights_dir": "data/ebnerd_small_tokenized-Llama/llama-7b-split"
            },
            "user_config": {
                "num_attention_heads": 12,
                "inputer_config": {
                    "use_cls_token": false,
                    "use_sep_token": false
                }
            }
        }
    },
    "exp": {
        "name": "test_llm_layer_split",
        "dir": "saving/ebnerd_small_tokenized-Llama/LLAMA-NAML.D64.L0.Lora0/llama-token-test_llm_layer_split",
        "log": "saving/ebnerd_small_tokenized-Llama/LLAMA-NAML.D64.L0.Lora0/llama-token-test_llm_layer_split/exp.log",
        "mode": "test_llm_layer_split",
        "store": {
            "layers": [
                31,
                30,
                29,
                27
            ],
            "dir": "data/ebnerd_small_tokenized-Llama/llama-7b-split"
        },
        "load": {
            "save_dir": null,
            "model_only": true,
            "strict": true,
            "wait": false
        },
        "policy": {
            "device": "gpu",
            "batch_size": 64
        }
    },
    "data": {
        "name": "ebnerd_small_tokenized-Llama",
        "base_dir": "ebnerd_small_tokenized",
        "item": {
            "filter_cache": true,
            "depot": "ebnerd_small_tokenized/news",
            "order": [
                "title",
                "cat"
            ],
            "append": [
                "nid"
            ]
        },
        "user": {
            "filter_cache": true,
            "depots": {
                "train": {
                    "path": "ebnerd_small_tokenized/train"
                },
                "dev": {
                    "path": "ebnerd_small_tokenized/validation"
                }
            },
            "filters": {
                "history": [
                    "x"
                ]
            },
            "union": [
                "ebnerd_small_tokenized/user"
            ],
            "candidate_col": "nid",
            "clicks_col": "history",
            "label_col": "click",
            "neg_col": "neg",
            "group_col": "imp",
            "user_col": "uid",
            "index_col": "index"
        }
    },
    "version": "small",
    "llm_ver": "7b",
    "hidden_size": 64,
    "layer": 0,
    "lora": 0,
    "fast_eval": 0,
    "embed_hidden_size": 4096,
    "warmup": 0,
    "simple_dev": false,
    "batch_size": 64,
    "acc_batch": 1,
    "lora_r": 32,
    "lr": 0.0001,
    "item_lr": 1e-05,
    "mind_large_submission": false,
    "epoch_batch": 0,
    "max_item_batch_size": 0,
    "page_size": 512,
    "patience": 2,
    "epoch_start": 0,
    "frozen": true,
    "load_path": null,
    "rand": {},
    "time": {},
    "seed": 2023
}
```
Jyonn commented 2 months ago

Hi,

Thanks for your attention to our work! Using Legommenders to tune LLaMA on a new dataset is complex work, and I hope I can help you get it running.

The entire pipeline is as follows:

For now, I think the error occurs on the first step: tokenization.

I guess you followed the scripts in the process folder. It seems you first tokenize the news data and save the news IDs into the nid vocabulary. Next, when you tokenize the interaction/user-sequence data, you also provide this nid vocabulary. However, I suspect there is unrecognized news in your interaction/user-sequence data, so the nid vocab gets extended to accommodate it, and the resulting nid vocab ends up larger than the original one.

In this case, I suggest you first filter out all the invalid nids.
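A minimal sketch of that filtering step with pandas, run on the raw data before tokenization (the file names and column names here are illustrative assumptions, not from this repo; adapt them to whatever your EB-NeRD frames actually contain):

```
import pandas as pd

# Hypothetical inputs: one frame with all known articles, one with interactions.
news = pd.read_parquet('news.parquet')
interactions = pd.read_parquet('interactions.parquet')

valid_nids = set(news['nid'])

# Drop unknown articles from each user's history sequence ...
interactions['history'] = interactions['history'].apply(
    lambda seq: [nid for nid in seq if nid in valid_nids]
)

# ... and drop impressions whose candidate article is unknown.
interactions = interactions[interactions['nid'].isin(valid_nids)]

interactions.to_parquet('interactions.filtered.parquet')
```

This way every nid seen during interaction tokenization already exists in the news nid vocabulary, so the union step no longer needs to extend it.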

Feel free to ask any questions, thanks.

anesaibr commented 2 months ago

Thank you for your quick reply! My team and I managed to fix the tokenization error and adjust the nid vocab to fit our data. However, we have now run into a new error coming from the embed configuration, since we are not sure how to generate the .npy files referenced by its path variable:

[screenshots of the error attached]
Jyonn commented 2 months ago

Hi, if you are using LLaMA, please extract the LLaMA token embeddings first.

You can refer to the following script:

```
import numpy as np
from transformers import LlamaModel

pretrained_dir = '/path/to/llama/'
device = 'cuda:1'

# Load pre-trained model (weights)
model = LlamaModel.from_pretrained(pretrained_dir).to(device)  # type: LlamaModel
print(len(model.layers))

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

# Extract the token embedding matrix and save it as a NumPy file.
embeds = model.embed_tokens.weight.cpu().detach().numpy()
np.save('llama-token.npy', embeds)
```
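As a quick sanity check (my addition, not part of the original script), the saved matrix for llama-7b should be 4096-dimensional, matching embed_hidden_size in the configuration dump above, and the resulting file should be placed at the path referenced by the embed config (data/llama-token.npy in that dump):

```
print(embeds.shape)  # expect (32000, 4096) for llama-7b: vocab size x hidden size
```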
Jyonn commented 2 months ago

BTW, if you run into an OOM (out-of-memory) problem, please add `--page_size 64` to the command line.
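For example, assuming the same worker.py entry point shown in the tracebacks above (keep whatever arguments you already pass and simply append the flag):

```
python worker.py <your existing arguments> --page_size 64
```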

anesaibr commented 2 months ago

We are actually experiencing another issue when trying to tune LLaMA (for the training preparation), since our dataset does not contain a test folder (only train & dev). More specifically, we are not able to use test_llm_layer_split inside the llama-split.yaml file, and we assumed we could use the other splits involving the train and dev sets. However, this currently gives an error that, as far as we can tell, complains about a missing learning rate. Any suggestion or solution for this error?


  File "/gpfs/home3/scur1569/Legommenders/worker.py", line 489, in <module>
    worker.run()
  File "/gpfs/home3/scur1569/Legommenders/worker.py", line 441, in run
    epoch = self.train_runner()
            ^^^^^^^^^^^^^^^^^^^
  File "/gpfs/home3/scur1569/Legommenders/worker.py", line 366, in train_runner
    pnt('use single lr:', self.exp.policy.lr)
  File "/home/scur1569/.conda/envs/dire_tokenize/lib/python3.11/site-packages/pigmento/pigmento.py", line 83, in __call__
    return self._call(*args, _caller_name=caller_name, _caller_class=caller_class, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scur1569/.conda/envs/dire_tokenize/lib/python3.11/site-packages/pigmento/pigmento.py", line 120, in _call
    text = ' '.join([str(arg) for arg in args])
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/scur1569/.conda/envs/dire_tokenize/lib/python3.11/site-packages/pigmento/pigmento.py", line 120, in <listcomp>
    text = ' '.join([str(arg) for arg in args])
                     ^^^^^^^^
  File "/home/scur1569/.conda/envs/dire_tokenize/lib/python3.11/site-packages/oba/oba.py", line 34, in __str__
    raise ValueError(f'Path {NoneObj.raw(self)} not exists')
ValueError: Path lr not exists
```
Jyonn commented 2 months ago

Hi, please use test_llm_layer_split. In the meantime, in your data/eb-nerd.yaml, please set the test depot path to the same path as the dev depot path.
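A minimal sketch of that change, assuming data/eb-nerd.yaml mirrors the depot layout of the configuration dump printed earlier (key names are taken from that dump, so adapt them to your actual file):

```
user:
  depots:
    train:
      path: ebnerd_small_tokenized/train
    dev:
      path: ebnerd_small_tokenized/validation
    test:
      path: ebnerd_small_tokenized/validation  # reuse the dev depot as the test depot
```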

Jyonn commented 2 months ago

Are you participating in the RecSys challenge? If you need instant assistance these days, we can add each other on WeChat, Telegram, or WhatsApp.

anesaibr commented 2 months ago

We are indeed participating in the RecSys challenge! It would be great to keep in contact via WhatsApp. Could you perhaps share your number so I can add you to our WhatsApp group?