Jyonn / Legommenders

A modular recommendation system that allows the selection of different components to be combined into a new recommender.
MIT License

How to run models? #2

Closed chiyuzhang94 closed 7 months ago

chiyuzhang94 commented 8 months ago

Hi @Jyonn ,

I would like to ask how to run the models (such as NRMS) with your tool. Any guidance?

Best, Chiyu

Jyonn commented 8 months ago

Hi Chiyu, you can try the following command.

python worker.py \
 --data config/data/<dataset>.yaml \
 --embed config/embed/null.yaml \
 --model config/model/lego_nrms.yaml \
 --exp config/exp/tt-naml.yaml \
 --embed_hidden_size 64 \
 --hidden_size 64 \
 --lr 0.001 \
 --batch_size 256
chiyuzhang94 commented 8 months ago

Thanks.

I wonder what format the dataset should look like. Does the code support multi-GPU training?

Jyonn commented 8 months ago

You can download the dataset from here. Currently, we only support single GPU running.

chiyuzhang94 commented 7 months ago

Hi @Jyonn

I have a question about how to convert my TSV files into your data format for training, because I have different datasets. I know we had some discussions about the data preprocessing, but I wonder which scripts are used to generate the final dataset for model training.

Jyonn commented 7 months ago

Hi Chiyu,

You can refer to here, which is a data preprocessing pipeline for converting the MIND data (csv file) to our format.

Basically, you should prepare several dataframes, such as an item df, a user-history df, and interaction (train/dev/test) dfs. For each df, you are required to prepare a UniTok tokenizer to tokenize that part of the data.
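
As a rough illustration only (column names follow our MIND data; yours may differ), the dataframes could look like this before UniTok tokenization:

import pandas as pd

# Item df: one row per news item; text columns are later tokenized by UniTok.
item_df = pd.DataFrame({
    'nid':   ['N1', 'N2'],
    'title': ['some news title', 'another news title'],
    'cat':   ['sports', 'finance'],
})

# User-history df: one row per user with the list of clicked item ids.
user_df = pd.DataFrame({
    'uid':     ['U1'],
    'history': [['N1', 'N2']],
})

# Interaction df (one each for train/dev/test): impression-level click labels.
interaction_df = pd.DataFrame({
    'imp':   [0, 0],
    'uid':   ['U1', 'U1'],
    'nid':   ['N1', 'N2'],
    'click': [1, 0],
})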

chiyuzhang94 commented 7 months ago

Hi @Jyonn ,

I am trying to run the code with your original dataset from the Google Drive. Here is my command. However, it failed to find 'data/MIND-small/neg/meta.data.json', and I cannot see this neg folder in your data. Can you take a look at this issue?

python worker.py \
 --version small \
 --data config/data/mind.yaml \
 --embed config/embed/null.yaml \
 --model config/model/lego_nrms.yaml \
 --exp config/exp/tt-naml.yaml \
 --embed_hidden_size 64 \
 --hidden_size 64 \
 --lr 0.001 \
 --batch_size 256

Error:

  File "/miniconda3/envs/py310/lib/python3.10/site-packages/UniTok/meta.py", line 77, in __init__
    data = self.load()
  File "/miniconda3/envs/py310/lib/python3.10/site-packages/UniTok/meta.py", line 110, in load
    return json.load(open(self.path))
FileNotFoundError: [Errno 2] No such file or directory: 'data/MIND-small/neg/meta.data.json'

I found your yaml file specified this:

 union:
    - ${data.base_dir}/user-grp
    - ${data.base_dir}/neg

But these are not in your provided dataset.

Jyonn commented 7 months ago

Hi Chiyu,

I guess I have combined the user-grp and neg data into a unified user folder. Please rewrite this snippet as:

union:
    - ${data.base_dir}/user

Sorry for the inconvenience.

chiyuzhang94 commented 7 months ago

Thanks. I got past this issue but ran into a new error:

Traceback (most recent call last):
  File "/data/home/chiyu/Legommenders/worker.py", line 484, in <module>
    worker = Worker(config=configuration)
  File "/data/home/chiyu/Legommenders/worker.py", line 58, in __init__
    self.controller = Controller(
  File "/data/home/chiyu/Legommenders/loader/controller.py", line 91, in __init__
    self.embedding_hub.register_depot(self.item_hub)
  File "/data/home/chiyu/Legommenders/loader/embedding/embedding_hub.py", line 127, in register_depot
    vocab_name = depot.get_vocab(col)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/UniTok/unidep.py", line 499, in get_vocab
    return self.cols[col_name].voc.name
KeyError: 'title'
Jyonn commented 7 months ago

Hi Chiyu,

The published dataset contains news attributes tokenized by diverse tokenizers. You can use the unitok <path> command to take a look at all the columns provided by the published data, as long as your UniTok version is 3.5.1 or above.

        UniDep (2.0): .

        Sample Size: 65238
        Id Column: nid
        Columns:
                nid, vocab nid (size 65238)
                title-bert, vocab bert (size 30522), max length 20
                abs-bert, vocab bert (size 30522), max length 50
                summarizer-bert, vocab bert (size 30522), max length 25
                cat-token, vocab cat (size 18)
                subcat-token, vocab subcat (size 270)
                cat-bert, vocab bert (size 30522), max length 4
                subcat-bert, vocab bert (size 30522), max length 8
                cat-llama, vocab llama (size 32000), max length 4
                subcat-llama, vocab llama (size 32000), max length 10
                title-llama, vocab llama (size 32000), max length 20
                abs-llama, vocab llama (size 32000), max length 50
                summarizer-llama, vocab llama (size 32000), max length 25

In this case, you need to modify the data configuration like:

item:
  filter_cache: true
  depot: ${data.base_dir}/news
  order:
    - title-bert
    - cat-token
  append:
    - nid
  lm_col: title-bert  # add this line
chiyuzhang94 commented 7 months ago

Thanks. This solved the issue.

I also got an error related to "fake_col: fake" and solved it by removing this line from the yaml file. I just wonder what this fake_col is for and whether it is safe to remove it.

Jyonn commented 7 months ago

It is for another project. You can remove it.

chiyuzhang94 commented 7 months ago

Hi @Jyonn , I have a question about running jobs with and without pretrained LMs. For example, how should I set up the configurations for NRMS vs. NRMS-PLM?

Jyonn commented 7 months ago

Here are model configurations for NRMS and NRMS-BERT:

NRMS:

name: NRMS
meta:
  item: Attention
  user: Attention
  predictor: Dot
config:
  use_neg_sampling: true
  use_item_content: true
  hidden_size: ${hidden_size}$
  embed_hidden_size: ${embed_hidden_size}$
  neg_count: 4
  item_config:
    num_attention_heads: 8
    inputer_config:
      use_cls_token: false
      use_sep_token: true
  user_config:
    num_attention_heads: 8
    inputer_config:
      use_cls_token: false
      use_sep_token: true

NRMS-Bert:

name: NRMS-Bert
meta:
  item: Bert
  user: Attention
  predictor: Dot
config:
  use_news_content: true
  max_news_content_batch_size: 0
  same_dim_transform: false
  embed_hidden_size: ${embed_hidden_size}$
  hidden_size: ${hidden_size}$
  neg_count: 4
  news_config:
    llm_dir: bert-base-uncased
  user_config:
    num_attention_heads: 8
    inputer_config:
      use_cls_token: false
      use_sep_token: false

You can also add --item_lr 0.0001 to the command, which sets a slower learning rate for the PLM; set a smaller batch_size if you run into OOM, and add --embed_hidden_size 768.

For now, we natively (via configuration only) support the LLaMA and BERT series. For other PLMs, you can refer to model/operators/bert_operator.py and design new operators.
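
Putting these flags together, a command along these lines should work (the config/model/llm/bert-nrms.yaml path is an assumption here; adjust paths and values to your setup):

python worker.py \
 --data config/data/mind.yaml \
 --embed config/embed/null.yaml \
 --model config/model/llm/bert-nrms.yaml \
 --exp config/exp/tt-naml.yaml \
 --embed_hidden_size 768 \
 --hidden_size 64 \
 --lr 0.001 \
 --item_lr 0.0001 \
 --batch_size 64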

chiyuzhang94 commented 7 months ago

Hi @Jyonn ,

I have a question about saving model predictions. Now that I have finished model training, I wonder how to load the checkpoint, run on the test set, and save the model predictions. Also, at the end of training you evaluate on the test set; can we save both predictions and gold labels to a file?

Jyonn commented 7 months ago

Hi Chiyu,

To test the model only, please use --exp exp/test.yaml. It will automatically load the best model.

To save the predictions, you can write a new method based on the evaluate method of the Worker class: save score_series to a file in your new method, and call it from the test method.
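
For instance, a small helper like this could be called from such a method (score_series comes from evaluate as mentioned above; label_series and group_series are hypothetical names for the gold labels and impression ids you would collect alongside it):

import json

def dump_predictions(score_series, label_series, group_series, path='predictions.jsonl'):
    # Write one JSON line per sample: impression id, predicted score, gold label.
    # str()/float()/int() casts keep numpy scalars JSON-serializable.
    with open(path, 'w') as f:
        for imp, score, label in zip(group_series, score_series, label_series):
            f.write(json.dumps({'imp': str(imp), 'score': float(score), 'label': int(label)}) + '\n')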

Thanks for your advice. This feature will probably be included in our next major update.

chiyuzhang94 commented 7 months ago

Hi @Jyonn

I wonder how to run Fastformer. I cannot find any specific experiment configuration for Fastformer in exp. Which exp configuration should I use?

I also wonder what these "plmnr-*" models in model/llm/ are. I do not see them using any PLM in their configuration files. Any comments?

Jyonn commented 7 months ago

Fastformer: --model config/model/lego_fastformer.yaml. You can use --exp config/exp/tt-naml.yaml.
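
For example, a full command in the same style as the earlier ones (hyperparameters copied from above, not tuned specifically for Fastformer):

python worker.py \
 --version small \
 --data config/data/mind.yaml \
 --embed config/embed/null.yaml \
 --model config/model/lego_fastformer.yaml \
 --exp config/exp/tt-naml.yaml \
 --embed_hidden_size 64 \
 --hidden_size 64 \
 --lr 0.001 \
 --batch_size 256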

PLMNR-* are early-version configurations. You can still use them, but they only train a transformer network from scratch with a specific number of layers. If you want to load a pretrained model, please use the BERT-* series.

chiyuzhang94 commented 7 months ago

Got it. Thanks.

For the BERT-* series, I wonder how I should modify your yaml file to finetune the whole PLM without using LoRA. Could you give any guidance?

Jyonn commented 7 months ago

You can use --lora 0 --layer 0.

chiyuzhang94 commented 7 months ago

Hi @Jyonn ,

I tried to run BERT-NRMS but encountered some argument errors, so I modified the yaml file as follows. However, even with the batch size set to 1, I still encountered a GPU out-of-memory (OOM) error. Any suggestions?

name: BERT-NRMS.D${model.config.hidden_size}.L${model.config.news_config.layer_split}.Lora${model.config.news_config.lora}
meta:
  item: Bert
  user: Attention
  predictor: Dot
config:
  use_news_content: true
  use_neg_sampling: true
  use_item_content: true
  use_fast_eval: ${fast_eval}$
  max_news_content_batch_size: 0
  same_dim_transform: false
  embed_hidden_size: ${embed_hidden_size}$
  hidden_size: ${hidden_size}$
  neg_count: 4
  news_config:
    llm_dir: bert-base-uncased
    layer_split: ${layer}$
    lora: ${lora}$
    lora_r: ${lora_r}$
  item_config:
    llm_dir: bert-base-uncased
    layer_split: ${layer}$
    lora: ${lora}$
    lora_r: ${lora_r}$
  user_config:
    num_attention_heads: 8
    inputer_config:
      use_cls_token: false
      use_sep_token: false
Jyonn commented 7 months ago

Hi Chiyu,

Which GPU device do you use? You may add max_item_content_batch_size: 64 # or 512 to the model config. But OOM should not happen even if the batch size is set to 1. Could you provide the entire config dict, which is printed at the very beginning at [00:00:00]?

Jyonn commented 7 months ago

BTW, you should extract the BERT word embeddings to a numpy file and use --embed config/embed/bert-token.yaml.

name: bert-token
embeddings:
  -
    vocab_name: bert
    vocab_type: numpy
    path: <path-to-embedding>
    frozen: ${frozen}$
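
A minimal sketch of the extraction, assuming bert-base-uncased from Hugging Face transformers (the output path is just an example; point the yaml's path field at wherever you save the file):

# Export BERT's input word-embedding matrix to a .npy file for bert-token.yaml.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('bert-base-uncased')
embeddings = model.get_input_embeddings().weight.detach().cpu().numpy()  # shape (30522, 768)
np.save('data/bert_base_embedding.npy', embeddings)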
chiyuzhang94 commented 7 months ago

Do you mean putting model.embeddings.word_embeddings.weight into a numpy file?

chiyuzhang94 commented 7 months ago

I am using an A100 40GB. Here is the log:

[00:00:00] |Worker| python  worker.py --version small --data config/data/mind.yaml --embed config/embed/null.yaml --model config/model/llm/bert-nrms.yaml --exp config/exp/tt-nrms.yaml --embed config/embed/bert-token.yaml --embed_hidden_size 768 --hidden_size 64 --lora 0 --layer 0 --fast_eval false --lr 5e-5 --batch_size 1
[00:00:00] |Worker| {
    "version": "small",
    "data": {
        "name": "MIND-small",
        "base_dir": "data/MIND-small-v3",
        "item": {
            "filter_cache": true,
            "depot": "data/MIND-small-v3/news",
            "order": [
                "title",
                "cat"
            ],
            "append": [
                "nid"
            ]
        },
        "user": {
            "filter_cache": true,
            "depots": {
                "train": {
                    "path": "data/MIND-small-v3/train"
                },
                "dev": {
                    "path": "data/MIND-small-v3/dev"
                },
                "test": {
                    "path": "data/MIND-small-v3/test"
                }
            },
            "filters": {
                "history": [
                    "x"
                ]
            },
            "union": [
                "data/MIND-small-v3/user",
                "data/MIND-small-v3/neg"
            ],
            "candidate_col": "nid",
            "clicks_col": "history",
            "label_col": "click",
            "neg_col": "neg",
            "group_col": "imp",
            "user_col": "uid",
            "index_col": "index"
        }
    },
    "embed": {
        "name": "bert-token",
        "embeddings": [
            {
                "vocab_name": "bert",
                "vocab_type": "numpy",
                "path": "data/bert_base_embedding.npy",
                "frozen": true
            }
        ]
    },
    "model": {
        "name": "BERT-NRMS.D64.L0.Lora0",
        "meta": {
            "item": "Bert",
            "user": "Attention",
            "predictor": "Dot"
        },
        "config": {
            "use_news_content": true,
            "use_neg_sampling": true,
            "use_item_content": true,
            "use_fast_eval": false,
            "max_news_content_batch_size": 0,
            "max_item_content_batch_size": 64,
            "same_dim_transform": false,
            "embed_hidden_size": 768,
            "hidden_size": 64,
            "neg_count": 4,
            "news_config": {
                "llm_dir": "bert-base-uncased",
                "layer_split": 0,
                "lora": 0,
                "lora_r": 32
            },
            "item_config": {
                "llm_dir": "bert-base-uncased",
                "layer_split": 0,
                "lora": 0,
                "lora_r": 32
            },
            "user_config": {
                "num_attention_heads": 8,
                "inputer_config": {
                    "use_cls_token": false,
                    "use_sep_token": false
                }
            }
        }
    },
    "exp": {
        "name": "train_test",
        "dir": "saving/MIND-small/BERT-NRMS.D64.L0.Lora0/bert-token-train_test",
        "log": "saving/MIND-small/BERT-NRMS.D64.L0.Lora0/bert-token-train_test/exp.log",
        "mode": "train_test",
        "load": {
            "save_dir": null,
            "epochs": null,
            "model_only": true,
            "strict": true,
            "wait": false
        },
        "store": {
            "top": 1,
            "early_stop": 2
        },
        "policy": {
            "epoch_start": 0,
            "epoch": 50,
            "lr": 0.0005,
            "freeze_emb": false,
            "pin_memory": false,
            "batch_size": 200,
            "device": "gpu",
            "n_warmup": 0,
            "check_interval": -2,
            "simple_dev": true
        },
        "metrics": [
            "AUC",
            "GAUC",
            "MRR",
            "NDCG@5",
            "NDCG@10"
        ]
    },
    "embed_hidden_size": 768,
    "hidden_size": 64,
    "lora": 0,
    "layer": 0,
    "fast_eval": false,
    "lr": 5e-05,
    "batch_size": 1,
    "warmup": 0,
    "simple_dev": false,
    "acc_batch": 1,
    "lora_r": 32,
    "item_lr": 1e-05,
    "mind_large_submission": false,
    "epoch_batch": 0,
    "max_item_batch_size": 0,
    "page_size": 512,
    "patience": 2,
    "epoch_start": 0,
    "frozen": true,
    "load_path": null,
    "rand": {},
    "time": {},
    "seed": 2023
}
[00:00:00] |GPU| choose 0 GPU with 40508 / 40960 MB
Traceback (most recent call last):
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/chiyu/Legommenders/worker.py", line 488, in <module>
    worker.run()
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/chiyu/Legommenders/worker.py", line 441, in run
    epoch = self.train_runner()
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/chiyu/Legommenders/worker.py", line 384, in train_runner
    return self.train()
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/chiyu/Legommenders/worker.py", line 169, in train
    loss = self.legommender(batch=batch)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/chiyu/Legommenders/model/legommender.py", line 210, in forward
    user_embeddings = self.get_user_content(batch)
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/chiyu/Legommenders/model/legommender.py", line 185, in get_user_content
    clicks = self.get_item_content(batch, self.clicks_col)
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/chiyu/Legommenders/model/legommender.py", line 171, in get_item_content
    content = self.item_encoder(item_content[start:end], mask=mask)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/chiyu/Legommenders/model/operators/base_llm_operator.py", line 103, in forward
    llm_output = self.transformer(
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 1020, in forward
    encoder_outputs = self.encoder(
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 610, in forward
    layer_outputs = layer_module(
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 495, in forward
    self_attention_outputs = self.attention(
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 425, in forward
    self_outputs = self.self(
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py", line 284, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/chiyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacty of 39.56 GiB of which 2.81 MiB is free. Including non-PyTorch memory, this process has 39.55 GiB memory in use. Of the allocated memory 38.22 GiB is allocated by PyTorch, and 863.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Jyonn commented 7 months ago

It seems you use tt-nrms.yaml as the exp configuration, which hard-codes the batch size to 200.

My suggestions:

  1. Use tt-naml.yaml for all matching-based models, and set a smaller batch size like 64.
  2. The learning rate --lr 5e-5 is too small; 1e-3 is recommended. You can set --item_lr 5e-5 if you want a smaller learning rate for the language model (item encoder).
  3. Since your GPU memory is 40GB, max_item_content_batch_size can be larger, like 512.
  4. I highly recommend using --fast_eval 1, which will accelerate the evaluation phase (a combined example command is shown below).
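
Putting these together, a revised command might look like this (paths taken from your command above, plus max_item_content_batch_size: 512 in the model config; the exact values are just a starting point):

python worker.py \
 --version small \
 --data config/data/mind.yaml \
 --embed config/embed/bert-token.yaml \
 --model config/model/llm/bert-nrms.yaml \
 --exp config/exp/tt-naml.yaml \
 --embed_hidden_size 768 \
 --hidden_size 64 \
 --lora 0 \
 --layer 0 \
 --fast_eval 1 \
 --lr 1e-3 \
 --item_lr 5e-5 \
 --batch_size 64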
chiyuzhang94 commented 7 months ago

Thanks @Jyonn ! It works now.

What is the difference between lr and item_lr?

Jyonn commented 7 months ago

When LLMs are used as the item encoder, item_lr is activated, as those pretrained models may need a smaller learning rate. lr is the learning rate for the user encoder and the predictor.
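
For example, in the revised command above, --lr 1e-3 applies to the user encoder and predictor, while --item_lr 5e-5 applies only to the pretrained BERT item encoder.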