OOM during Training Preparation for Llama #6

Closed ChangyuChen347 closed 4 months ago

ChangyuChen347 commented 4 months ago

Thank you for your help at https://github.com/Jyonn/ONCE/issues/4. I can now run the code successfully. However, I consistently hit out-of-memory errors during Training Preparation, even on an A100 with 80GB. I've tried setting both batch_size and max_item_batch_size to 1, but the actual batch size at runtime still appears to be 512.

My config:

[00:00:00] |Worker| {
  "embed": {
    "name": "llama-token",
    "embeddings": [
      { "vocab_name": "llama", "vocab_type": "numpy", "path": "llama_emb.npy", "frozen": true }
    ]
  },
  "model": {
    "name": "LLAMA-NRMS.D64.L0.Lora0",
    "meta": { "item": "Llama", "user": "Attention", "predictor": "Dot" },
    "config": {
      "use_neg_sampling": true,
      "use_item_content": true,
      "max_item_content_batch_size": 0,
      "same_dim_transform": false,
      "embed_hidden_size": 4096,
      "hidden_size": 64,
      "neg_count": 4,
      "item_config": {
        "llm_dir": "/mnt/data_large/ccy/Llama-2-7b-hf",
        "layer_split": 0,
        "lora": 0,
        "weights_dir": "data/MIND-small-Llama/llama-7b-split"
      },
      "user_config": {
        "num_attention_heads": 8,
        "inputer_config": { "use_cls_token": false, "use_sep_token": false }
      }
    }
  },
  "exp": {
    "name": "test_llm_layer_split",
    "dir": "saving/MIND-small-Llama/LLAMA-NRMS.D64.L0.Lora0/llama-token-test_llm_layer_split",
    "log": "saving/MIND-small-Llama/LLAMA-NRMS.D64.L0.Lora0/llama-token-test_llm_layer_split/exp.log",
    "mode": "test_llm_layer_split",
    "store": { "layers": [ 31, 30, 29, 27 ], "dir": "data/MIND-small-Llama/llama-7b-split" },
    "load": { "save_dir": null, "model_only": true, "strict": true, "wait": false },
    "policy": { "device": "gpu", "batch_size": 1 }
  },
  "data": {
    "name": "MIND-small-Llama",
    "base_dir": "data/MIND-small",
    "item": {
      "filter_cache": true,
      "depot": "data/MIND-small/news",
      "order": [ "title-llama", "cat-llama" ],
      "append": [ "nid" ],
      "lm_col": "title-llama"
    },
    "user": {
      "filter_cache": true,
      "depots": {
        "train": { "path": "data/MIND-small/train" },
        "dev": { "path": "data/MIND-small/dev" },
        "test": { "path": "data/MIND-small/test" }
      },
      "filters": { "history": [ "x" ] },
      "union": [ "data/MIND-small/user" ],
      "candidate_col": "nid",
      "clicks_col": "history",
      "label_col": "click",
      "neg_col": "neg",
      "group_col": "imp",
      "user_col": "uid",
      "index_col": "index"
    }
  },
  "version": "small",
  "llm_ver": "7b",
  "hidden_size": 64,
  "layer": 0,
  "lora": 0,
  "fast_eval": 0,
  "embed_hidden_size": 4096,
  "max_news_batch_size": 1,
  "max_item_batch_size": 1,
  "batch_size": 1,
  "warmup": 0,
  "simple_dev": false,
  "acc_batch": 1,
  "lora_r": 32,
  "lr": 0.0001,
  "item_lr": 1e-05,
  "mind_large_submission": false,
  "epoch_batch": 0,
  "page_size": 512,
  "patience": 2,
  "epoch_start": 0,
  "frozen": true,
  "load_path": null,
  "rand": {},
  "time": {},
  "seed": 2023
}

Jyonn commented 4 months ago


Please add max_item_content_batch_size: 512 to the model config rather than the exp config. The batch size can also be larger, for example 64.
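To illustrate the distinction, here is a minimal Python sketch based only on the keys visible in the config dump earlier in the thread. It shows that the top-level max_item_batch_size and the nested model.config.max_item_content_batch_size are separate entries, so setting one leaves the other unchanged. The helper name item_content_batch_size is hypothetical, and how ONCE actually reads these values is an assumption.

```python
# Trimmed-down copy of the merged configuration from the log in this thread.
config = {
    "model": {"config": {"max_item_content_batch_size": 0}},
    "max_item_batch_size": 1,  # top-level argument, as set in the original report
    "batch_size": 1,
}


def item_content_batch_size(cfg):
    """Hypothetical helper: return the value stored under model.config."""
    return cfg["model"]["config"]["max_item_content_batch_size"]


# Changing the top-level key does not touch the model-level key.
config["max_item_batch_size"] = 1
before = item_content_batch_size(config)

# Editing the model config (e.g. the model YAML file) is what changes it.
config["model"]["config"]["max_item_content_batch_size"] = 512
after = item_content_batch_size(config)

print(before, after)
```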

ChangyuChen347 commented 4 months ago

I set max_item_content_batch_size to 32, but I still run out of memory. When I print the shape of hidden_states, it is torch.Size([512, 33, 4096]), so the first dimension is 512 rather than 32.

name: LLAMA-NRMS.D${model.config.hidden_size}.L${model.config.item_config.layer_split}.Lora${model.config.item_config.lora}
meta:
  item: Llama
  user: Attention
  predictor: Dot
config:
  use_neg_sampling: true
  use_item_content: true
  max_item_content_batch_size: 32
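For a rough sense of scale, the sketch below compares the size of a single hidden_states tensor at the observed shape with the size it would have at a batch of 32. It assumes float32 (4 bytes per element); the dtype is not stated in this thread, and a full forward pass through the model holds many more tensors than this one, so the numbers are only a lower bound on the difference.

```python
def tensor_bytes(shape, bytes_per_element):
    """Number of bytes needed to store a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


FP32_BYTES = 4  # assumption: float32 activations

observed_shape = (512, 33, 4096)  # shape printed in the report above
expected_shape = (32, 33, 4096)   # shape if the batch were really 32

observed_mib = tensor_bytes(observed_shape, FP32_BYTES) / 2**20
expected_mib = tensor_bytes(expected_shape, FP32_BYTES) / 2**20

print(observed_mib)                 # 264.0
print(expected_mib)                 # 16.5
print(observed_mib / expected_mib)  # 16.0
```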

Jyonn commented 4 months ago


Sorry about that. Please add --page_size 64 to the command; I will fix this in the documentation as soon as possible. I also found that ColPromptMap in natural_concat_inputer did not support the current column names such as title-llama. That is now fixed, and you can get the update via git pull. I hope these issues have not affected your research progress.
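The effect of a smaller page size can be sketched as follows. This is not the ONCE implementation: it assumes that page_size caps how many items are passed through the model in one forward pass, which would be consistent with the default of 512 in the config and the observed batch dimension of 512. The function name iter_pages and the item count of 1000 are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_pages(items: Sequence[T], page_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most page_size items."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, len(items), page_size):
        yield list(items[start:start + page_size])


item_ids = list(range(1000))  # hypothetical item indices

# With page_size 64, no chunk exceeds 64 items, so peak activation
# memory per forward pass is bounded by the page rather than by 512.
page_lengths = [len(page) for page in iter_pages(item_ids, 64)]
print(len(page_lengths), max(page_lengths), page_lengths[-1])
```

Under this assumption, lowering page_size trades more forward passes for a smaller peak memory footprint.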


ChangyuChen347 commented 4 months ago

Hi, I have successfully run it now. However, my current results are slightly lower than those reported in the paper. Do you have any suggestions regarding the configuration?

python worker.py --data config/data/mind-llama.yaml --embed config/embed/llama-token.yaml --model config/model/llm/llama-nrms.yaml --exp config/exp/tt-llm.yaml --embed_hidden_size 4096 --llm_ver 7b --layer 31 --version small --lr 0.0001 --item_lr 0.00001 --batch_size 64 --acc_batch 1 --epoch_batch -4

Jyonn commented 4 months ago

Hi, there are several hyperparameters that can be tuned:

Model agnostic parameters:

Model specific parameters:

Please refer to config/model/llm/llama-nrms.yaml and modify the hyperparameters of the attention module.