McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

TypeError: LlamaBiModel._update_causal_mask() takes from 4 to 5 positional arguments but 6 were given #87

Closed Β· guanchangge closed this 4 months ago

guanchangge commented 4 months ago

Hi,

When I run the script `python experiments/run_mntp.py train_configs/mntp/MetaLlama3.json`, the following error occurs. Full console output:

A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.                                   
Setting a new token will erase the existing one.                                                                                                                                     
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .                                                                                 

Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /home/changge/.cache/huggingface/token
Login successful
/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/transformers/training_args.py:1474: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of πŸ€— Transformers. Use eval_strategy instead
warnings.warn(
05/29/2024 19:17:13 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 4, distributed training: False, 16-bits training: False
05/29/2024 19:17:13 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=4,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_steps=100,
eval_strategy=IntervalStrategy.STEPS,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant': False},
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=output/mntp/Meta-Llama-3-8B-Instruct/runs/May29_19-17-13_sn4622119311,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=output/mntp/Meta-Llama-3-8B-Instruct,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=output/mntp/Meta-Llama-3-8B-Instruct,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
Overwrite dataset info from restored data version if exists.
05/29/2024 19:17:15 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3
05/29/2024 19:17:15 - INFO - datasets.info - Loading Dataset info from /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3
Found cached dataset wikitext (/home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3)
05/29/2024 19:17:15 - INFO - datasets.builder - Found cached dataset wikitext (/home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3)
Loading Dataset info from /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3
05/29/2024 19:17:15 - INFO - datasets.info - Loading Dataset info from /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3
/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 654/654 [00:00<00:00, 4.98MB/s]
[INFO|configuration_utils.py:733] 2024-05-29 19:17:15,651 >> loading configuration file config.json from cache at /home/changge/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/config.json
[INFO|configuration_utils.py:796] 2024-05-29 19:17:15,652 >> Model config LlamaConfig {
  "_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.0",
  "use_cache": true,
  "vocab_size": 128256
}
tokenizer_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 51.0k/51.0k [00:00<00:00, 5.15MB/s]
tokenizer.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 9.09M/9.09M [00:00<00:00, 27.3MB/s]
special_tokens_map.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 73.0/73.0 [00:00<00:00, 624kB/s]
[INFO|tokenization_utils_base.py:2108] 2024-05-29 19:17:16,306 >> loading file tokenizer.json from cache at /home/changge/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-05-29 19:17:16,306 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-05-29 19:17:16,306 >> loading file special_tokens_map.json from cache at /home/changge/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-05-29 19:17:16,306 >> loading file tokenizer_config.json from cache at /home/changge/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/tokenizer_config.json
[WARNING|logging.py:314] 2024-05-29 19:17:16,564 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model.safetensors.index.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 23.9k/23.9k [00:00<00:00, 115MB/s]
[INFO|modeling_utils.py:3474] 2024-05-29 19:17:16,710 >> loading weights file model.safetensors from cache at /home/changge/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/model.safetensors.index.json
model-00001-of-00004.safetensors: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4.98G/4.98G [00:46<00:00, 106MB/s]
model-00002-of-00004.safetensors: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5.00G/5.00G [00:47<00:00, 105MB/s]
model-00003-of-00004.safetensors: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4.92G/4.92G [00:45<00:00, 108MB/s]
model-00004-of-00004.safetensors: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.17G/1.17G [00:10<00:00, 111MB/s]
Downloading shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [02:31<00:00, 37.75s/it]
[INFO|modeling_utils.py:1519] 2024-05-29 19:19:47,712 >> Instantiating LlamaBiForMNTP model under default dtype torch.bfloat16.
[WARNING|logging.py:329] 2024-05-29 19:19:47,715 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
[INFO|configuration_utils.py:962] 2024-05-29 19:19:47,717 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009
}

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:01<00:00, 3.49it/s]
[INFO|modeling_utils.py:4280] 2024-05-29 19:19:48,957 >> All model checkpoint weights were used when initializing LlamaBiForMNTP.

[INFO|modeling_utils.py:4288] 2024-05-29 19:19:48,957 >> All the weights of LlamaBiForMNTP were initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaBiForMNTP for predictions without further training.
generation_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 187/187 [00:00<00:00, 2.59MB/s]
[INFO|configuration_utils.py:917] 2024-05-29 19:19:49,050 >> loading configuration file generation_config.json from cache at /home/changge/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/generation_config.json
[INFO|configuration_utils.py:962] 2024-05-29 19:19:49,050 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128009
  ],
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
Model's Lora trainable parameters:
trainable params: 41,943,040 || all params: 7,546,867,712 || trainable%: 0.5558
Running tokenizer on every text in dataset: 0%| | 0/4358 [00:00<?, ? examples/s]
Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-c66962c78cb5529c.arrow
05/29/2024 19:19:49 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-c66962c78cb5529c.arrow
Running tokenizer on every text in dataset: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4358/4358 [00:00<00:00, 23444.76 examples/s]
Running tokenizer on every text in dataset: 0%| | 0/1801350 [00:00<?, ? examples/s]
Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-655adfdce63258ce.arrow
05/29/2024 19:19:49 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-655adfdce63258ce.arrow
Running tokenizer on every text in dataset: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1801350/1801350 [01:23<00:00, 21494.56 examples/s]
Running tokenizer on every text in dataset: 0%| | 0/3760 [00:00<?, ? examples/s]
Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-06c641e9a21337eb.arrow
05/29/2024 19:21:13 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-06c641e9a21337eb.arrow
Running tokenizer on every text in dataset: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3760/3760 [00:00<00:00, 22418.88 examples/s]
Grouping texts in chunks of 512: 0%| | 0/4358 [00:00<?, ? examples/s]
Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-97b8af2be572f3da.arrow
05/29/2024 19:21:13 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-97b8af2be572f3da.arrow
Grouping texts in chunks of 512: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4358/4358 [00:00<00:00, 16710.13 examples/s]
Grouping texts in chunks of 512: 0%| | 0/1801350 [00:00<?, ? examples/s]
Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-91f1e93e8c2532e8.arrow
05/29/2024 19:21:14 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-91f1e93e8c2532e8.arrow
Grouping texts in chunks of 512: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1801350/1801350 [01:47<00:00, 16736.33 examples/s]
Grouping texts in chunks of 512: 0%| | 0/3760 [00:00<?, ? examples/s]
Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-c698bf63ec328b6b.arrow
05/29/2024 19:23:01 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/changge/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/0.0.0/b08601e04326c79dfdd32d625aee71d232d685c3/cache-c698bf63ec328b6b.arrow
Grouping texts in chunks of 512: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3760/3760 [00:00<00:00, 16511.18 examples/s]
/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/transformers/utils/import_utils.py:533: FutureWarning: is_torch_tpu_available is deprecated and will be removed in 4.41.0. Please use the is_torch_xla_available instead.
warnings.warn(
[INFO|trainer.py:2078] 2024-05-29 19:23:03,549 >> ***** Running training *****
[INFO|trainer.py:2079] 2024-05-29 19:23:03,549 >> Num examples = 237,180
[INFO|trainer.py:2080] 2024-05-29 19:23:03,549 >> Num Epochs = 3
[INFO|trainer.py:2081] 2024-05-29 19:23:03,549 >> Instantaneous batch size per device = 32
[INFO|trainer.py:2083] 2024-05-29 19:23:03,549 >> Training with DataParallel so batch size has been adjusted to: 128
[INFO|trainer.py:2084] 2024-05-29 19:23:03,549 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2085] 2024-05-29 19:23:03,549 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2086] 2024-05-29 19:23:03,549 >> Total optimization steps = 5,559
[INFO|trainer.py:2087] 2024-05-29 19:23:03,552 >> Number of trainable parameters = 567,279,616
0%| | 0/5559 [00:00<?, ?it/s]
[WARNING|logging.py:329] 2024-05-29 19:23:39,022 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Traceback (most recent call last):
  File "/data/changge/project/llm2vec/experiments/run_mntp.py", line 982, in <module>
    main()
  File "/data/changge/project/llm2vec/experiments/run_mntp.py", line 930, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 1164, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/peft/peft_model.py", line 642, in forward
    return self.get_base_model()(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/changge/software/anaconda/envs/ll2vec/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 940, in forward
    causal_mask = self._update_causal_mask(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LlamaBiModel._update_causal_mask() takes from 4 to 5 positional arguments but 6 were given
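
For readers who hit the same trace: the failure mode is a subclass override built against an older `transformers` signature. llm2vec's `LlamaBiModel` overrides the private `_update_causal_mask` method of `LlamaModel`, and newer `transformers` releases pass more positional arguments to that method than the override accepts. A minimal sketch of this failure mode, with toy classes standing in for the real ones (this is not the actual llm2vec or transformers code):

```python
# Toy reproduction of the failure mode: a subclass overrides a private
# method, then the base class evolves to pass an extra positional argument.
class BaseModel:  # stands in for a newer transformers LlamaModel
    def forward(self, hidden, mask):
        # The newer base class passes 4 positional arguments (plus self) ...
        return self._update_causal_mask(mask, hidden, None, False)

class BiModel(BaseModel):  # stands in for an override written for an older base
    def _update_causal_mask(self, attention_mask, input_tensor, cache_position=None):
        # ... but the override only accepts the older, shorter signature.
        return attention_mask

BiModel().forward("hidden", "mask")
# TypeError: BiModel._update_causal_mask() takes from 3 to 4 positional
# arguments but 5 were given -- the same shape as the error above.
```

Because the clash is between a pinned override and a moving upstream signature, the fix is to keep the two libraries at versions that agree, as the reply below points out.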

vaibhavad commented 4 months ago

Hi @guanchangge,

Thanks for your interest in our work. This most likely arises due to a mismatch between the transformers and llm2vec library versions. Please check out the solution and discussion here.
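
A quick way to confirm which versions are actually installed in the failing environment is sketched below (a generic, standard-library-only check; the compatible version pair is given in the linked discussion):

```python
# List the installed versions of the packages involved in this traceback.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "llm2vec", "torch", "peft"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```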

Let me know if you have any more questions.

guanchangge commented 4 months ago

Thanks for your answer. I used `pip install -e .` to install llm2vec, and it now works with transformers==4.41.0.
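
For completeness, a small sanity check after the editable install, to confirm that the patched class and the installed transformers agree on the method's arity. The `llm2vec.models` import path is assumed from the repository layout; adjust it if your checkout differs:

```python
# Sanity check: the overridden method should accept as many positional
# arguments as the installed transformers version passes to it.
import inspect

import transformers
from llm2vec.models import LlamaBiModel  # assumed import path

print("transformers", transformers.__version__)
print(inspect.signature(LlamaBiModel._update_causal_mask))
```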