HKUDS / GraphGPT

[SIGIR'2024] "GraphGPT: Graph Instruction Tuning for Large Language Models"
https://arxiv.org/abs/2310.13023
Apache License 2.0

Error in Self-Supervised Instruction Tuning #17

Closed aiwen7 closed 11 months ago

aiwen7 commented 1 year ago

Hi there, thanks for offering this interesting project! I ran into trouble while running the Self-Supervised Instruction Tuning stage. The error is as follows:

../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [29074,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed. terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 15, in <module>
    train()
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
    trainer.train()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
    outputs = self.model(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 277, in forward
    return super(GraphLlamaModel, self).forward(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 912, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/train/llama_flash_attn_monkey_patch.py", line 88, in forward
    output_unpad = flash_attn_unpadded_qkvpacked_func(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 256, in flash_attn_unpadded_qkvpacked_func
    return FlashAttnQKVPackedFunc.apply(qkv, cu_seqlens, max_seqlen, dropout_p, softmax_scale,
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 59, in forward
    qkv[:, 0], qkv[:, 1], qkv[:, 2], torch.empty_like(qkv[:, 0]), cu_seqlens, cu_seqlens,
RuntimeError: CUDA error: device-side assert triggered
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510117 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510118 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 510120 closing signal SIGTERM

I used the suggested configuration (environment and scripts) and ran the tuning in distributed mode on a Linux server with 4 A100 GPUs. I also tried running the tuning on a single GPU, with the train/eval batch size set to 1 to avoid a CUDA OOM error. However, that run hit a different error:

Token indices sequence length is longer than the specified maximum sequence length for this model (3338 > 2048). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_mem.py", line 15, in <module>
    train()
  File "/home/k/lgm/graphGPT-main/graphgpt/train/train_graph.py", line 943, in train
    trainer.train()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 332, in forward
    outputs = self.model(
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/GraphLlama.py", line 209, in forward
    node_forward_out = graph_tower(g)
  File "/home/k/anaconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/k/lgm/graphGPT-main/graphgpt/model/graph_layers/graph_transformer.py", line 64, in forward
    device = self.parameters().__next__().device
StopIteration

As a result, I cannot reproduce the tuning process on either a single GPU or multiple GPUs. Any suggestions for troubleshooting would be appreciated. Looking forward to your kind reply!
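
The tokenizer warning in the second log (3338 > 2048) says that over-long sequences "will result in indexing errors", so one cheap diagnostic is to check how many tokenized samples exceed the limit before training starts. The following is a minimal sketch, not GraphGPT code: the helper names `find_overlong` and `truncate` are hypothetical, and it assumes each sample is already available as a plain list of token ids.

```python
from typing import List, Sequence

# Limit quoted in the tokenizer warning above; adjust it to your model.
MODEL_MAX_LENGTH = 2048


def find_overlong(samples: Sequence[Sequence[int]],
                  max_len: int = MODEL_MAX_LENGTH) -> List[int]:
    """Return the indices of samples whose token count exceeds max_len."""
    return [i for i, ids in enumerate(samples) if len(ids) > max_len]


def truncate(ids: Sequence[int], max_len: int = MODEL_MAX_LENGTH) -> List[int]:
    """Keep only the first max_len token ids of one sample."""
    return list(ids[:max_len])


if __name__ == "__main__":
    # Toy token-id lists standing in for real tokenized samples.
    samples = [[1] * 3338, [1] * 100]
    print(find_overlong(samples))     # [0]
    print(len(truncate(samples[0])))  # 2048
```

Whether truncating is the right remedy depends on how the training data is built; the sketch is only meant to show whether over-long samples reach the model at all.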

W-rudder commented 1 year ago

Excuse me, have you solved this problem?

tjb-tech commented 11 months ago

Thank you for your interest in our GraphGPT. I apologize for the delayed response due to the academic workload at the end of the semester. You may be able to fix this error by commenting out replace_llama_attn_with_flash_attn() on line 8 of https://github.com/HKUDS/GraphGPT/blob/main/graphgpt/train/train_mem.py. The file would then look like this:

# Make it more memory efficient by monkey patching the LLaMA model with FlashAttn.

# Need to call this before importing transformers.
from graphgpt.train.llama_flash_attn_monkey_patch import (
    replace_llama_attn_with_flash_attn,
)

# replace_llama_attn_with_flash_attn()

from graphgpt.train.train_graph import train

if __name__ == "__main__":
    train()

If that doesn't work, feel free to ask me further. Wishing you an early Merry Christmas!
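
As an alternative to editing the file each time, the patch call could be guarded by a switch. The sketch below is only an illustration under assumptions: the environment variable `GRAPHGPT_USE_FLASH_ATTN` and the helpers `flash_attn_requested` and `maybe_patch` are hypothetical and do not exist in GraphGPT.

```python
import os
from typing import Callable, Mapping

# Hypothetical variable name; it is not defined anywhere in GraphGPT.
FLAG_NAME = "GRAPHGPT_USE_FLASH_ATTN"


def flash_attn_requested(env: Mapping[str, str] = os.environ) -> bool:
    """True only when the flag is explicitly set to a truthy value."""
    return env.get(FLAG_NAME, "0").strip().lower() in {"1", "true", "yes"}


def maybe_patch(patch: Callable[[], None],
                env: Mapping[str, str] = os.environ) -> bool:
    """Call patch() if the flag is set; report whether it was applied."""
    if flash_attn_requested(env):
        patch()
        return True
    return False
```

In train_mem.py, calling `maybe_patch(replace_llama_attn_with_flash_attn)` in place of the direct call, still before the `train_graph` import, would leave the patch off unless the variable is set.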

tjb-tech commented 11 months ago

Thank you for your attention. Please refer to my reply above.

octopusStar218 commented 6 months ago

We ran into the same problem. Even after commenting out "replace_llama_attn_with_flash_attn()", we still got an error:

['model.embed_tokens.weight', 'model.graph_projector.weight', 'model.graph_projector.bias']
  0%| | 0/137400 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/accelerate/accelerator.py", line 1058, in accumulate
    yield
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/graph_learning/GraphGPT-main/graphgpt/model/GraphLlama.py", line 325, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/graph_learning/GraphGPT-main/graphgpt/model/GraphLlama.py", line 202, in forward
    node_forward_out = graph_tower(g)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/graph_learning/GraphGPT-main/graphgpt/model/graph_layers/graph_transformer.py", line 64, in forward
    device = self.parameters().__next__().device
StopIteration

  0%| | 0/137400 [02:11<?, ?it/s]