HKUDS / GraphGPT

[SIGIR'2024] "GraphGPT: Graph Instruction Tuning for Large Language Models"
https://arxiv.org/abs/2310.13023
Apache License 2.0
493 stars 36 forks source link

StopIteration: Caught StopIteration in replica 0 on device 0. #72

Open 447972815 opened 2 months ago

447972815 commented 2 months ago

We ran into the same problem as issue#17, but we still got an error even though we had to comment out "replace_llama_attn_with_flash_attn()"

0%|                                                | 0/274797 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/accelerate/accelerator.py", line 1058, in accumulate
    yield
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 185, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 200, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/graph_learning/GraphGPT-main/graphgpt/model/GraphLlama.py", line 325, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/graph_learning/GraphGPT-main/graphgpt/model/GraphLlama.py", line 202, in forward
    node_forward_out = graph_tower(g)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/graphgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/graph_learning/GraphGPT-main/graphgpt/model/graph_layers/graph_transformer.py", line 64, in forward
    device = self.parameters().__next__().device
StopIteration

  0%|          | 0/274797 [01:57<?, ?it/s]