ProvenanceAnalytics / kairos


Bug in train.py of StreamSpot #11

Open Joney-Yf opened 4 months ago


Description

After running train.py for a while, training failed with the following error:

/opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [110,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [111,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [112,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [113,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [10,0,0], thread: [114,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
  0%|                                                                                                                                                                   | 0/10 [26:49<?, ?it/s]
Traceback (most recent call last):
  File "/data/yangfan/kairos/StreamSpot/src/train.py", line 231, in <module>
    loss = train()
  File "/data/yangfan/kairos/StreamSpot/src/train.py", line 159, in train
    n_id, edge_index, e_id = neighbor_loader(n_id)
  File "/home/yangfan/anaconda3/envs/kairos/lib/python3.9/site-packages/torch_geometric/nn/models/tgn.py", line 230, in __call__
    neighbors, nodes, e_id = neighbors[mask], nodes[mask], e_id[mask]
RuntimeError: CUDA error: device-side assert triggered
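CUDA kernels are launched asynchronously, so the Python frame shown in the traceback is not necessarily the operation that produced the bad index. One way to narrow it down is to validate node ids on the host before they are used for indexing. The helper below is a minimal sketch; `check_node_ids` and `num_nodes` are hypothetical names that do not appear in train.py.

```python
import torch


def check_node_ids(n_id: torch.Tensor, num_nodes: int, name: str = "n_id") -> None:
    """Raise IndexError if any id falls outside [0, num_nodes - 1].

    Hypothetical debugging helper, not part of the kairos code base.
    Reading min/max as Python ints forces a device sync, so the failure
    is reported at the call site instead of inside a later CUDA kernel.
    """
    if n_id.numel() == 0:
        return
    lo, hi = int(n_id.min()), int(n_id.max())
    if lo < 0 or hi >= num_nodes:
        raise IndexError(
            f"{name} has values in [{lo}, {hi}], expected [0, {num_nodes - 1}]"
        )
```

Calling it just before `neighbor_loader(n_id)`, for example `check_node_ids(n_id, num_nodes)`, would report the offending range directly. Running the script with the environment variable `CUDA_LAUNCH_BLOCKING=1` is another way to make the traceback point at the failing kernel.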

Possible Reason and solution

The problem appears to be in lines 132-133 of train.py:

  neg_dst = torch.randint(min_dst_idx, max_dst_idx + 1, (src.size(0),),
                          dtype=torch.long, device=device)

Because the upper bound of torch.randint is exclusive, passing max_dst_idx + 1 allows neg_dst to take the value max_dst_idx, which is the same as max_node. That index is out of bounds: the largest valid value of neg_dst is max_dst_idx - 1.

I therefore changed the code as follows:

  neg_dst = torch.randint(min_dst_idx, max_dst_idx, (src.size(0),),
                          dtype=torch.long, device=device)
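
The effect of the two upper bounds can be checked on the CPU, where an out-of-range index raises an ordinary IndexError instead of a device-side assert. This is a minimal sketch, not the code from train.py: `num_nodes`, `memory`, and `sample_negatives` are hypothetical names, and it assumes, as described above, that max_dst_idx equals the size of the table being indexed.

```python
import torch


def sample_negatives(low: int, high: int, count: int, inclusive_high: bool) -> torch.Tensor:
    """Sample `count` ids from [low, high] if inclusive_high, else from [low, high)."""
    # torch.randint excludes its upper bound, so "+ 1" makes `high` itself reachable.
    upper = high + 1 if inclusive_high else high
    return torch.randint(low, upper, (count,), dtype=torch.long)


num_nodes = 5                      # hypothetical table size: valid ids are 0..4
min_dst_idx, max_dst_idx = 0, 5    # assumption: max_dst_idx == num_nodes
memory = torch.zeros(num_nodes, 3)  # stand-in for any per-node tensor

torch.manual_seed(0)
old = sample_negatives(min_dst_idx, max_dst_idx, 10_000, inclusive_high=True)
new = sample_negatives(min_dst_idx, max_dst_idx, 10_000, inclusive_high=False)

# Original bound: ids may reach max_dst_idx, which is past the end of `memory`.
try:
    memory[old]
    old_ok = True
except IndexError:
    old_ok = False

# Changed bound: ids stay strictly below max_dst_idx, so indexing is safe.
new_rows = memory[new]

print("old max:", int(old.max()), "indexing ok:", old_ok)
print("new max:", int(new.max()), "rows:", tuple(new_rows.shape))
```

If max_dst_idx were instead the largest valid id rather than the table size, the original `+ 1` would be correct, so it is worth confirming how max_dst_idx is computed before applying the change.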