eXascaleInfolab / ActiveLink

Deep active learning framework for link prediction in knowledge graph

RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:196 #4

Open t170815518 opened 4 years ago

t170815518 commented 4 years ago

Hi, I am trying to run the code on a GPU, but a RuntimeError (CUDA runtime error 710) occurs:

2020-08-20 14:39:31,710 INFO  1 iteration of active learning: started
2020-08-20 14:39:31,711 INFO  Train model: started
2020-08-20 14:39:31,711 INFO  1 epoch: started
2020-08-20 14:39:31,711 INFO  Inner loop: started
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=196 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 146, in <module>
    main()
  File "main.py", line 135, in main
    model = run_meta_incremental(config, model, train_batcher, test_rank_batcher)
  File "/home/auser/buser/ActiveLink/meta_incr_training.py", line 158, in run_meta_incremental
    g = run_inner(config, model, task)
  File "/home/auser/buser/ActiveLink/meta_incr_training.py", line 120, in run_inner
    pred = model.forward(e1, rel)
  File "/home/auser/buser/ActiveLink/models.py", line 50, in forward
    stacked_inputs = torch.cat([e1_embedded, rel_embedded], 2)  # out: (128L, 1L, 20L, 20L)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:196
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [114,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [114,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [114,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [114,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [114,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [114,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [114,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [114,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

t170815518 commented 4 years ago

I used the debugger to find out where it goes wrong. Before e1 and rel are embedded, they are both int64 tensors of shape torch.Size([128, 1]).

e1 is embedded normally, yielding a torch.float32 tensor of shape torch.Size([128, 1, 10, 20]). However, after rel passes through the emb_rel embedding layer, the debugger shows every tensor as "Unable to get repr for <class 'torch.Tensor'>".
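
The assertion `srcIndex < srcSelectDimSize` in the log means that some index passed to a lookup is at least as large as the table it indexes. CUDA reports such errors asynchronously, so the Python traceback can point at a later line (here the torch.cat call) rather than at the lookup itself; running on the CPU or with the environment variable CUDA_LAUNCH_BLOCKING=1 usually gives a more precise location. Below is a minimal sketch of a range check that could be run before the embedding call. The helper names are hypothetical and not part of ActiveLink, and the usage shown in the comment assumes the model exposes the layer as model.emb_rel.

```python
def find_out_of_range(indices, num_embeddings):
    """Return (position, value) pairs for indices outside [0, num_embeddings)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < num_embeddings
    ]


def check_indices(name, indices, num_embeddings):
    """Raise a readable IndexError instead of a device-side assert."""
    bad = find_out_of_range(indices, num_embeddings)
    if bad:
        raise IndexError(
            "{}: {} index(es) outside [0, {}), first offenders: {}".format(
                name, len(bad), num_embeddings, bad[:5]
            )
        )


# Assumed usage with PyTorch tensors (attribute name emb_rel is an assumption):
#   check_indices("rel", rel.reshape(-1).tolist(), model.emb_rel.num_embeddings)
```

The check works on plain Python integers, so the tensor is copied to the host first; that is slow but acceptable for a one-off debugging run.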

t170815518 commented 4 years ago

This happens because, taking the FB15k-237 dataset as an example, relation2id.txt already includes the reverse relations, and build_vocabs() in main.py handles them a second time. The resulting id mapping is wrong and does not match the size of the embedding layer's look-up table. See also https://github.com/eXascaleInfolab/ActiveLink/issues/5#issue-683591605.
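
As an illustration only, here is a sketch of a vocabulary builder that registers each relation and its reverse exactly once, so the ids stay contiguous and the largest id is always below the vocabulary size. This is not the actual build_vocabs() code: the function name, the "_reverse" suffix, and the input format are all assumptions, and the real dataset files may mark reverse relations differently.

```python
REVERSE_SUFFIX = "_reverse"  # hypothetical marker for reverse relations


def build_relation_vocab(relations, reverse_suffix=REVERSE_SUFFIX):
    """Map every relation and its reverse to a contiguous id, each only once."""
    rel2id = {}
    for rel in relations:
        # Strip the suffix so a relation and its reverse share one base name.
        if rel.endswith(reverse_suffix):
            base = rel[: -len(reverse_suffix)]
        else:
            base = rel
        # Register the forward and reverse names, skipping any already seen.
        for name in (base, base + reverse_suffix):
            if name not in rel2id:
                rel2id[name] = len(rel2id)
    return rel2id


# Input that already contains one reverse relation, as described for relation2id.txt.
vocab = build_relation_vocab(["born_in", "born_in_reverse", "works_at"])
num_relations = len(vocab)  # size to use for the relation embedding table
```

With a mapping built this way, sizing the relation embedding table from len(vocab) keeps every id inside the table.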