RUCAIBox / NCL

[WWW'22] Official PyTorch implementation for "Improving Graph Collaborative Filtering with Neighborhood-enriched Contrastive Learning".

merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #5

Open Nishikata97 opened 2 years ago

Nishikata97 commented 2 years ago

Hello, I followed the environment versions given in README.md exactly, but I ran into an error that I cannot resolve. It may be a thread synchronization problem that surfaces as an illegal memory access: the code runs when I step through it in a debugger, but running it directly produces the following error:

Train 0: 0%| | 0/358 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/nishikata/Downloads/NCL-master/main.py", line 71, in <module>
    run_single_model(args)
  File "/home/nishikata/Downloads/NCL-master/main.py", line 45, in run_single_model
    train_data, valid_data, saved=True, show_progress=config['show_progress']
  File "/home/nishikata/Downloads/NCL-master/trainer.py", line 47, in fit
    train_loss = self._train_epoch(train_data, epoch_idx, show_progress=show_progress)
  File "/home/nishikata/Downloads/NCL-master/trainer.py", line 133, in _train_epoch
    losses = loss_func(interaction)
  File "/home/nishikata/Downloads/NCL-master/ncl.py", line 217, in calculate_loss
    user_all_embeddings, item_all_embeddings, embeddings_list = self.forward()
  File "/home/nishikata/Downloads/NCL-master/ncl.py", line 139, in forward
    all_embeddings = torch.sparse.mm(self.norm_adj_mat, all_embeddings)
  File "/home/nishikata/anaconda3/envs/NCL/lib/python3.7/site-packages/torch/sparse/__init__.py", line 84, in mm
    return torch._sparse_mm(mat1, mat2)
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Have you run into this problem while debugging the code? Thanks!
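CUDA kernels are launched asynchronously, so the Python line named in a traceback like the one above is not always where the faulty access happened. That may also be part of why stepping through in a debugger behaves differently from a direct run, though this is not a confirmed diagnosis. A minimal sketch of forcing synchronous launches while debugging follows; the helper name `enable_sync_cuda_errors` is hypothetical, and the variable should be set before CUDA is initialized, most safely before `torch` is imported.

```python
import os
import sys


def enable_sync_cuda_errors() -> bool:
    """Request synchronous CUDA kernel launches for debugging.

    Returns True if torch had not been imported yet when the variable
    was set (the safe case), False if the setting may come too late.
    """
    torch_already_loaded = "torch" in sys.modules
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch block until it
    # finishes, so an error is reported at the call that caused it.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_loaded


if __name__ == "__main__":
    set_in_time = enable_sync_cuda_errors()
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
    print("set before torch import:", set_in_time)
```

Calling this at the very top of the entry script (or exporting the variable in the shell before launching) slows training, so it is only meant for locating the failing call.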

hyp1231 commented 2 years ago

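Hello! Thanks for pointing this out. I have indeed run into this error. My initial experiments ran fine in an environment with cudatoolkit==10.1, but later, in an environment with cudatoolkit==11.3, I got the same error as this one.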

If we find the specific cause, we will update the code and reply in this issue. For now, we suggest testing in an environment with cudatoolkit==10.1.

Nishikata97 commented 2 years ago

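Hello, the existing environment runs on a 2080 Ti without problems. Because of constraints in our lab, we have to use a 3090, which is not compatible with versions before cudatoolkit==11.1. My initial suspicion is a version compatibility problem between cudatoolkit and faiss-gpu. The following commands resolve the problem:

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install faiss-gpu cudatoolkit=11.3 -c pytorch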
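The GPU dependence described here can be checked directly: an RTX 3090 reports compute capability 8.6 and an RTX 2080 Ti reports 7.5, and a PyTorch build only ships kernels for the architectures it was compiled for. The sketch below compares the two. The helper names `arch_tag` and `build_targets_gpu` are hypothetical, and the exact-match test is deliberately strict (a build may still run on a GPU through a compatible architecture), so treat a negative result as a hint rather than proof.

```python
from typing import Sequence, Tuple


def arch_tag(capability: Tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_targets_gpu(capability: Tuple[int, int], arch_list: Sequence[str]) -> bool:
    """Return True if the build's architecture list names this GPU exactly."""
    return arch_tag(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("GPU capability:", arch_tag(cap))
        print("build architectures:", archs)
        print("exact match:", build_targets_gpu(cap, archs))
    else:
        print("torch with CUDA is not available in this environment")
```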

Requirements

python==3.8.13
cudatoolkit==11.3.1
pytorch==1.11.0
faiss-gpu==1.7.2
recbole==1.0.1
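To confirm that an environment actually matches the versions listed above, the installed packages can be compared against the pins. This is a minimal sketch, not part of the repository: the distribution names (`torch` for pytorch, `faiss-gpu`, `recbole`) are assumptions that hold for typical pip installs, conda packages may not always expose this metadata, and `python` and `cudatoolkit` are left out because they are not Python distributions.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the Requirements list; the keys are assumed
# distribution names, not something the thread specifies.
PINNED = {"torch": "1.11.0", "faiss-gpu": "1.7.2", "recbole": "1.0.1"}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Map each name to (wanted, found, matches)."""
    report = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        # Drop a local build tag such as "+cu113" before comparing.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, ok)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in compare_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```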

Below are the results on the Yelp dataset:

11 Apr 12:19    INFO  best valid : OrderedDict([('recall@10', 0.0901), ('recall@20', 0.1339), ('recall@50', 0.2138), ('ndcg@10', 0.0666), ('ndcg@20', 0.08), ('ndcg@50', 0.1013)])
11 Apr 12:19    INFO  test result: OrderedDict([('recall@10', 0.0912), ('recall@20', 0.1358), ('recall@50', 0.2171), ('ndcg@10', 0.0679), ('ndcg@20', 0.0815), ('ndcg@50', 0.103)])

hyp1231 commented 2 years ago

Thanks for the feedback!