Floating point exception (core dumped) problem

Description

I ran into a problem while trying to reproduce the code from the GIANT paper. I used my own text-attributed graph dataset and followed GIANT's data processing instructions.

Strangely, the crash occurs at training level 1, while training level 0 completes without error. I tried to track the issue down, and the only lead I found is that the crash may occur in the sparse_matmul() function called from matcher._predict().
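On Linux, "Floating point exception" is the message for SIGFPE, which native code often raises on an integer division by zero, so degenerate inputs are one thing worth ruling out. The sketch below is only a diagnostic aid, not part of PECOS: the helper name summarize_inputs is hypothetical, and it assumes X.trn.tfidf.npz and Y.trn.npz can be read with scipy.sparse.load_npz. It reports mismatched row counts, non-finite feature values, and empty rows or label columns.

```python
import numpy as np
import scipy.sparse as sp


def summarize_inputs(X, Y):
    """Return basic sanity statistics for a feature matrix X and a label matrix Y.

    Hypothetical helper for debugging; not part of the PECOS API.
    """
    # Work on copies so that dropping explicit zeros does not modify the inputs.
    X = sp.csr_matrix(X, copy=True)
    Y = sp.csr_matrix(Y, copy=True)
    X.eliminate_zeros()
    Y.eliminate_zeros()
    return {
        # Instances in X and Y must line up row by row.
        "row_count_match": X.shape[0] == Y.shape[0],
        # NaN or inf values among the stored feature entries.
        "x_nonfinite": int(np.count_nonzero(~np.isfinite(X.data))),
        # Instances with an all-zero feature vector.
        "x_empty_rows": int(np.count_nonzero(np.diff(X.indptr) == 0)),
        # Instances without any positive label.
        "y_empty_rows": int(np.count_nonzero(np.diff(Y.indptr) == 0)),
        # Labels that are never assigned to any instance.
        "y_empty_cols": int(np.count_nonzero(np.diff(Y.tocsc().indptr) == 0)),
    }


# Example with the file names from the command (assumed to be scipy .npz files):
# X = sp.load_npz("X.trn.tfidf.npz")
# Y = sp.load_npz("Y.trn.npz")
# print(summarize_inputs(X, Y))
```

Non-zero counts here do not prove that the data causes the crash, but they narrow down what to inspect before the next long training run.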

Steps to reproduce

I ran the following command:

CUDA_VISIBLE_DEVICES=1 python3 -m pecos.xmc.xtransformer.train -t X.trn.txt -x X.trn.tfidf.npz -y Y.trn.npz -m xrt_models --batch-gen-workers 0
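Because the process dies without a Python traceback, it may help to rerun the same command with the standard-library faulthandler module enabled, which dumps the Python stack when a fatal signal such as SIGFPE arrives. The wrapper below is a minimal sketch: the function names are hypothetical, and it assumes that running the module in-process via runpy behaves the same as python3 -m.

```python
import faulthandler
import runpy
import sys


def enable_crash_tracebacks():
    """Dump Python tracebacks of all threads on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL."""
    faulthandler.enable(file=sys.__stderr__, all_threads=True)


def run_module_with_tracebacks(module_name, args):
    """Run `python -m module_name args...` in-process with faulthandler enabled."""
    enable_crash_tracebacks()
    # Present the arguments to the module as if they came from the command line.
    sys.argv = [module_name, *args]
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)


# Example mirroring the command above (set CUDA_VISIBLE_DEVICES=1 in the shell as before):
# run_module_with_tracebacks(
#     "pecos.xmc.xtransformer.train",
#     ["-t", "X.trn.txt", "-x", "X.trn.tfidf.npz", "-y", "Y.trn.npz",
#      "-m", "xrt_models", "--batch-gen-workers", "0"],
# )
```

The interpreter's built-in -X faulthandler option (or the PYTHONFAULTHANDLER=1 environment variable) should give the same effect without a wrapper. Either way, the dumped stack would show whether the crash really happens inside matcher._predict().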

Error message or code output

12/29/2023 13:02:58 - INFO - pecos.xmc.xtransformer.matcher - | [   5/   5][  7150/  7220] | 1373/1444 batches | ms/batch 451.6586 | train_loss 7.300417e-01 | lr 9.695291e-07
12/29/2023 13:03:24 - INFO - pecos.xmc.xtransformer.matcher - | [   5/   5][  7200/  7220] | 1423/1444 batches | ms/batch 451.0563 | train_loss 7.260027e-01 | lr 2.770083e-07
12/29/2023 13:03:24 - INFO - pecos.xmc.xtransformer.matcher - | **** saving model (avg_prec=0) to /tmp/tmpo8wg3j8h at global_step 7200 ****
12/29/2023 13:03:26 - INFO - pecos.xmc.xtransformer.matcher - -----------------------------------------------------------------------------------------
12/29/2023 13:03:36 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmpo8wg3j8h
Floating point exception (core dumped)

Environment

Operating system: Ubuntu-22.04.1 (X86)
Python version: 3.9.18
PECOS version: 1.2.2
torch: 1.13.1
numpy: 1.26.2
scipy: 1.11.4
transformers: 4.36.2

amzn / pecos

Floating point exception (core dumped) problem #273
