amzn / pecos

PECOS - Prediction for Enormous and Correlated Spaces
https://libpecos.org/
Apache License 2.0

Floating point exception (core dumped) problem #273

Open wykk00 opened 11 months ago

wykk00 commented 11 months ago

Description

I ran into a problem while trying to reproduce the GIANT paper code. I used my own text-attributed graph dataset and followed GIANT's data-processing instructions.

Strangely, the problem occurs at training level 1, while training level 0 completes without error. I tried to debug the issue, and the only lead I found is that the crash may happen in the sparse_matmul() call in matcher._predict().
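
A quick sanity check on the sparse inputs may help narrow this down. The sketch below is only a diagnostic under stated assumptions: summarize_sparse is a hypothetical helper, not part of PECOS, and it assumes the .npz files passed via -x and -y are SciPy sparse matrices that scipy.sparse.load_npz can read. It counts empty rows, empty columns and non-finite values, which are plausible triggers for a division by zero in native code. It does not prove that any of them is the cause here.

```python
import numpy as np
import scipy.sparse as sp


def summarize_sparse(mat):
    """Return basic health statistics for a SciPy sparse matrix (hypothetical helper)."""
    csr = sp.csr_matrix(mat)
    row_nnz = np.diff(csr.indptr)            # stored entries per row
    col_nnz = np.diff(csr.tocsc().indptr)    # stored entries per column
    return {
        "shape": csr.shape,
        "nnz": int(csr.nnz),
        "empty_rows": int((row_nnz == 0).sum()),
        "empty_cols": int((col_nnz == 0).sum()),
        "non_finite": int((~np.isfinite(csr.data)).sum()),
    }


if __name__ == "__main__":
    # Tiny synthetic matrix: row 1 and column 2 hold no entries.
    demo = sp.csr_matrix(
        np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
    )
    print(summarize_sparse(demo))
```

With the files from the command in the next section, this could be called as summarize_sparse(sp.load_npz("Y.trn.npz")) and summarize_sparse(sp.load_npz("X.trn.tfidf.npz")), again assuming both files are in SciPy's .npz sparse format.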

Steps to reproduce

The command is

CUDA_VISIBLE_DEVICES=1 python3 -m pecos.xmc.xtransformer.train -t X.trn.txt -x X.trn.tfidf.npz -y Y.trn.npz -m xrt_models --batch-gen-workers 0

Error message or code output

12/29/2023 13:02:58 - INFO - pecos.xmc.xtransformer.matcher - | [   5/   5][  7150/  7220] | 1373/1444 batches | ms/batch 451.6586 | train_loss 7.300417e-01 | lr 9.695291e-07
12/29/2023 13:03:24 - INFO - pecos.xmc.xtransformer.matcher - | [   5/   5][  7200/  7220] | 1423/1444 batches | ms/batch 451.0563 | train_loss 7.260027e-01 | lr 2.770083e-07
12/29/2023 13:03:24 - INFO - pecos.xmc.xtransformer.matcher - | **** saving model (avg_prec=0) to /tmp/tmpo8wg3j8h at global_step 7200 ****
12/29/2023 13:03:26 - INFO - pecos.xmc.xtransformer.matcher - -----------------------------------------------------------------------------------------
12/29/2023 13:03:36 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmpo8wg3j8h
Floating point exception (core dumped)
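
The shell message alone does not show where the process died. As a hedged sketch, Python's standard faulthandler module can dump the Python-level traceback when a fatal signal such as SIGFPE arrives. The helper name enable_crash_traceback and the log file name are hypothetical, and the traceback only identifies the Python frame that called into native code, not the failing native line.

```python
import faulthandler


def enable_crash_traceback(log_path="pecos_crash.log"):
    """Dump Python tracebacks of all threads to log_path on fatal signals (hypothetical helper)."""
    log_file = open(log_path, "w")
    # faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file  # keep a reference so the file stays open until the process exits
```

The same effect is available without code changes by adding -X faulthandler right after python3 in the command above, in which case the traceback is written to stderr.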

Environment