decisionintelligence / pathformer


Model running produces NaN values. #4

Open Tbabtm opened 5 months ago

Tbabtm commented 5 months ago

Hello, thank you for your work. I ran into a problem when using your model: training on my own dataset produces NaN values. My dataset has the same format as weather.csv, but with different field values and a different number of fields. Interestingly, the same dataset trains without any issues on other models, such as the ICLR spotlight paper iTransformer. With all of your model's parameters left unchanged, training with seq_len=96 and pred_len in [96, 192, 336] results in NaN values and failure. However, training with seq_len=96 and pred_len=336 does not result in NaN values and is successful. I believe my data is fine, so there might be a bug in your model. The specific error message is as follows:

Traceback (most recent call last):
  File "/data/zhangshi/jiangjun/remote/pywork/tmp/pycharm_project_431/run.py", line 112, in <module>
    exp.train(setting)
  File "/data/zhangshi/jiangjun/remote/pywork/tmp/pycharm_project_431/exp/exp_main.py", line 143, in train
    outputs, balance_loss = self.model(batch_x)
  File "/data/zhangshi/.conda/envs/jj-commonenvs/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/zhangshi/jiangjun/remote/pywork/tmp/pycharm_project_431/models/PathFormer.py", line 57, in forward
    out, aux_loss = layer(out)
  File "/data/zhangshi/.conda/envs/jj-commonenvs/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/zhangshi/jiangjun/remote/pywork/tmp/pycharm_project_431/layers/AMS.py", line 103, in forward
    gates, load = self.noisy_top_k_gating(new_x, self.training)
  File "/data/zhangshi/jiangjun/remote/pywork/tmp/pycharm_project_431/layers/AMS.py", line 94, in noisy_top_k_gating
    load = (self._prob_in_top_k(clean_logits, noisy_logits, noise_stddev, top_logits)).sum(0)
  File "/data/zhangshi/jiangjun/remote/pywork/tmp/pycharm_project_431/layers/AMS.py", line 61, in _prob_in_top_k
    prob_if_in = normal.cdf((clean_values - threshold_if_in) / noise_stddev)
  File "/data/zhangshi/.conda/envs/jj-commonenvs/lib/python3.10/site-packages/torch/distributions/normal.py", line 87, in cdf
    self._validate_sample(value)
  File "/data/zhangshi/.conda/envs/jj-commonenvs/lib/python3.10/site-packages/torch/distributions/distribution.py", line 300, in _validate_sample
    raise ValueError(
ValueError: Expected value argument (Tensor of shape (256, 4)) to be within the support (Real()) of the distribution Normal(loc: tensor([0.], device='cuda:0'), scale: tensor([1.], device='cuda:0')), but found invalid values:
tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        ...,
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan]], device='cuda:0', grad_fn=<DivBackward0>)
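For anyone reproducing this, one way to rule the data in or out is to scan the CSV before training. The snippet below is only a sketch and is not part of the Pathformer code: `find_suspect_columns` is a hypothetical helper, and the `date` column name is an assumption based on the weather.csv layout. It flags columns that contain empty, NaN, or infinite cells, and columns that never change. Nothing in this thread confirms that either condition is what triggers the NaN in the gating layer; the check only helps narrow things down.

```python
import csv
import io
import math


def find_suspect_columns(csv_text, skip_columns=("date",)):
    """Return {column: reason} for columns with non-finite or constant values.

    Hypothetical helper for illustration; not part of the Pathformer repository.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {}
    for row in reader:
        for name, raw in row.items():
            if name in skip_columns:
                continue
            try:
                value = float(raw)
            except (TypeError, ValueError):
                value = math.nan  # empty or non-numeric cells count as missing
            columns.setdefault(name, []).append(value)

    suspects = {}
    for name, values in columns.items():
        if any(not math.isfinite(v) for v in values):
            suspects[name] = "non-finite"
        elif max(values) == min(values):
            suspects[name] = "constant"
    return suspects


# Tiny in-memory example: "rain" has an empty cell, "flag" never changes.
sample = "date,temp,rain,flag\n2024-01-01,1.5,,7\n2024-01-02,2.5,0.1,7\n"
print(find_suspect_columns(sample))
```

To check a real file, pass its contents, for example `find_suspect_columns(open("my_data.csv").read())`, where `my_data.csv` is a placeholder path. An empty result means the data passed these two checks, not that the data is ruled out entirely.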

shandongpengyuyan commented 3 months ago

Have you solved this problem?

wickedCuriosity commented 3 months ago

I've run into this problem too. Did you manage to solve it?

Tbabtm commented 3 months ago

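Sorry, at the time I only wanted to reproduce the results, and when the run failed I didn't look into it any further. I think it's still best to email the authors.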


Tbabtm commented 3 months ago

Sorry, I was just trying to replicate it. When I ran into problems, I didn't bother to fix them. I think it's best to email the author.


PengChen12 commented 1 month ago

We have resolved this issue. Please download the updated Pathformer code from the GitHub repository. Note that you need to set the batch_norm parameter to 1 when running on your own dataset.