decisionintelligence / pathformer


ValueError: Encountering NaN Values in Model Training with PathFormer on Custom Dataset #9

Open wickedCuriosity opened 3 months ago

wickedCuriosity commented 3 months ago

I am encountering an issue while training the PathFormer model with my own custom dataset: NaN values appear during some epochs, causing the training process to halt. Below is the specific error message:

```
Traceback (most recent call last):
  File "D:\多模态项目\pathformer-main\run.py", line 108, in <module>
    exp.train(setting)
  File "D:\多模态项目\pathformer-main\exp\exp_main.py", line 147, in train
    outputs, balance_loss = self.model(batch_x)
  File "D:\anaconda3\envs\pathformer\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\envs\pathformer\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\多模态项目\pathformer-main\models\PathFormer.py", line 56, in forward
    out, aux_loss = layer(out)
  File "D:\anaconda3\envs\pathformer\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\anaconda3\envs\pathformer\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\多模态项目\pathformer-main\layers\AMS.py", line 104, in forward
    gates, load = self.noisy_top_k_gating(new_x, self.training)
  File "D:\多模态项目\pathformer-main\layers\AMS.py", line 95, in noisy_top_k_gating
    load = (self._prob_in_top_k(clean_logits, noisy_logits, noise_stddev, top_logits)).sum(0)
  File "D:\多模态项目\pathformer-main\layers\AMS.py", line 62, in _prob_in_top_k
    prob_if_in = normal.cdf((clean_values - threshold_if_in) / noise_stddev)
  File "D:\anaconda3\envs\pathformer\lib\site-packages\torch\distributions\normal.py", line 93, in cdf
    self._validate_sample(value)
  File "D:\anaconda3\envs\pathformer\lib\site-packages\torch\distributions\distribution.py", line 312, in _validate_sample
    raise ValueError(
ValueError: Expected value argument (Tensor of shape (64, 4)) to be within the support (Real()) of the distribution Normal(loc: tensor([0.], device='cuda:0'), scale: tensor([1.], device='cuda:0')), but found invalid values:
tensor([[ 0.4240,  0.4842,  0.4915,  0.2938],
        [-0.6390, -0.7151, -0.5837, -0.6872],
        [-0.2870, -0.2182, -0.0751, -0.3691],
        ...
        [    nan,     nan,     nan,     nan],
        [ 0.4950,  0.2916,  0.6369,  0.3248],
        [ 0.3736,  0.4382,  0.6176,  0.2634],
        [ 1.0124,  0.8702,  1.1794,  0.9334],
        ...
        [-0.1174, -0.1580, -0.0815, -0.0602]], device='cuda:0', grad_fn=)

Process finished with exit code 1
```

Before encountering the NaN error, the training logs were as follows:

```
Epoch: 8, Steps: 1165 | Train Loss: 0.0754079 Vali Loss: 0.1464513 Test Loss: 0.0984371
Validation loss decreased (0.146857 --> 0.146451). Saving model ...
Updating learning rate to 0.0009999999873752242
iters: 100, epoch: 9 | loss: 0.0572389  speed: 1.7723s/iter; left time: 24601.7166s
iters: 200, epoch: 9 | loss: 0.0790029  speed: 0.3585s/iter; left time: 4940.7230s
iters: 300, epoch: 9 | loss: 0.0669244  speed: 0.3580s/iter; left time: 4897.5288s
iters: 400, epoch: 9 | loss: 0.0766764  speed: 0.3553s/iter; left time: 4824.7189s
iters: 500, epoch: 9 | loss: 0.0636202  speed: 0.3570s/iter; left time: 4812.2685s
iters: 600, epoch: 9 | loss: 0.0846734  speed: 0.3600s/iter; left time: 4817.5801s
iters: 700, epoch: 9 | loss: 0.0806210  speed: 0.3584s/iter; left time: 4760.2747s
iters: 800, epoch: 9 | loss: 0.0870242  speed: 0.3558s/iter; left time: 4689.8037s
iters: 900, epoch: 9 | loss: 0.0674230  speed: 0.3551s/iter; left time: 4644.6707s
```

I am confident that my original data is fine, and all previous epochs ran smoothly. Why is this happening? Thank you for your help!
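One quick way to back up the claim that the data is fine is to scan the training batches for non-finite values before training. This is only a minimal sketch: `train_loader` and the batch tuple layout are placeholders for whatever the experiment's DataLoader actually yields.

```python
# Minimal sketch: scan the training batches for NaN/Inf values to rule out the
# data pipeline before suspecting the model. `train_loader` and the batch tuple
# layout are placeholders for the experiment's actual DataLoader.
import torch

def scan_loader_for_nonfinite(train_loader):
    for i, (batch_x, batch_y, *rest) in enumerate(train_loader):
        if not torch.isfinite(batch_x).all() or not torch.isfinite(batch_y).all():
            print(f"Non-finite values found in batch {i}")
            return i
    print("All training batches are finite")
    return None
```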

LimFang commented 2 months ago

You may try torch.autograd.detect_anomaly() to see what happens during the backward pass (see the sketch below).
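For concreteness, here is a minimal sketch of that suggestion. The names (model, optimizer, criterion, batch_x, batch_y) are placeholders rather than PathFormer's actual training loop; only the `(outputs, balance_loss)` return shape is taken from the traceback above.

```python
# Minimal sketch: wrap one training step in PyTorch's anomaly detector so
# autograd reports the first operation that produces NaN/Inf gradients.
# All names below are placeholders, not PathFormer's actual training code.
import torch

def debug_train_step(model, optimizer, criterion, batch_x, batch_y):
    with torch.autograd.detect_anomaly():
        optimizer.zero_grad()
        outputs, balance_loss = model(batch_x)             # PathFormer returns (outputs, balance_loss)
        loss = criterion(outputs, batch_y) + balance_loss
        loss.backward()                                    # raises at the op that created the NaN
        optimizer.step()
    return loss.item()
```

Anomaly detection slows training considerably, so it is best enabled only while hunting for the faulty operation.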

unihe commented 1 month ago

Hi, have you solved this problem?

PengChen12 commented 1 month ago

We have fixed this issue, and you can download the new pathformer code from the GitHub repository. Note that you need to set the batch_norm parameter to 1 when running your dataset.
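For anyone applying this fix: the exact command-line name of the batch_norm option is not shown in this thread, so the snippet below is only a guess at how it might be wired through argparse in run.py. Check the updated repository's run scripts for the real flag.

```python
# Hypothetical sketch only: assumes the updated run.py exposes the option as an
# argparse flag named --batch_norm (verify the actual name in the repository).
import argparse

parser = argparse.ArgumentParser(description="PathFormer training (illustrative)")
parser.add_argument("--batch_norm", type=int, default=0,
                    help="set to 1 to enable batch normalization, as recommended for custom datasets")

# Equivalent to running:  python run.py --batch_norm 1
args = parser.parse_args(["--batch_norm", "1"])
print(args.batch_norm)  # -> 1
```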

shandongpengyuyan commented 1 month ago

> We have fixed this issue, and you can download the new pathformer code from the GitHub repository. Note that you need to set the batch_norm parameter to 1 when running your dataset.

May I ask you a question: is PathFormer channel-independent or channel-dependent? Thank you for your help!

PengChen12 commented 1 month ago

channel dependent


wickedCuriosity commented 1 month ago

> We have fixed this issue, and you can download the new pathformer code from the GitHub repository. Note that you need to set the batch_norm parameter to 1 when running your dataset.

Thank you for your response. Could you please provide more guidance on what might be causing this issue? I would be grateful for any insights you could share on how to resolve it.