LSH9832 / edgeyolo

an edge-real-time anchor-free object detector with decent performance
Apache License 2.0

../aten/src/ATen/native/cuda/Loss.cu:92: operator(): block: [37,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed. #16

Open ramonhollands opened 1 year ago

ramonhollands commented 1 year ago

Hi Shihan Liu,

Thanks for your work on this repo!

I'm trying to run a custom training myself, using your train script and the YOLO label format. The dataset seems to be fine, judging by demo/load_dataset.py. Any idea where I should start my bug hunt, given the above error?

Best regards, Ramon

LSH9832 commented 1 year ago

This error occurs when:

  1. There is at least one NaN (not a number) element in your prediction; check it with torch.isnan(tensor). If so, reducing your learning rate might solve your problem, but sometimes it is because your installed CUDA version is too high for your GPU device(s). You can locate NaN gradients like this (see also the sketch after this list):

     for name, param in self.model.named_parameters():
         if param.grad is not None and torch.isnan(param.grad).any():
             print("nan gradient name:", name)

  2. The length of the predictions does not match the length of the labels. Please check your dataset config and make sure it is correct.
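
For reference, the assertion in the issue title is raised by PyTorch's CUDA binary cross-entropy kernel, which requires its input to lie in [0, 1]; a NaN or out-of-range prediction (or a prediction/label mismatch) is what usually trips it. Below is a minimal sketch of how you might inspect the tensors right before the loss is computed; check_bce_inputs, preds and targets are placeholder names, not variables from this repo's loss code.

    import torch

    def check_bce_inputs(preds: torch.Tensor, targets: torch.Tensor) -> None:
        # NaN anywhere in the predictions will trip the CUDA assertion.
        if torch.isnan(preds).any():
            print("NaN detected in predictions")

        # binary_cross_entropy on CUDA asserts 0 <= input_val <= 1.
        if (preds < 0).any() or (preds > 1).any():
            print("prediction values outside [0, 1]:",
                  preds.min().item(), preds.max().item())

        # A mismatch between prediction and label shapes is the other common cause.
        if preds.shape != targets.shape:
            print("shape mismatch:", tuple(preds.shape), tuple(targets.shape))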
LSH9832 commented 1 year ago

If it still cannot be solved and your custom dataset is shareable, you can send your dataset to my email (bitshliu@qq.com) for testing, and I will do my best to reply to you before the end of this week.

ramonhollands commented 1 year ago

Thanks for your fast reply.

It looks like it's training correctly after changing the fp16 setting to false again. Could that be the cause? First 30 epochs have succeeded so far. I'll keep you posted.

LSH9832 commented 1 year ago

What type of GPU do you have? I guess your GPU might not fully support AMP (automatic mixed precision) training. I tested training with fp16 enabled and it works well on RTX 30-series GPUs and the Tesla T4.
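
For anyone comparing notes, here is a minimal, self-contained sketch of what fp16/AMP training looks like in plain PyTorch; it is not this repo's actual trainer, just an illustration of what the fp16 flag switches on: the usual GradScaler pattern, with the loss kept in fp32 (binary_cross_entropy_with_logits is safer under autocast than sigmoid followed by binary_cross_entropy, whose input must stay in [0, 1]).

    import torch

    # Toy setup; use_fp16 mirrors the fp16 flag discussed above.
    use_fp16 = True
    device = "cuda"
    model = torch.nn.Linear(16, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)

    for _ in range(10):
        x = torch.randn(8, 16, device=device)
        y = torch.randint(0, 2, (8, 1), device=device).float()

        optimizer.zero_grad()
        # The forward pass runs in half precision when AMP is enabled.
        with torch.cuda.amp.autocast(enabled=use_fp16):
            logits = model(x)

        # Compute the loss in fp32 on the raw logits.
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits.float(), y)

        scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)
        scaler.update()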

ramonhollands commented 1 year ago

It's an RTX 3090, running CUDA 11.4. I am doing some experiments these days and will let you know. If you are interested in this dataset, I can share it with you.

Looking forward to eventually contributing to this repo.

LSH9832 commented 1 year ago

Oh, it might be because of the CUDA version. I'm using CUDA 11.1 with an RTX 3090 on Ubuntu 18.04. And I'd like to know more about your dataset and your experiments.

x-yy0 commented 1 year ago

> Thanks for your fast reply.
>
> It looks like it's training correctly after changing the fp16 setting to false again. Could that be the cause? First 30 epochs have succeeded so far. I'll keep you posted.

I have the same problem after setting fp16 to true.