training loop crashes at about loop 200+

yxcntu commented 4 months ago

Hey Sourish,

Encountered your YT video and has been learning from your tutorial these days, I just started following the same code and learning from how it works but it kept crashing at training loop 200+ in the image.

This timing is when my RAM go up to around 20GB out of my 32 GB ram + 32GB Swap ( just in case this info is helpful )

However it also crashes even when I tried it on a VastAI host with 96GB ram and RTX 3090.

Would you mind give some pointers if you've seen this?

Thanks a lot !

My GPU is as below:

Sourish07 commented 4 months ago

Hello, sorry for the late response. I was caught up on my most recent video.

Sorry, but I have never seen this issue before.

Looking at your nvidia-smi output, I'm not sure if the python script is even running on the GPU. Your GPU VRAM usage should definitely be much higher, so many start off by making sure the code is running on the GPU? Also, make sure your drivers are updated. Other than those two things, I'm not really sure what else might be worth trying.

Let me know if you make any breakthroughs!

yxcntu commented 4 months ago

Hello, sorry for the late response. I was caught up on my most recent video.

Sorry, but I have never seen this issue before.

Looking at your nvidia-smi output, I'm not sure if the python script is even running on the GPU. Your GPU VRAM usage should definitely be much higher, so many start off by making sure the code is running on the GPU? Also, make sure your drivers are updated. Other than those two things, I'm not really sure what else might be worth trying.

Let me know if you make any breakthroughs!

Thanks for your time and reply. Looking forward to your new video!

The nvidia-smi output is not during the programming run so it's not relevant, just want to show the driver version as there's forums saying that the error is related to nvidia driver version or CUDA version.

Anyway just want to check if you've seen the error. Might just be an unknown torchrl issue for some RAM or CPU configs.

Also I'm very pleased to see that the Agent is able to finish training with 60K (instead of 100K which caused above error) replay experiences too and the result is very promising as well, which might make sense as 60K is not too shy for the training anyway.

Sourish07 commented 4 months ago

I wish I could be of more help, but yeah it totally might just be a weird torchrl issue. So you were able to get it fixed by lowering the replay buffer size to 60k? That's actually very interesting... Not sure why that might be an issue.

How is your disk space looking? I know we're using memory mapped files for the replay buffer, so maybe that's the issue?

yxcntu commented 3 months ago

I wish I could be of more help, but yeah it totally might just be a weird torchrl issue. So you were able to get it fixed by lowering the replay buffer size to 60k? That's actually very interesting... Not sure why that might be an issue.

How is your disk space looking? I know we're using memory mapped files for the replay buffer, so maybe that's the issue?

sorry for late reply.. got carried away by other works..

yeah I was able to run it end to end using 60k buffer size on my rig setting but there's no way to find out why.. it was using only 10GB of my 32GB RAM so shouldn't be RAM issue. I also allocated 32GB of swap space and 0 was used during the whole time.

Anyway I got it running and it can finish the first episode successfully but I noticed one issue - if I stop the learning process after 50k loops, the model weights stay in fixed values and never change, and it will always predict the same result which means if I let it play the episode again, it will always play in the exactly the same way, and if it cannot find the flag it never will..

so I'm wondering how can we tell that the model is ready for delivery? should I place a check that whenever it reaches the flag (e.g. do a few more loops after 50k), I immediately stop the process and save the model? so that the model is guaranteed to reach the flag. or there's a way to make the model reach a state that it can always find the flag no matter what?

Sourish07 / Super-Mario-Bros-RL

training loop crashes at about loop 200+ #5