davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

Training gets stuck on GeForce RTX 2080 Ti #1513

Closed reunanen closed 6 years ago

reunanen commented 6 years ago

I'm training semantic-segmentation networks, along the lines of #288.

The code works great on GeForce GTX products, but with new RTX hardware training gets stuck randomly. Looks like a race condition or similar, because the freeze happens after multiple iterations and usually after different number of iterations (from run to run).

Using CUDA 10 and cuDNN 7.3 on 64-bit Windows (but same issue with CUDA 8 and cuDNN 5). Latest master from GitHub. MSVC debugger shows the code is waiting on this line.

Because the steps to reproduce include acquiring RTX hardware, I'd rather spend time trying to debug the issue than write a complete set of steps – at least at this point.

davisking commented 6 years ago

That's weird. cudaStreamSynchronize should never hang forever. It might be a bug in CUDA. What happens if you comment out the call to cudaStreamSynchronize? Does it stop hanging?

reunanen commented 6 years ago

I agree that this might be a bug in CUDA.

The process consumes 100% of one CPU core while it's in cudaStreamSynchronize.

If I change cudaStreamSynchronize to a loop that polls cudaStreamQuery until cudaSuccess is returned, it seems that the problem goes away. But need more testing to be sure about it.
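Roughly something like this (a simplified sketch of the idea, not the exact change I made in the dlib source; error handling is illustrative only):

```cpp
// Sketch of the workaround: busy-poll cudaStreamQuery instead of blocking in
// cudaStreamSynchronize.
#include <cuda_runtime.h>
#include <stdexcept>
#include <thread>

inline void synchronize_by_polling(cudaStream_t stream)
{
    for (;;)
    {
        const cudaError_t err = cudaStreamQuery(stream);
        if (err == cudaSuccess)
            return;                        // all work queued on the stream has finished
        if (err != cudaErrorNotReady)
            throw std::runtime_error(cudaGetErrorString(err));  // a real error, not "still busy"
        std::this_thread::yield();         // give the scheduler a chance; still close to 100% CPU
    }
}
```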

reunanen commented 6 years ago

Interestingly, I haven't been able to reproduce this problem using the semantic segmentation example. But I guess this is normal if it's a race condition or similar – given our application and data, the bottlenecks are surely in somewhat different places than with the example program/data.

davemers0160 commented 6 years ago

@reunanen I've started seeing the same thing, with the call to cudaStreamSynchronize(0) causing a hang. One thing that I've noticed is that it goes away when I switch from release mode to debug mode in VS2017 (v15.7.5). This has happened on two different computers, both running Win10. One machine has a GTX 1080 Ti and the other has a Quadro M6000. Both are up to date on drivers and are running CUDA Toolkit 10.0 / cuDNN v7.3.0.29. Similarly, I've only seen this happen on one set of dnn architectures. Not sure if it is a size issue, but the nets only have 43 layers, 8 of which are tag1/add_prev1 pairs.

davisking commented 6 years ago

Maybe it's just a Windows issue then, instead of something specific to that card. @reunanen, have you seen the issue on Linux? If not, maybe the PR should have an #ifdef so it only applies on Windows.
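I.e. something roughly like this (just a sketch of the idea, with synchronize_by_polling standing in for whatever the PR does, and illustrative error handling):

```cpp
#include <cuda_runtime.h>
#include <stdexcept>

// Apply the polling workaround only on Windows; keep the normal blocking
// synchronize everywhere else.  synchronize_by_polling() is the hypothetical
// helper sketched earlier in the thread.
inline void wait_for_stream(cudaStream_t stream)
{
#ifdef _WIN32
    synchronize_by_polling(stream);   // Windows-only workaround for the observed hang
#else
    const cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess)
        throw std::runtime_error(cudaGetErrorString(err));
#endif
}
```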

reunanen commented 6 years ago

Yeah, it could well be a timing issue made visible by the new card (and/or a specific dnn architecture). Haven't seen it on Linux, but haven't tried either.

Would be a little surprised though, because we've run identical code on single and dual GTX 1080 Ti rigs for countless hours already, without issues. Whereas on RTX 2080 Ti, the occurrence of the hang is a matter of minutes, not hours. But yeah, I know – if it's a race condition or such, then even this is not too unexpected.

@davemers0160 if you cherry-pick #1514, does your release-mode build still hang?

davemers0160 commented 6 years ago

I won't have access to the two computers until Monday. I will try then.

I also just noticed that there is a new driver for the GTX series. The systems I'm using both have driver version 411.63. It might be worth checking whether a driver update fixes the problem.

davemers0160 commented 6 years ago

Also I haven't been brave enough to upgrade my Ubuntu 16.04 system from Cuda Toolkit 9.1 to 10.0.

davemers0160 commented 6 years ago

I incorporated #1514 on my Quadro machine and was able to complete a training session (~5 hrs). Before making the change on my GTX 1080 machine I upgraded the driver to 416.34, but that did not help: it still stalled within the first few minutes of training. I then applied the #1514 change and was able to complete the training (~5 hrs).

davisking commented 6 years ago

@davemers0160, did the error occur only on Windows?

davemers0160 commented 6 years ago

I've only run across the error on Windows 10. I haven't tested any Win7 or Win8.1 machines. I also haven't upgraded any Ubuntu machines to Cuda Toolkit v10.0.

xsacha commented 6 years ago

Someone here recently came across this issue and wasn't sure what was causing it. Some interesting finds regarding this hang:

- Works with CUDA 9.1 (cuDNN 7.1) + old NVIDIA driver.
- Didn't work with CUDA 9.2 (cuDNN 7.2) + old or new NVIDIA driver. This is how we were experiencing this issue!
- Works with CUDA 10 (cuDNN 7.4) + new NVIDIA driver. This is what we now use for training.

This is happening with a Titan V on Windows 10.

The hang was experienced after ~10 minutes of training. We noticed that it then hits 100% CPU, stuck in synchronisation. Also, it does not occur in Debug mode or if optimisations are turned off.

We were using CUDA 9.2 because that's what our inferencing application requires, and CUDA 10 needs drivers that are too new for it. We have no such issues in inferencing, only in training.

Considering everyone here says the issue still occurs on CUDA 10, I'm not sure why CUDA 10 is working for us.

reunanen commented 6 years ago

@xsacha Did you test with #1514?

I think this continues to sound like a timing issue – a race condition or similar. Whether your application code is built in Debug mode or with optimizations turned off shouldn't change the CUDA code that is executed, but it does affect how quickly different things complete, in particular relative to each other (and the same goes for using different dnn architectures, as in @davemers0160's case, or different GPU hardware, as in mine).

So to me it sounds like this is a latent bug that pretty much every version has, but most combinations of application code / dnn architecture / GPU hardware just don't make the bug surface.

xsacha commented 6 years ago

Ok, just an update. I reported before that I had it working with CUDA 10. It definitely is not working with CUDA 10; the hang just happens less often!

CUDA 9.1 is the only release where I have no issues (without #1514 applied).

I haven't tried #1514 yet. Will get it when the next release occurs.

I think you're right about the timing issue. However, I've never seen this issue in CUDA 9.1 (which is our most tested release). So I'm happy to continue using it until I get this patch.

davemers0160 commented 5 years ago

Just wanted to give you all an update on this. I just ran into this problem 3 times out of about 57 different trials. Same architecture for each trial. The only difference now is that I'm seeing it on CUDA 9.0, for an IBM Power8 machine running RHEL7 with 4 P100's. This was using dlib-19.15. I'm going to try the patch, but with ~5% chance of the stall happening I might not catch it.

davemers0160 commented 5 years ago

Ok. So after several experiments I am still seeing the training get stuck every once in a while, even after making the changes in #1514. It seems that if I don't load the GPUs down as much as possible, the stall happens more often. I still have not seen the stall occur with the changes implemented on a Win 10 / CUDA 10 machine.

xsacha commented 5 years ago

Has anyone worked out whether this only happens with the WDDM2 driver, or whether it happens with TCC too?

reunanen commented 5 years ago

@davemers0160 : So is it that cudaStreamQuery never returns cudaSuccess, or simply doesn't return at all?

Pre-#1514, cudaStreamSynchronize didn't return at all (at least in the cases I saw).
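If it's feasible to patch and redeploy on that cluster, a sketch like the following (purely illustrative, made-up names) could tell the two cases apart:

```cpp
// Illustrative diagnostic only: distinguish "cudaStreamQuery keeps returning
// cudaErrorNotReady forever" from "the call itself never comes back".
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>

inline cudaError_t synchronize_with_report(cudaStream_t stream)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto last_report = start;
    for (;;)
    {
        // If this call never returns, the hang is inside cudaStreamQuery itself.
        const cudaError_t err = cudaStreamQuery(stream);
        if (err != cudaErrorNotReady)
            return err;  // cudaSuccess, or a real error for the caller to handle

        const auto now = clock::now();
        if (now - last_report > std::chrono::seconds(60))
        {
            // If this keeps printing, the query returns but the stream never finishes.
            std::fprintf(stderr, "still waiting on stream after %.0f s\n",
                         std::chrono::duration<double>(now - start).count());
            last_report = now;
        }
        std::this_thread::yield();
    }
}
```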

davemers0160 commented 5 years ago

@reunanen : On the Windows side it was always cudaStreamSynchronize(0) that would never return. However, on the RHEL 7 side I can only guess that the hang is occurring at cudaError_t err = cudaStreamQuery(stream); because I get no error message returned. This system is an HPC cluster, so I don't have a lot of access or ability to debug: I pretty much submit the job and then peek in on it every once in a while to make sure it is still running.

reunanen commented 5 years ago

@davemers0160 So after #1514 it's still occurring on Windows as well? And still freezing specifically in cudaStreamSynchronize(0)? If so, then I guess the remaining calls need to be replaced also. Can do.

However, do note that (the merged version of) #1514 affects Windows only. So if you are seeing a similar freeze on RHEL also, then maybe the #ifdefs need to be removed after all.

davemers0160 commented 5 years ago

@reunanen : So far I have not seen the freeze happen on Windows after applying #1514.

reunanen commented 5 years ago

@davemers0160 Ok, thanks for the update. Could you please try #1596 on your RHEL setup? I simply made the #1514 fix apply also when not on Windows.

sshazly commented 1 year ago

I was running into a similar issue after I upgraded my older GTX 770 GPU to an RTX 4090, which I now have alongside my RTX 2070 Super. I believe I was still using the drivers for the 770 from back when the 2070 and 770 were installed side by side, with an older version of CUDA (10.2).

In my particular case, the upgrade resulted in memory access violations at cudaStreamSynchronize. Changing the source code to use the polling method introduced in this comment thread would also result in access violations, so I went down a pretty terrible rabbit hole, which I will try to document in this comment in the hope that somebody finds an actual fix. The TL;DR is that calling get_net() before train_one_step(batch_in, batch_target) in my training loop finally avoided all issues.

To begin, I am on Windows 11 (regrettably), using CUDA 12.1 on the RTX 4090 and 2070 cards. I believe the 4090 has forced me to use 12.1, since according to the compute-capability compatibility (say that fast 20 times) matrix it is the only version that supports cc 8.9.

I made sure, in my case, to install CUDA 12.1 (with the most recent NVIDIA drivers) and build dlib from source using the correct compute capabilities (8.9 and 7.5 for the 4090 and 2070). I installed Nsight Compute and got a trace:

Before getting to the memory error I had a series of 40-60 error 209s. Oddly, only some of the cuLibraryGetModule API calls returned an error 209, and the common thread in every error was that the first parameter passed was a memory address with no offset.

After clicking through these error 209s I eventually get to a memory access violation.

After going through the dlib source code and not really finding anything, I got really frustrated and needed to understand where the issue was coming from. I have a similar implementation of my code in pytorch, so I spent considerable time building it from source (and upgrading, since its CMake files are not compatible with the migration of NVIDIA nvToolsExt to header-only in CUDA 12.1).

Again, I made sure I was using the same CUDA libraries and compute capability, to eliminate version differences as an explanation for success/failure across pytorch/dlib.

What I observed in pytorch with CUDA 12.1 on my RTX 4090 and RTX 2070 was very similar to what I was seeing in dlib. It was a bit more stable in pytorch, in the sense that once the application got past a certain point it was almost guaranteed to complete, whereas in dlib it would always crash by iteration ~20000. But I most definitely observed the same errors in pytorch (no pictures of the memory error trace yet – it's a bit of a pain to step through all those error 209s, especially when it does not happen every time... and I'm not really motivated to do so since I found a fix, even if it is a terrible one).

For your consideration, here is the pytorch 209 error:

[screenshot: the same error 209 in the pytorch trace]

The thing that finally worked for me was calling get_net before train_one_step in my training loop; it does not seem to be a requirement, based on examples like dnn_introduction2_ex. So I'm not sure why adding that particular line of code finally got things working for dlib / CUDA 12.1 on Windows 11. @davisking, I have a hunch why this worked, but would love other people's input.
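Roughly what my loop looks like with the workaround (a simplified sketch; net_type, the hyperparameters, and the batch types are placeholders for my actual application):

```cpp
// Simplified sketch of my training loop with the workaround.
#include <dlib/dnn.h>
#include <cstdint>
#include <vector>

template <typename net_type>
void train_with_workaround(net_type& net)
{
    dlib::dnn_trainer<net_type> trainer(net, dlib::sgd(0.0001, 0.9));
    trainer.set_learning_rate(0.1);
    trainer.set_min_learning_rate(1e-5);

    std::vector<dlib::matrix<dlib::rgb_pixel>> batch_in;
    std::vector<dlib::matrix<uint16_t>> batch_target;   // per-pixel labels in my case

    while (trainer.get_learning_rate() >= trainer.get_min_learning_rate())
    {
        // ... fill batch_in / batch_target ...

        // Without this call I get the access violations described above; with it,
        // training runs to completion.  My hunch is that get_net() forces the
        // trainer's internal thread to finish its pending work before the next step.
        trainer.get_net();

        trainer.train_one_step(batch_in, batch_target);
    }
}
```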

BONUS CONTENT: Before the final fix, I had things semi-stable for about a week (it would complete a few runs before hitting the memory access violations). But then last Friday a storm knocked the power out, and when it finally came back on the application would consistently give memory errors after ~20000 iterations. That led me to check the core GPU voltage, thinking maybe something got fried? But the core voltage was fine: 0.88 V under no load, 0.965 V under load (the card also seemed fine, I can play Skyrim on it with no issue). At that point I got frustrated and went down the rabbit hole described above.

All signs point to a race condition somewhere in the CUDA libraries, or a change in the cuda api that dlib has not implemented yet?

davisking commented 1 year ago

Yeah near as I can tell this is a race condition in the cuda libraries. You definitely shouldn't have to call get_net() before training or anything like that. My gut feeling is that nvidia doesn't care about supporting windows as much as linux so there are just more bugs there. Not sure what can be done about it :shrug: