NVIDIA / DIGITS

Deep Learning GPU Training System
https://developer.nvidia.com/digits
BSD 3-Clause "New" or "Revised" License

Unable to Download big models #644

Open tommy87 opened 8 years ago

tommy87 commented 8 years ago

When I try to download a big model (Total learned parameters: 192,498,050), I get the message:

Error: unable to connect to NVIDIA DIGITS

and the log file says:

[1271] [CRITICAL] WORKER TIMEOUT (pid: 5706)
[1271] [CRITICAL] WORKER TIMEOUT (pid: 5706)
[5852] [INFO] Booting worker with pid: 5852

lukeyeager commented 8 years ago

Sorry about that. That's annoying. The problem is that DIGITS is working really hard to compress your huge file, and it can't get the job done before hitting the gunicorn timeout. If you use a link like:

/models/20160317-160504-f18c/download.tar

Instead of

/models/20160317-160504-f18c/download

Then you'll get the uncompressed tarball, which DIGITS should be able to create before it hits the timeout (full list of allowed extensions here).
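For anyone scripting the download, here is a minimal sketch of building and fetching the uncompressed tarball link described above. The server address `http://localhost:5000/` and the helper names are assumptions for illustration, not part of DIGITS; only the `/models/<job id>/download.tar` path comes from this thread.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Hypothetical server address; replace with wherever your DIGITS instance runs.
DIGITS_URL = "http://localhost:5000/"


def model_download_url(base_url, job_id, extension="tar"):
    """Build the download link for a model job, e.g. .../download.tar.

    base_url is expected to end with a slash.
    """
    return urljoin(base_url, f"models/{job_id}/download.{extension}")


def fetch_model(base_url, job_id, dest, extension="tar"):
    """Download the archive to dest (hypothetical helper, needs a running server)."""
    url = model_download_url(base_url, job_id, extension)
    urlretrieve(url, dest)
    return dest
```

Calling `fetch_model(DIGITS_URL, "20160317-160504-f18c", "model.tar")` would request the uncompressed tarball instead of the default compressed download.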

If that still doesn't work, you should be able to change the gunicorn timeout value (http://docs.gunicorn.org/en/stable/settings.html#timeout). I think you'd want to add timeout = 60 to /usr/share/digits/gunicorn_config.py and restart your server, but I haven't tested it. Let me know if it comes to that and I can help you.
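As a sketch of what that change might look like (untested, as noted above): gunicorn config files are plain Python, so the setting is a single assignment. The value 60 is just the one suggested here; a very large model may need more.

```python
# /usr/share/digits/gunicorn_config.py (path as given in this thread)

# Seconds a worker may stay silent before gunicorn kills and restarts it.
# gunicorn's default is 30, which is what produces the WORKER TIMEOUT log lines.
timeout = 60
```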

tommy87 commented 7 years ago

Sorry, I forgot to answer, but increasing the timeout helped me :)

But maybe you shouldn't rely on a timeout. I think it would be better to ask the worker how far along it is, and abort the process only if it doesn't respond or has made no progress.

lukeyeager commented 7 years ago

The problem is that we don't currently use a worker to do the compression - it's done by the server process. That's why the server locks up.

Glad to hear the timeout hack was helpful!

lukeyeager commented 7 years ago

By removing gunicorn in https://github.com/NVIDIA/DIGITS/pull/1127, we've sort of sidestepped the issue for now since you'll be accessing Flask through werkzeug now.

But we still need to address the fact that the server locks up when zipping a big model.
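One possible direction, sketched below, is to build the archive off the request path so the server stays responsive. This is not how DIGITS does it; the function names and the single-thread executor are assumptions for illustration, using only the standard library.

```python
import tarfile
from concurrent.futures import ThreadPoolExecutor

# One background thread so archive jobs queue up instead of piling onto the server.
_executor = ThreadPoolExecutor(max_workers=1)


def build_archive(src_dir, dest_path, compress=True):
    """Write src_dir into a tarball; gzip it only when compress is True.

    dest_path should live outside src_dir so the archive does not include itself.
    """
    mode = "w:gz" if compress else "w"
    with tarfile.open(dest_path, mode) as tar:
        tar.add(src_dir, arcname=".")
    return dest_path


def build_archive_async(src_dir, dest_path, compress=True):
    """Start the archive in the background and return a Future right away."""
    return _executor.submit(build_archive, src_dir, dest_path, compress)
```

The caller can poll `future.done()` to report whether the archive is ready and call `future.result()` to get the path, rather than blocking a request until compression finishes.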

andrewcar commented 6 years ago

By navigating to /usr/share/digits/digits/jobs/ you can see the individual job folders that contain the .caffemodel, .prototxt, .solverstate, .pickle, and .log files.

The GUI download button in DIGITS was failing every time for my caffemodel, but I was able to copy the file off the machine with scp instead.
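If it helps, here is a small sketch for finding those files in a job folder before copying them by hand. The jobs path and the extensions are the ones mentioned above; the helper name is made up for illustration, and your install may keep jobs somewhere else.

```python
from pathlib import Path

# Jobs directory as mentioned in this thread; may differ on your install.
JOBS_DIR = Path("/usr/share/digits/digits/jobs")

# File types found in a job folder, per the comment above.
MODEL_SUFFIXES = {".caffemodel", ".prototxt", ".solverstate", ".pickle", ".log"}


def list_model_files(job_dir, suffixes=MODEL_SUFFIXES):
    """Return the files in job_dir whose extension is in suffixes, sorted by path."""
    job_dir = Path(job_dir)
    return sorted(
        p for p in job_dir.iterdir() if p.is_file() and p.suffix in suffixes
    )
```

For example, `list_model_files(JOBS_DIR / "20160317-160504-f18c", {".caffemodel"})` would list only the snapshot weights for that job, which can then be copied with scp.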

If anyone has any questions, let me know.