Open tommy87 opened 8 years ago
Sorry about that. That's annoying. The problem is that DIGITS is working really hard to compress your huge file, and it can't get the job done before hitting the gunicorn timeout. If you use a link like:
/models/20160317-160504-f18c/download.tar
Instead of
/models/20160317-160504-f18c/download
Then you'll get the uncompressed tarball, which DIGITS should be able to create before it hits the timeout (full list of allowed extensions here).
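For anyone scripting the download rather than clicking the button, here is a minimal sketch of fetching the uncompressed tarball. The function names, the base URL, and the client-side timeout value are assumptions made for illustration; only the /models/<job id>/download.tar path comes from this thread.

```python
import shutil
import urllib.request


def model_download_url(base_url, job_id, extension="tar"):
    """Build the model download URL with an explicit archive extension."""
    return "{}/models/{}/download.{}".format(base_url.rstrip("/"), job_id, extension)


def download_model(base_url, job_id, dest_path, extension="tar", timeout=600):
    """Stream the archive to disk in chunks instead of holding it in memory.

    `timeout` is the client-side socket timeout in seconds (an arbitrary
    value chosen for this sketch, unrelated to the gunicorn worker timeout).
    """
    url = model_download_url(base_url, job_id, extension)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest_path
```

With a hypothetical server at http://localhost, calling download_model("http://localhost", "20160317-160504-f18c", "model.tar") would request the same URL as the link above.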
If that still doesn't work, you should be able to change the gunicorn timeout value (http://docs.gunicorn.org/en/stable/settings.html#timeout). I think you'd want to add timeout = 60 to /usr/share/digits/gunicorn_config.py and restart your server, but I haven't tested it. Let me know if it comes to that and I can help you.
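For reference, a gunicorn config file is plain Python, so the change would probably look like the sketch below. The value 60 is the one suggested above. A very large model may need more, and the right number for a given machine is a guess.

```python
# Addition to /usr/share/digits/gunicorn_config.py (path as given above).
# Seconds a worker may stay silent before gunicorn kills and restarts it.
# gunicorn's default is 30, so this doubles the time the server has to
# finish building the archive.
timeout = 60
```

The server has to be restarted for the new value to take effect.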
Sorry, I forgot to answer, but increasing the timeout helped me :)
But maybe you shouldn't rely on a timeout. I think it would be better to ask the worker how far along it is, and if it doesn't respond or makes no progress, then you can kill the process.
The problem is that we don't currently use a worker to do the compression - it's done by the server process. That's why the server locks up.
Glad to hear the timeout hack was helpful!
By removing gunicorn in https://github.com/NVIDIA/DIGITS/pull/1127, we've sort of sidestepped the issue for now, since you'll be accessing Flask through werkzeug.
But we still need to address the fact that the server locks up when zipping a big model.
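One possible direction, combining this with the progress-polling idea suggested earlier in the thread, is to build the archive on a background thread and let the request handler poll it. The sketch below is not how DIGITS does it. ArchiveTask and all of its methods are hypothetical names, and it uses only the Python standard library.

```python
import os
import tarfile
import threading


class ArchiveTask:
    """Builds an uncompressed tarball on a background thread and tracks progress."""

    def __init__(self, source_dir, dest_path):
        self.source_dir = source_dir
        self.dest_path = dest_path
        self.error = None
        self._total_bytes = 0
        self._done_bytes = 0
        self._lock = threading.Lock()
        self._done = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        try:
            # Collect the file list up front so the total size is known.
            files = [
                os.path.join(root, name)
                for root, _dirs, names in os.walk(self.source_dir)
                for name in names
            ]
            sizes = {path: os.path.getsize(path) for path in files}
            with self._lock:
                self._total_bytes = sum(sizes.values())
            # Mode "w" writes a plain tar with no compression.
            with tarfile.open(self.dest_path, "w") as archive:
                for path in files:
                    archive.add(path, arcname=os.path.relpath(path, self.source_dir))
                    with self._lock:
                        self._done_bytes += sizes[path]
        except Exception as exc:  # surfaced to the caller through .error
            self.error = exc
        finally:
            self._done.set()

    def progress(self):
        """Return (bytes archived so far, total bytes to archive)."""
        with self._lock:
            return self._done_bytes, self._total_bytes

    def finished(self):
        return self._done.is_set()

    def wait(self, timeout=None):
        """Block until the archive is finished; returns False on timeout."""
        return self._done.wait(timeout)
```

A handler could start the task, return immediately, and serve the finished file on a later request once finished() is true. A caller that sees progress() stop changing for too long can give up, instead of relying on a fixed worker timeout.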
By navigating to /usr/share/digits/digits/jobs/ you can see the individual job folders that contain the .caffemodel, .prototxt, .solverstate, .pickle, and .log files.
I was able to "scp" the caffemodel that was failing every time from the GUI download button in DIGITS.
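To find which file to copy, something like the following sketch could list the files in a job folder by size. The extension list is taken from the comment above. The function name is made up for this example, and the actual layout inside a job folder may differ between DIGITS versions.

```python
import os

# File types mentioned above as living in a DIGITS job folder.
JOB_FILE_EXTENSIONS = (".caffemodel", ".prototxt", ".solverstate", ".pickle", ".log")


def list_job_files(job_dir):
    """Return (size_in_bytes, path) pairs for job files, largest first."""
    found = []
    for root, _dirs, names in os.walk(job_dir):
        for name in names:
            if name.endswith(JOB_FILE_EXTENSIONS):
                path = os.path.join(root, name)
                found.append((os.path.getsize(path), path))
    return sorted(found, reverse=True)
```

Run against a folder under /usr/share/digits/digits/jobs/, the first entries are usually the snapshot files worth copying with scp.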
If anyone has any questions, let me know.
When I try to download a big model (Total learned parameters: 192,498,050), I get the message:
Error: unable to connect to NVIDIA DIGITS
and the log file says:
[1271] [CRITICAL] WORKER TIMEOUT (pid: 5706)
[1271] [CRITICAL] WORKER TIMEOUT (pid: 5706)
[5852] [INFO] Booting worker with pid: 5852