Open Purg opened 8 years ago
Since this can happen in the middle of GPU work, it can be left in a state where the GPU doesn't get to free its memory.
FWIW, this seems to be the best course of action: stopping X, calling nvidia-smi --gpu-reset, and starting X again.
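For anyone who wants to script that sequence, here is a rough sketch. It assumes X is managed by a systemd unit reachable as display-manager (that name is an assumption and varies by distro), that the commands run as root, and that the GPU to reset is index 0. The reset_gpu helper is hypothetical, and it defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumption: X is started by a systemd display manager unit. Adjust these
# two commands to however X is actually managed on your machine.
STOP_X = ["systemctl", "stop", "display-manager"]
START_X = ["systemctl", "start", "display-manager"]


def reset_gpu(gpu_index=0, dry_run=True):
    """Hypothetical helper: stop X, reset one GPU, start X again.

    With dry_run=True the commands are only printed, not executed.
    """
    steps = [
        STOP_X,
        # -i selects the device that --gpu-reset should act on.
        ["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)],
        START_X,
    ]
    for cmd in steps:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True aborts the sequence if any step fails.
            subprocess.run(cmd, check=True)
    return steps


if __name__ == "__main__":
    reset_gpu(gpu_index=0, dry_run=True)
```

The reset will likely fail if any process still holds the GPU, which is why X has to be stopped first.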
Haven't seen that yet... I'm assuming it happened to you?
Yes.
Welp, more reason to fix this thing again...
I'm seeing something similar here, @danlamanna and @Purg, when trying the SMQTK quickstart with Docker. I have 50 images and it just hangs building the network. Sometimes it gets to batch 2, sometimes it stays in batch 1:
I0422 04:53:40.881229 18 net.cpp:752] Ignoring source layer loss
DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Network data shape: (10, 3, 227, 227)
DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer
DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -> {'data': (10, 3, 227, 227)}
DEBUG - 2018-04-22 04:53:40,951 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Loading image mean
DEBUG - 2018-04-22 04:53:40,952 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Image mean file not a numpy array, assuming protobuf binary.
DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- mean
DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- transpose
DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- channel swap
INFO - 2018-04-22 04:53:41,329 - __main__.run_file_list - Computing descriptors
DEBUG - 2018-04-22 04:53:41,330 - smqtk.compute_functions.compute_many_descriptors - Using single async call
DEBUG - 2018-04-22 04:53:41,331 - smqtk.compute_functions.compute_many_descriptors - Computing descriptors
DEBUG - 2018-04-22 04:53:41,331 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Checking content types; aggregating data/descriptor elements.
DEBUG - 2018-04-22 04:53:41,332 - smqtk.utils.parallel[check-file-type].parallel_map - Using all cores (2)
DEBUG - 2018-04-22 04:53:42,613 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.report_progress - Loops per second 29.597158 (avg 29.597158) (31 this interval / 31 total)
DEBUG - 2018-04-22 04:53:43,505 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Given 49 unique data elements
DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - 0 descriptors already computed
DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Converting deque to tuple for segmentation
DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Processing 6 batches of size 8
DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Processing tail group of size 1
DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Starting batch: 1 of 6
DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._process_batch - Updating network data layer shape (8 images)
DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._process_batch - Loading image pixel arrays
DEBUG - 2018-04-22 04:53:43,912 - smqtk.utils.parallel.parallel_map - Using all cores (2)
Any ideas?
BTW, I'm using the SMQTK and Image Space quickstart Docker images, the ones that reference one another.
FWIW, I was able to get this working, but only by repeatedly stopping and starting the smqtk-services Docker container, over and over. Occasionally it makes it all the way through my 6 batches of ~50 images, but 90% of the time it just hangs.
When Ctrl-C'ing a parallel-map in progress, a deadlock can occur.
It has also been seen that if the workers are doing web requests, they can lock up, possibly due to an infinite wait in the request. When the threads or processes are then killed externally, the function deadlocks and can't clean itself up properly.
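To make the failure mode concrete, here is a minimal sketch of a parallel map that bounds how long the caller waits on each result and cancels queued work on Ctrl-C. It is not SMQTK's parallel_map: bounded_parallel_map and its parameters are hypothetical, and it uses only the standard library's concurrent.futures. The timeout only stops the caller from waiting; it does not kill a stuck worker, so web requests inside the worker should still carry their own timeout (for example the timeout argument of requests.get).

```python
import concurrent.futures


def bounded_parallel_map(func, items, workers=2, per_item_timeout=5.0):
    """Hypothetical stand-in for a parallel map that cannot wait forever.

    Results come back in input order; an item whose result is not ready
    within per_item_timeout seconds of being waited on yields None.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=workers)
    futures = [pool.submit(func, item) for item in items]
    results = []
    try:
        for fut in futures:
            try:
                # Bounded wait: a hung worker cannot block the caller forever.
                results.append(fut.result(timeout=per_item_timeout))
            except concurrent.futures.TimeoutError:
                results.append(None)
    except KeyboardInterrupt:
        # Ctrl-C: drop work that has not started yet, then re-raise.
        for fut in futures:
            fut.cancel()
        raise
    finally:
        # Do not block here waiting for workers that may be stuck.
        pool.shutdown(wait=False)
    return results
```

Note that worker threads which are already running are still joined when the interpreter exits, so a worker that truly never returns will still hold the process open. That is one more reason to put the timeout on the request itself.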