Closed: nikky4D closed this issue 1 year ago

I am running evaluation on the tinierssd weights saved in trained/ from the repo. However, I am getting the error, and everything pauses after it. I am able to evaluate other models (cifar10, mnist), but this keeps happening for tinierssd with SVHN. Can someone direct me as to what I can do to fix it?
I was able to fix it by changing the workers input to 0 in train.py.
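For reference, here is a minimal sketch of what that change amounts to; I'm assuming (I haven't traced train.py in detail) that the --workers value is passed straight through as the DataLoader's num_workers:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset (the real run uses SVHN). With num_workers=0, batches are
# loaded in the main process, so PyTorch never creates the /dev/shm torch_*
# shared-memory objects that the worker feeder processes were failing on.
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loader = DataLoader(dataset, batch_size=16, num_workers=0)  # 0 = no worker processes
for images, labels in loader:
    pass  # evaluation would run here
```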
Thanks for your feedback! We haven't seen this one, but we have seen cases where it requires --workers=1. Does it work only with 0, or also with 1? It's a GPU resource limitation.
I've tried it with --workers=1, and I get the same error, only this time a single pair of error lines rather than four pairs.
Using the evaluate_svhn_tinierssd script with no changes:
(ai8x-training-py38) nuzuegbunam@NUZUEGBUNAM:~/Projects/ARGUS/training/ai8x-training$ scripts/evaluate_svhn_tinierssd.sh
Configuring device: MAX78000, simulate=True.
Log file for this run: /home/nuzuegbunam/Projects/ARGUS/training/ai8x-training/logs/2023.04.28-115437/2023.04.28-115437.log
{'start_epoch': 25, 'weight_bits': 8, 'shift_quantile': 0.995}
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
=> loading checkpoint ../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar
=> Checkpoint contents:
+----------------------+-------------+---------------+
| Key | Type | Value |
|----------------------+-------------+---------------|
| arch | str | ai85tinierssd |
| compression_sched | dict | |
| epoch | int | 57 |
| extras | dict | |
| optimizer_state_dict | dict | |
| optimizer_type | type | Adam |
| state_dict | OrderedDict | |
+----------------------+-------------+---------------+
=> Checkpoint['extras'] contents:
+-----------------+--------+---------------+
| Key | Type | Value |
|-----------------+--------+---------------|
| best_epoch | int | 57 |
| best_mAP | float | 1.0 |
| best_top1 | int | 0 |
| clipping_method | str | MAX_BIT_SHIFT |
| current_mAP | float | 1.0 |
| current_top1 | int | 0 |
+-----------------+--------+---------------+
Loaded compression schedule from checkpoint (epoch 57)
=> loaded 'state_dict' from checkpoint '../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar'
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Train dataset length: 28548
Test dataset length: 12251
Dataset sizes:
training=25694
validation=2854
test=12251
--- test ---------------------
12251 samples (256 per mini-batch)
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5716_1742722683> in read-write mode
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
Traceback (most recent call last):
RuntimeError: unable to open shared memory object </torch_5704_3516087661> in read-write mode
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5740_4021780443> in read-write mode
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5728_2881106096> in read-write mode
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5704_4120698264> in read-write mode
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5716_2049793226> in read-write mode
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5740_1593117888> in read-write mode
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5728_2656581644> in read-write mode
Using the evaluate_svhn_tinierssd script with --workers=1:
scripts/evaluate_svhn_tinierssd.sh
Configuring device: MAX78000, simulate=True.
Log file for this run: /home/nuzuegbunam/Projects/ARGUS/training/ai8x-training/logs/2023.04.28-115746/2023.04.28-115746.log
{'start_epoch': 25, 'weight_bits': 8, 'shift_quantile': 0.995}
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
=> loading checkpoint ../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar
=> Checkpoint contents:
+----------------------+-------------+---------------+
| Key | Type | Value |
|----------------------+-------------+---------------|
| arch | str | ai85tinierssd |
| compression_sched | dict | |
| epoch | int | 57 |
| extras | dict | |
| optimizer_state_dict | dict | |
| optimizer_type | type | Adam |
| state_dict | OrderedDict | |
+----------------------+-------------+---------------+
=> Checkpoint['extras'] contents:
+-----------------+--------+---------------+
| Key | Type | Value |
|-----------------+--------+---------------|
| best_epoch | int | 57 |
| best_mAP | float | 1.0 |
| best_top1 | int | 0 |
| clipping_method | str | MAX_BIT_SHIFT |
| current_mAP | float | 1.0 |
| current_top1 | int | 0 |
+-----------------+--------+---------------+
Loaded compression schedule from checkpoint (epoch 57)
=> loaded 'state_dict' from checkpoint '../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar'
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Train dataset length: 28548
Test dataset length: 12251
Dataset sizes:
training=25694
validation=2854
test=12251
--- test ---------------------
12251 samples (256 per mini-batch)
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5822_1739173315> in read-write mode
Traceback (most recent call last):
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5822_754801957> in read-write mode
I don't see the errors when I set --workers=0; I get the following:
scripts/evaluate_svhn_tinierssd.sh
Configuring device: MAX78000, simulate=True.
Log file for this run: /home/nuzuegbunam/Projects/ARGUS/training/ai8x-training/logs/2023.04.28-115940/2023.04.28-115940.log
{'start_epoch': 25, 'weight_bits': 8, 'shift_quantile': 0.995}
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
=> loading checkpoint ../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar
=> Checkpoint contents:
+----------------------+-------------+---------------+
| Key | Type | Value |
|----------------------+-------------+---------------|
| arch | str | ai85tinierssd |
| compression_sched | dict | |
| epoch | int | 57 |
| extras | dict | |
| optimizer_state_dict | dict | |
| optimizer_type | type | Adam |
| state_dict | OrderedDict | |
+----------------------+-------------+---------------+
=> Checkpoint['extras'] contents:
+-----------------+--------+---------------+
| Key | Type | Value |
|-----------------+--------+---------------|
| best_epoch | int | 57 |
| best_mAP | float | 1.0 |
| best_top1 | int | 0 |
| clipping_method | str | MAX_BIT_SHIFT |
| current_mAP | float | 1.0 |
| current_top1 | int | 0 |
+-----------------+--------+---------------+
Loaded compression schedule from checkpoint (epoch 57)
=> loaded 'state_dict' from checkpoint '../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar'
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Train dataset length: 28548
Test dataset length: 12251
Dataset sizes:
training=25694
validation=2854
test=12251
--- test ---------------------
12251 samples (256 per mini-batch)
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
Test: [ 48/ 48] Loss 69.324744 mAP 0.750842
==> mAP: 0.75084 Loss: 69.325
Log file for this run: /home/nuzuegbunam/Projects/ARGUS/training/ai8x-training/logs/2023.04.28-115940/2023.04.28-115940.log
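For what it's worth, my guess (not confirmed) is that the trigger is a small shared-memory mount rather than the model itself, since the failing torch_* objects live in /dev/shm. A quick way to check its size:

```python
import shutil

# Report the size of the shared-memory mount where PyTorch workers create
# their torch_* objects. A small /dev/shm (common under WSL or with Docker
# defaults) would explain "unable to open shared memory object".
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
```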
Thanks again! We will add this to the documentation.
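If anyone needs to keep multiple workers, one possible workaround (untested on our side) is to have PyTorch share worker tensors through temporary files instead of shared-memory objects:

```python
import torch.multiprocessing as mp

# Route inter-process tensor sharing through the filesystem instead of
# /dev/shm objects. Slower, but it sidesteps "unable to open shared memory
# object" when the shared-memory mount is small. Must be called before any
# DataLoader with num_workers > 0 is created.
mp.set_sharing_strategy("file_system")
```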