analogdevicesinc / ai8x-synthesis

Quantization and Synthesis (Device Specific Code Generation) for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

sample tinierssd evaluation after synthesis throwing errors #293

Closed. nikky4D closed this issue 1 year ago.

nikky4D commented 1 year ago

I am running evaluation on the tinierssd weights saved in trained/ in the repo. However, I am getting an error, and everything hangs after:

[screenshot: error output]

Can someone advise on what I can do to fix it?

My full printout is this:

[screenshot: full console output]

I am able to evaluate other models (cifar10, mnist), but this keeps happening for tinierssd with SVHN.

nikky4D commented 1 year ago

I was able to fix it by changing the workers argument to 0 in train.py.
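
For reference, here is roughly where that setting ends up; a minimal sketch of a PyTorch DataLoader built with num_workers=0 (the dataset and names are illustrative, not the actual ai8x-training code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the SVHN test set.
dataset = TensorDataset(torch.randn(32, 3, 74, 74),
                        torch.zeros(32, dtype=torch.long))

# With num_workers=0, all batches are loaded in the main process,
# so no worker subprocesses and no shared-memory handoff occur.
test_loader = DataLoader(dataset, batch_size=8, num_workers=0)

for images, labels in test_loader:
    pass  # evaluation would run here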

rotx-maxim commented 1 year ago

Thanks for your feedback! We haven't seen this one, but we have seen cases where it requires --workers=1. Does it work only with 0, or also with 1? It's a GPU resource limitation.
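
If reducing the worker count is not an option, one workaround sometimes suggested for this specific RuntimeError (an assumption here, not something we have verified on this setup) is to switch PyTorch's tensor sharing strategy away from POSIX shared memory before any DataLoader is created:

import torch.multiprocessing

# 'file_system' passes tensors between processes through files
# instead of the /torch_* shared-memory objects named in the
# traceback above.
torch.multiprocessing.set_sharing_strategy('file_system')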

nikky4D commented 1 year ago

I've tried it with --workers=1, and I get the same error; only this time there is a single pair of the error lines rather than four pairs.

[screenshot: error output showing a single pair of error lines]

nikky4D commented 1 year ago

Using the evaluate_tinierssd script with no changes:

(ai8x-training-py38) nuzuegbunam@NUZUEGBUNAM:~/Projects/ARGUS/training/ai8x-training$ scripts/evaluate_svhn_tinierssd.sh
Configuring device: MAX78000, simulate=True.
Log file for this run: /home/nuzuegbunam/Projects/ARGUS/training/ai8x-training/logs/2023.04.28-115437/2023.04.28-115437.log
{'start_epoch': 25, 'weight_bits': 8, 'shift_quantile': 0.995}
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
=> loading checkpoint ../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar
=> Checkpoint contents:
+----------------------+-------------+---------------+
| Key                  | Type        | Value         |
|----------------------+-------------+---------------|
| arch                 | str         | ai85tinierssd |
| compression_sched    | dict        |               |
| epoch                | int         | 57            |
| extras               | dict        |               |
| optimizer_state_dict | dict        |               |
| optimizer_type       | type        | Adam          |
| state_dict           | OrderedDict |               |
+----------------------+-------------+---------------+

=> Checkpoint['extras'] contents:
+-----------------+--------+---------------+
| Key             | Type   | Value         |
|-----------------+--------+---------------|
| best_epoch      | int    | 57            |
| best_mAP        | float  | 1.0           |
| best_top1       | int    | 0             |
| clipping_method | str    | MAX_BIT_SHIFT |
| current_mAP     | float  | 1.0           |
| current_top1    | int    | 0             |
+-----------------+--------+---------------+

Loaded compression schedule from checkpoint (epoch 57)
=> loaded 'state_dict' from checkpoint '../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar'
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Train dataset length: 28548

Test dataset length: 12251

Dataset sizes:
        training=25694
        validation=2854
        test=12251
--- test ---------------------
12251 samples (256 per mini-batch)
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5716_1742722683> in read-write mode
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
Traceback (most recent call last):
RuntimeError: unable to open shared memory object </torch_5704_3516087661> in read-write mode
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5740_4021780443> in read-write mode
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5728_2881106096> in read-write mode
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5704_4120698264> in read-write mode
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5716_2049793226> in read-write mode
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5740_1593117888> in read-write mode
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5728_2656581644> in read-write mode

Using the evaluate_tinierssd script with --workers=1:

scripts/evaluate_svhn_tinierssd.sh
Configuring device: MAX78000, simulate=True.
Log file for this run: /home/nuzuegbunam/Projects/ARGUS/training/ai8x-training/logs/2023.04.28-115746/2023.04.28-115746.log
{'start_epoch': 25, 'weight_bits': 8, 'shift_quantile': 0.995}
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
=> loading checkpoint ../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar
=> Checkpoint contents:
+----------------------+-------------+---------------+
| Key                  | Type        | Value         |
|----------------------+-------------+---------------|
| arch                 | str         | ai85tinierssd |
| compression_sched    | dict        |               |
| epoch                | int         | 57            |
| extras               | dict        |               |
| optimizer_state_dict | dict        |               |
| optimizer_type       | type        | Adam          |
| state_dict           | OrderedDict |               |
+----------------------+-------------+---------------+

=> Checkpoint['extras'] contents:
+-----------------+--------+---------------+
| Key             | Type   | Value         |
|-----------------+--------+---------------|
| best_epoch      | int    | 57            |
| best_mAP        | float  | 1.0           |
| best_top1       | int    | 0             |
| clipping_method | str    | MAX_BIT_SHIFT |
| current_mAP     | float  | 1.0           |
| current_top1    | int    | 0             |
+-----------------+--------+---------------+

Loaded compression schedule from checkpoint (epoch 57)
=> loaded 'state_dict' from checkpoint '../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar'
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Train dataset length: 28548

Test dataset length: 12251

Dataset sizes:
        training=25694
        validation=2854
        test=12251
--- test ---------------------
12251 samples (256 per mini-batch)
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5822_1739173315> in read-write mode
Traceback (most recent call last):
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
  File "/home/nuzuegbunam/miniconda3/envs/ai8x-training-py38/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_5822_754801957> in read-write mode

I don't see the errors when I set --workers=0; I get the following:

scripts/evaluate_svhn_tinierssd.sh
Configuring device: MAX78000, simulate=True.
Log file for this run: /home/nuzuegbunam/Projects/ARGUS/training/ai8x-training/logs/2023.04.28-115940/2023.04.28-115940.log
{'start_epoch': 25, 'weight_bits': 8, 'shift_quantile': 0.995}
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
=> loading checkpoint ../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar
=> Checkpoint contents:
+----------------------+-------------+---------------+
| Key                  | Type        | Value         |
|----------------------+-------------+---------------|
| arch                 | str         | ai85tinierssd |
| compression_sched    | dict        |               |
| epoch                | int         | 57            |
| extras               | dict        |               |
| optimizer_state_dict | dict        |               |
| optimizer_type       | type        | Adam          |
| state_dict           | OrderedDict |               |
+----------------------+-------------+---------------+

=> Checkpoint['extras'] contents:
+-----------------+--------+---------------+
| Key             | Type   | Value         |
|-----------------+--------+---------------|
| best_epoch      | int    | 57            |
| best_mAP        | float  | 1.0           |
| best_top1       | int    | 0             |
| clipping_method | str    | MAX_BIT_SHIFT |
| current_mAP     | float  | 1.0           |
| current_top1    | int    | 0             |
+-----------------+--------+---------------+

Loaded compression schedule from checkpoint (epoch 57)
=> loaded 'state_dict' from checkpoint '../ai8x-synthesis/trained/ai85-svhn-tinierssd-qat8-q.pth.tar'
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Train dataset length: 28548

Test dataset length: 12251

Dataset sizes:
        training=25694
        validation=2854
        test=12251
--- test ---------------------
12251 samples (256 per mini-batch)
{'multi_box_loss': {'alpha': 2, 'neg_pos_ratio': 3}, 'nms': {'min_score': 0.2, 'max_overlap': 0.3, 'top_k': 20}}
Test: [   48/   48]    Loss 69.324744    mAP 0.750842
==> mAP: 0.75084    Loss: 69.325

Log file for this run: /home/nuzuegbunam/Projects/ARGUS/training/ai8x-training/logs/2023.04.28-115940/2023.04.28-115940.log
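
Since every failure above happens while opening a shared-memory object, checking /dev/shm headroom is a quick diagnostic; a minimal sketch (not part of the repo, and only one possible cause):

import shutil

# The /torch_* objects from the tracebacks live in /dev/shm;
# too little free space there can contribute to this failure.
total, used, free = shutil.disk_usage('/dev/shm')
print(f"/dev/shm: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
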
rotx-maxim commented 1 year ago

Thanks again! We will add this to the documentation.