NVIDIA / tao_tutorials

Quick start scripts and tutorial notebooks to get started with TAO Toolkit
Apache License 2.0
46 stars · 12 forks

incorrect evaluation results visualized for CenterPose #4

Closed monajalal closed 8 months ago

monajalal commented 8 months ago

Hello

I trained TAO CenterPose using the centerpose.ipynb notebook. However, the evaluation results seem incorrect.

Could you please provide or update the notebook with the results of your evaluation?

I am not sure what has caused this or how it could be fixed.

Validation: 0it [00:00, ?it/s]
Validation:   0%|                                        | 0/47 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                           | 0/47 [00:00<?, ?it/s]
Epoch 39:  64%|▋| 81/127 [01:09<00:39,  1.17it/s, loss=17.2, v_num=0, train_loss
...
Epoch 39: 100%|█| 127/127 [01:28<00:00,  1.44it/s, loss=17.2, v_num=0, train_los
 Validation 3DIoU : 0.0

 Validation 2DMPE : 0.23278719567310138

Train and Val metrics generated.
Training loop in progress
`Trainer.fit` stopped: `max_epochs=40` reached.
Training loop complete.
Training finished successfully
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: PASS
2024-02-29 16:03:21,618 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
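A Validation 3DIoU of 0.0 means the predicted 3D boxes never overlap the ground truth. For reference on what the metric measures, here is a minimal sketch of 3D IoU for axis-aligned boxes; note this is only illustrative, since CenterPose evaluates oriented 3D boxes:

```python
# Hedged sketch: axis-aligned 3D IoU between two boxes, each given as
# (min_corner, max_corner) xyz tuples. CenterPose's actual metric handles
# oriented boxes; this only illustrates the metric's definition.
def iou_3d_axis_aligned(a_min, a_max, b_min, b_max):
    inter = 1.0
    vol_a = 1.0
    vol_b = 1.0
    for i in range(3):
        lo = max(a_min[i], b_min[i])
        hi = min(a_max[i], b_max[i])
        inter *= max(0.0, hi - lo)       # overlap extent along this axis
        vol_a *= a_max[i] - a_min[i]
        vol_b *= b_max[i] - b_min[i]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0
```

With this definition, non-overlapping boxes score exactly 0, so a 0.0 validation score points to predictions that are far from the ground truth rather than merely imprecise.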

[Screenshots of the evaluation visualizations: 2024-02-29 16-13-29 and 2024-02-29 16-13-35]

This is what I got for evaluation:

2024-02-29 16:08:41,839 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-02-29 16:08:41,880 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt2.1.0
2024-02-29 16:08:41,897 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
There was a problem when trying to write in your cache folder (/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
There was a problem when trying to write in your cache folder (/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
sys:1: UserWarning: 
'evaluate.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
<frozen core.hydra.hydra_runner>:-1: UserWarning: 
'evaluate.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Evaluate results will be saved at: /results
Starting CenterPose evaluation
Initializing test cereal_box data.
Loaded test 185 samples
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/lightning_logs
Initializing test cereal_box data.
Loaded test 185 samples
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing DataLoader 0:   0%|                              | 0/47 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Testing DataLoader 0: 100%|█████████████████████| 47/47 [00:17<00:00,  2.74it/s]**********************Start logging Evaluation Results **********************
*************** 3D IoU *****************
3D IoU: 0.09549
*************** 2D MPE *****************
2D MPE: 0.23279
Evaluation metrics generated.
Testing DataLoader 0: 100%|█████████████████████| 47/47 [00:17<00:00,  2.74it/s]
Evaluation finished successfully
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: PASS
2024-02-29 16:09:07,611 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
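For context on the reported 2D MPE of about 0.233: the metric is the mean pixel error of the projected keypoints. A minimal sketch, assuming normalization by the image diagonal (CenterPose's exact normalization may differ):

```python
import math

# Hedged sketch: 2D mean pixel error (MPE) as the mean Euclidean distance
# between predicted and ground-truth 2D keypoints, normalized by the image
# diagonal. Names are illustrative, not CenterPose's internals.
def mean_pixel_error(pred_kps, gt_kps, img_w, img_h):
    diag = math.hypot(img_w, img_h)
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred_kps, gt_kps)]
    return sum(dists) / (len(dists) * diag)
```

Under a diagonal normalization like this, 0.233 corresponds to keypoints that are off by roughly a quarter of the image diagonal on average, which is consistent with the 3D IoU of ~0.095 above.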

This is what I got for the visualization cell:

2024-02-29 16:09:24,494 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-02-29 16:09:24,537 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt2.1.0
2024-02-29 16:09:24,553 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
There was a problem when trying to write in your cache folder (/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
There was a problem when trying to write in your cache folder (/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
sys:1: UserWarning: 
'infer.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
<frozen core.hydra.hydra_runner>:-1: UserWarning: 
'infer.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Inference results will be saved at: /results/inference
Starting CenterPose inference
Initializing 185 inference images.
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/inference/lightning_logs
Initializing 185 inference images.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0:   0%|                           | 0/47 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Predicting DataLoader 0: 100%|██████████████████| 47/47 [00:20<00:00,  2.26it/s]
Inference finished successfully.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: PASS
2024-02-29 16:09:53,877 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
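As an aside, the repeated "problem when trying to write in your cache folder (/.cache/huggingface/hub)" warning is harmless here, but it can be silenced by pointing the Hugging Face cache at a writable location before the TAO command runs. A minimal sketch (the path is an example; for the containerized run the variable also has to be visible inside the container, e.g. via the launcher's mounts configuration):

```python
import os

# Point the Hugging Face cache at a writable path (example location) so the
# "/.cache/huggingface/hub" warning no longer appears.
cache_dir = os.path.expanduser("~/.cache/huggingface/hub")
os.makedirs(cache_dir, exist_ok=True)
os.environ["TRANSFORMERS_CACHE"] = cache_dir
```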

And this is the output of the pip install cell:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com/
Collecting matplotlib==3.3.3
  Downloading matplotlib-3.3.3.tar.gz (37.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.9/37.9 MB 8.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: cycler>=0.10 in /home/mona/.local/lib/python3.10/site-packages (from matplotlib==3.3.3) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/mona/.local/lib/python3.10/site-packages (from matplotlib==3.3.3) (1.4.4)
Requirement already satisfied: numpy>=1.15 in /home/mona/anaconda3/envs/sdgpose/lib/python3.10/site-packages (from matplotlib==3.3.3) (1.25.2)
Requirement already satisfied: pillow>=6.2.0 in /home/mona/anaconda3/envs/sdgpose/lib/python3.10/site-packages (from matplotlib==3.3.3) (10.0.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /home/mona/anaconda3/envs/sdgpose/lib/python3.10/site-packages (from matplotlib==3.3.3) (3.1.1)
Requirement already satisfied: python-dateutil>=2.1 in /home/mona/.local/lib/python3.10/site-packages (from matplotlib==3.3.3) (2.8.2)
Requirement already satisfied: six>=1.5 in /home/mona/anaconda3/envs/sdgpose/lib/python3.10/site-packages (from python-dateutil>=2.1->matplotlib==3.3.3) (1.15.0)
Building wheels for collected packages: matplotlib
  Building wheel for matplotlib (setup.py) ... done
  Created wheel for matplotlib: filename=matplotlib-3.3.3-cp310-cp310-linux_x86_64.whl size=8463990 sha256=088646468530213880688be2d4375b3d7ed79de3679cd9e467f8fa8b4b37ade4
  Stored in directory: /tmp/pip-ephem-wheel-cache-256m56c6/wheels/33/5f/fa/7686ebdceeeb90490e122e35f2a9e6c09affb891787d0921a6
Successfully built matplotlib
Installing collected packages: matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.7.1
    Uninstalling matplotlib-3.7.1:
      Successfully uninstalled matplotlib-3.7.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
open3d 0.16.0 requires nbformat==5.5.0, but you have nbformat 5.9.2 which is incompatible.
Successfully installed matplotlib-3.3.3

[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
Arun-George-Zachariah commented 8 months ago

Hi @monajalal. Can you confirm whether you are using the default settings? Note that the sample spec/default settings are not meant to produce SOTA (state-of-the-art) accuracy on the Objectron dataset. To reproduce SOTA, you should set TRAIN_FR to 15, the number of epochs to 140, and DATA_DOWNLOAD to -1 to match the original parameters.
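The suggested values above could be set in the notebook along these lines; this is a hedged sketch, and the variable names (in particular the epoch count, which likely lives in the train spec rather than an environment variable) may differ from the notebook's actual settings:

```python
import os

# Illustrative settings per the maintainer's suggestion for reproducing SOTA.
# Names are assumptions based on the comment above, not verified against the
# notebook.
os.environ["TRAIN_FR"] = "15"       # frame sampling rate for Objectron videos
os.environ["DATA_DOWNLOAD"] = "-1"  # -1 = download the full dataset
NUM_EPOCHS = 140                    # epoch count for the train spec (was 40)
```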

monajalal commented 8 months ago

@Arun-George-Zachariah Thank you so much for looking into this. Yes, I only followed the current settings in the notebook. I will try the new settings and report back.