NVIDIA-AI-IOT / tao_toolkit_recipes


seems like SHAD dataset is no longer available? #1

Open chophilip21 opened 2 years ago

chophilip21 commented 2 years ago

It seems like the SHAD dataset is not available anymore.

The command below returns an error:

wget -P ./ https://best.sjtu.edu.cn/Assets/userfiles/sys_eb538c1c-65ff-4e82-8e6a-a1ef01127fed/files/ZIP/Bend-train.rar

Do you have any other links available for this?

Also, if I want to generate optical flow data from a custom dataset, what is the procedure? Should I use the NVIDIA Optical Flow (NVOF) SDK?

Tyler-D commented 2 years ago

1) We only have the official link to the SHAD dataset.

2) If you want to use the NVOF SDK to generate optical flow (and you have Turing or Ampere devices), you can download the NVOF-based binary that ships with the action recognition notebook from NGC:

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/cv_samples/versions/v1.3.0/files/action_recognition_net/AppOFCuda

The binary is called in the preprocess script like this: https://github.com/NVIDIA-AI-IOT/tao_toolkit_recipes/blob/main/tao_action_recognition/data_generation/preprocess_SHAD.sh#L40

./AppOFCuda --input=${RGB_PATH_LIST[i]}/"*.png" --output=${OF_PATH_LIST[i]}/"flow" --preset=slow --gridSize=1

Tyler-D commented 2 years ago

It is also fine to generate optical flow using OpenCV. The TAO Toolkit does not care where your optical flow vectors come from.

A reference implementation: https://github.com/yjxiong/temporal-segment-networks
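
For example, a rough sketch using OpenCV's Farneback dense flow (the paths and the clipping range here are placeholder assumptions; the TSN repo above uses TV-L1, which is available in opencv-contrib):

import glob
import os

import cv2
import numpy as np

rgb_dir = "./rgb"    # assumed folder of extracted RGB frames
flow_dir = "./flow"  # assumed output folder for flow images
os.makedirs(flow_dir, exist_ok=True)

frames = sorted(glob.glob(os.path.join(rgb_dir, "*.png")))
prev = cv2.cvtColor(cv2.imread(frames[0]), cv2.COLOR_BGR2GRAY)

for i, path in enumerate(frames[1:], start=1):
    curr = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    # Dense optical flow between consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Clip the flow to [-20, 20] and rescale to 8-bit, one image per x/y component.
    flow = np.clip(flow, -20, 20)
    flow = ((flow + 20) / 40 * 255).astype(np.uint8)
    cv2.imwrite(os.path.join(flow_dir, "flow_x_%05d.png" % i), flow[..., 0])
    cv2.imwrite(os.path.join(flow_dir, "flow_y_%05d.png" % i), flow[..., 1])
    prev = curr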

chophilip21 commented 2 years ago

Thanks for the quick follow-up!

Before following your suggestions above, I wanted to test my custom data with the 2D RGB settings for TAO training, and I have run into some issues. I would like to hear your insights on this.

I have previously trained multiple TAO models, but this is my first time training an action recognition model. My data, however, is not of humans; it consists of 480p videos of cows performing the following action classes: eating, sitting, walking, and standing.

I realize there are differences between my custom dataset and the HMDB dataset, but I had no problem running your preprocess_HMDB_RGB.sh script on my custom data.

But TAO training is terminated immediately when I run:

tao action_recognition train -e /workspace/spec/action_recognition_cow.txt -r /workspace/results/cow_activity -k nvidia_tlt

The error message I get is this:

2021-12-20 16:42:05,000 [INFO] root: Registry: ['nvcr.io']
2021-12-20 16:42:05,047 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3
Error executing job with overrides: ['output_dir=/workspace/results/cow_activity', 'encryption_key=nvidia_tlt']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 70, in main
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 22, in run_experiment
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/pl_ar_model.py", line 29, in __init__
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/pl_ar_model.py", line 36, in _build_model
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/model/build_nn_model.py", line 14, in build_ar_model
AttributeError: 'NoneType' object has no attribute 'keys'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/action_recognition/scripts/train.py", line 76, in <module>
  File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/cv/super_resolution/scripts/configs/hydra_runner.py", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
2021-12-20 16:42:10,017 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

It looks like the problem has to do with the data (being treated as NoneType?). Let me know if you would like to see my config file too; I cannot figure out what exactly I have missed.

Tyler-D commented 2 years ago

You can share your config here.

chophilip21 commented 2 years ago

Sure.

I followed the official documentation and did a manual split of the videos into train and test folders.

model_config:
  model_type: rgb
  backbone: resnet18
  rgb_seq_length: 3
  input_type: 2d
  sample_rate: 1
  dropout_ratio: 0.0
train_config:
  optim:
    lr: 0.01
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 25]
    lr_decay: 0.1
  epochs: 30
  checkpoint_interval: 1
dataset_config:
  train_dataset_dir: /workspace/dataset/cow_activity/train
  val_dataset_dir: /workspace/dataset/cow_activity/test
  label_map:
    eating: 0
    lying: 1
    standing: 2
    walking: 3
  output_shape:
  - 224
  - 224
  batch_size: 16
  workers: 8
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False

Do you see anything off in the config itself? I assumed I cannot use any of the pretrained weights, since the available ones are 5-class samples based on human activities.
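
For reference, here is a quick sanity check over my split, assuming the <class>/<video>/rgb/*.png layout that the preprocessing script produced (the paths are placeholders):

from pathlib import Path

# Count clips and RGB frames per class in each split to make sure
# nothing ended up empty after the manual train/test split.
for split in ("train", "test"):
    root = Path("/workspace/dataset/cow_activity") / split
    for cls in sorted(p for p in root.iterdir() if p.is_dir()):
        videos = [v for v in cls.iterdir() if v.is_dir()]
        frames = sum(len(list((v / "rgb").glob("*.png"))) for v in videos)
        print(f"{split}/{cls.name}: {len(videos)} clips, {frames} rgb frames")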

Tyler-D commented 2 years ago

The config looks good. But you should save the config to a .yaml file instead of .txt.

chophilip21 commented 2 years ago

Ah okay, I thought the file extension was .txt like for the other models. Thanks, changing to .yaml works perfectly fine.

chophilip21 commented 2 years ago

Sorry, one more question.

I am trying to export a test model and integrate it into my DeepStream pipeline for testing, and I am facing an unusual error here.

tao action_recognition export -k nvidia_tlt \
                              -e /workspace/spec/action_recognition_cow.yaml \
                              model=/workspace/cow_activity/ar_model_epoch=09-val_loss=2.15.tlt \
                              output_file=/workspace/cow_activity/test.etlt

There should be no problem running the above command and getting an .etlt model, but instead I am getting:

2021-12-21 13:02:08,726 [INFO] root: Registry: ['nvcr.io']
2021-12-21 13:02:08,770 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.21.11-py3
mismatched input '=' expecting <EOF>
See https://hydra.cc/docs/next/advanced/override_grammar/basic for details

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
2021-12-21 13:02:13,324 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The official documentation says I need to use model=$some_model instead of flags like -m or -o, but it seems like that's not what the script wants. Any idea what is going on?

Tyler-D commented 2 years ago

Remove the = in the model name ar_model_epoch=09-val_loss=2.15.tlt.

It's a known issue we mentioned in the notebook:

2) "=" in the checkpoint file name should be removed before using the checkpoint in a command.

We will add this note to the doc.
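
In the meantime, a small sketch to strip the "=" from existing checkpoint names (the directory path is a placeholder):

from pathlib import Path

# Rename e.g. ar_model_epoch=09-val_loss=2.15.tlt -> ar_model_epoch09-val_loss2.15.tlt
# so Hydra no longer tries to parse the "=" as an override.
ckpt_dir = Path("/workspace/cow_activity")
for ckpt in ckpt_dir.glob("*.tlt"):
    if "=" in ckpt.name:
        ckpt.rename(ckpt.with_name(ckpt.name.replace("=", "")))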

chophilip21 commented 2 years ago

Ah, I see. I think I found another possible bug that should be fixed.

The export command now runs, but there is another error coming from the config file that I wanted to point out:

Error merging 'action_recognition_cow.yaml' with schema
Key 'train_config' not in 'ARExportExpConfig'
    full_key: train_config
    object_type=ARExportExpConfig

The train_config section is a default part of the spec file, but export wants it removed, which seems strange. I was able to comment the section out and export the model to .etlt, but I am assuming the default behaviour isn't supposed to be like this.

Tyler-D commented 2 years ago

Emmm, the default behavior is like this. The train_config is not needed in the export phase, so we just remove it. You can see there is an export_rgb.yaml for export in the notebook. But I think your suggestion is good: it would be more user-friendly if customers could export the model with the training .yaml, since it contains everything export needs after all.
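
If you want to avoid maintaining a separate spec by hand for now, a small sketch that derives an export spec from the training spec by dropping train_config (the file names are placeholders):

import yaml

# Load the training spec, drop the section export does not accept,
# and write the result out as a dedicated export spec.
with open("action_recognition_cow.yaml") as f:
    spec = yaml.safe_load(f)

spec.pop("train_config", None)

with open("export_cow.yaml", "w") as f:
    yaml.safe_dump(spec, f)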

chophilip21 commented 2 years ago

Got it, thanks.

wlfAI commented 2 years ago

I encountered some problems when training on the HMDB51 dataset: the files could not be indexed, and there was always an error saying the file does not exist. My command is:

tao action_recognition train -e /root/tao/resnet18.yaml -r /root/tao/result