P3 Training Fail: ValueError: num_samples should be a positive integer value, but got num_samples=0

cardboardcode commented 2 years ago

Issue Description

Encountered the following error when attempting to train a Precision Level 3 MaskRCNN model using EPD. The error first appeared after the .yaml parser was integrated into P3Trainer.py.

Traceback (most recent call last):
  File "tools/train_net.py", line 201, in <module>
    main()
  File "tools/train_net.py", line 194, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 72, in train
    start_iter=arguments["iteration"],
  File "/home/cardboardvoice/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/data/build.py", line 164, in make_data_loader
    sampler = make_data_sampler(dataset, shuffle, is_distributed)
  File "/home/cardboardvoice/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/data/build.py", line 64, in make_data_sampler
    sampler = torch.utils.data.sampler.RandomSampler(dataset)
  File "/home/cardboardvoice/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
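The last frames of the traceback show RandomSampler being built from a dataset that reports a length of 0. One way to rule the training data in or out is to count how many images in the annotation file carry at least one usable annotation. The following is only a minimal sketch, assuming a COCO-style annotation JSON and assuming that images without a usable bounding box are dropped before sampling. The function name `count_usable_images` and the "width and height greater than 1" criterion are illustrative assumptions, not part of EPD or maskrcnn_benchmark.

```python
import json


def count_usable_images(annotation_path):
    """Return (total_images, images_with_usable_annotation) for a COCO-style JSON file.

    Hypothetical helper for debugging; the usability criterion below is an assumption.
    """
    with open(annotation_path) as f:
        coco = json.load(f)

    usable_ids = set()
    for ann in coco.get("annotations", []):
        bbox = ann.get("bbox", [0, 0, 0, 0])
        # COCO bbox layout is [x, y, width, height]; skip degenerate boxes.
        if len(bbox) == 4 and bbox[2] > 1 and bbox[3] > 1:
            usable_ids.add(ann["image_id"])

    image_ids = {img["id"] for img in coco.get("images", [])}
    return len(image_ids), len(image_ids & usable_ids)
```

If the second number is 0, an empty dataset would be consistent with the `num_samples=0` message; if it is positive, the data itself is less likely to be the problem.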

Expected Behaviour

The training is supposed to proceed without any errors.

Actual Behaviour

The training fails with the aforementioned error in the terminal.

Error Source

Currently, the integration of the .yaml parser in P3Trainer.py seems to be the root cause.

[ Update as of 20220812 ]: The integration of the parser is not the root cause, since the EPD v0.2.2 P3 training workflow fails as well. The cause is therefore most likely an as-yet-unidentified dependency conflict.

cardboardcode commented 2 years ago

Debug Update

Running the training workflow on the ros-industrial EPD v0.2.2 yields the same error message.

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-08-12 10:59:03,245 maskrcnn_benchmark.utils.miscellaneous INFO: Saving labels mapping into ./weights/custom/labels.json
Traceback (most recent call last):
  File "tools/train_net.py", line 201, in <module>
    main()
  File "tools/train_net.py", line 194, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 72, in train
    start_iter=arguments["iteration"],
  File "/home/cardboardvoice/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/data/build.py", line 164, in make_data_loader
    sampler = make_data_sampler(dataset, shuffle, is_distributed)
  File "/home/cardboardvoice/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/maskrcnn_benchmark-0.1-py3.6-linux-x86_64.egg/maskrcnn_benchmark/data/build.py", line 64, in make_data_sampler
    sampler = torch.utils.data.sampler.RandomSampler(dataset)
  File "/home/cardboardvoice/anaconda3/envs/p3_trainer/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

Conclusion

The integration of the .yaml parser is NOT the root cause.

The cause is therefore most likely an as-yet-unidentified dependency conflict. For a long-term solution to this problem, we will be looking at the progress made under #4 and #3.

cardboardcode commented 2 years ago

Debug Update

The conclusion from the previous Debug Update is further reinforced by the following test.

Test

The unreleased dockerized training proceeds without the aforementioned error, adhering to the expected behaviour.

The unreleased dockerized exporter proceeds without the aforementioned error as well.

Conclusion

The cause is yet another hidden dependency conflict, which no longer occurs once the workflow is dockerized.
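To narrow down which package differs between the failing conda environment and the working container, the output of `pip freeze` from each side can be compared. This is a rough sketch only; `parse_freeze` and `diff_versions` are hypothetical helper names, and lines that are not in `name==version` form (for example editable or egg installs) are simply ignored.

```python
def parse_freeze(text):
    """Parse `pip freeze` style lines ("name==version") into a dict."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        # Ignore comments and anything that is not a plain pinned requirement.
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            versions[name.lower()] = version
    return versions


def diff_versions(host, container):
    """Return {package: (host_version, container_version)} for every mismatch.

    A package missing on one side is reported with None for that side.
    """
    mismatches = {}
    for name in sorted(set(host) | set(container)):
        h, c = host.get(name), container.get(name)
        if h != c:
            mismatches[name] = (h, c)
    return mismatches
```

Feeding both `pip freeze` outputs through `parse_freeze` and then `diff_versions` gives a short list of candidate packages to inspect first.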

Aiming to close this issue under v0.3.0 - Minor Pull Request. Will link it here once the pull request is started.

cardboardcode commented 2 years ago

This issue is resolved with https://github.com/ros-industrial/easy_perception_deployment/pull/56. Closing.
