BazUCD commented 2 years ago

Hi, I am attempting to run the training script and generate the belief maps from train2/train.py in order to debug, but I am getting this error:

start: 18:18:30.781464
load data: ['/home/user/Downloads/Spanner2']
load data: training data: 2000 batches
load models
ready to train!
Traceback (most recent call last):
  File "train.py", line 606, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 422, in _runnetwork
    for batch_idx, targets in enumerate(train_loader):
  File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/user/.local/lib/python3.6/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/catkin_pcl_new/src/dope/scripts/train2/utils_dope.py", line 321, in __getitem__
    save=False,
  File "/home/user/catkin_pcl_new/src/dope/scripts/train2/utils_dope.py", line 593, in CreateBeliefMap
    p = [point[numb_point][1],point[numb_point][0]]
IndexError: list index out of range

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17249) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2022-04-06_18:18:39
  host : user-User
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 17249)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I am unsure what is causing this error, as I have the correct version of PyTorch installed based on requirements.txt. Are there any common mistakes I could be making?

TontonTremblay commented 2 years ago

Could you share an example of a JSON file you are using in your dataset? It looks like in

p = [point[numb_point][1],point[numb_point][0]]

point is empty or the dimensions are wrong. @mintar refactored the data format a little bit, and I did not check whether it is compatible with train2/train.py. I will try to check soon.
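One way to test this hypothesis before training is to scan an annotation file and count the 2D points per object. The sketch below is only an illustration: the top-level "objects" key, the default of 9 expected points, and the helper names check_annotation and check_file are assumptions, not part of the DOPE code base.

```python
import json


def check_annotation(data, expected_points=9):
    """Return a list of problems found in one parsed annotation dict."""
    problems = []
    # Assumption: objects live under a top-level "objects" list and each one
    # stores its 2D keypoints under "projected_cuboid".
    for i, obj in enumerate(data.get("objects", [])):
        points = obj.get("projected_cuboid", [])
        if len(points) != expected_points:
            problems.append(
                f"object {i}: {len(points)} points in projected_cuboid, "
                f"expected {expected_points}"
            )
        for j, p in enumerate(points):
            # Each keypoint should be an [x, y] pair.
            if len(p) != 2:
                problems.append(
                    f"object {i}, point {j}: expected [x, y], got {p!r}"
                )
    return problems


def check_file(path):
    """Load one JSON annotation file and report its problems."""
    with open(path) as f:
        return check_annotation(json.load(f))
```

Calling check_file on a single annotation (the path is whatever your dataset uses) returns an empty list when every object has the expected number of [x, y] pairs, and a list of messages otherwise.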

@BazUCD Did you try to use the original training script?

BazUCD commented 2 years ago

Hi @TontonTremblay, thanks for the quick reply. Here's an example of my .json files with the associated png as well: 000000 span1

I've used the original training script and generated some weights, but was unable to detect anything, so following your recommendation in #238 I have been trying to generate the belief maps using train2.

TontonTremblay commented 2 years ago

This looks correct, but your object has a symmetry in it. You should look into this section from Martin: https://github.com/NVlabs/Deep_Object_Pose/tree/master/scripts/nvisii_data_gen#handling-objects-with-symmetries

andrewyguo commented 2 years ago

I encountered a similar issue. The training script expects the "projected_cuboid" field to contain 9 points, the last one being the point under "projected_cuboid_centroid".

In your case, you can add something like projected_cuboid_keypoints.append(obj['projected_cuboid_centroid']) right below line 228 in utils_dope.py. I did this and it worked for me.
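As a rough, standalone sketch of that workaround (the function name keypoints_with_centroid is hypothetical, and the real loader in utils_dope.py may be structured differently), the idea is to append the centroid only when the cuboid has 8 points:

```python
def keypoints_with_centroid(obj):
    """Return the 2D cuboid keypoints, appending the centroid if it is missing.

    `obj` is one object entry from an annotation file, assumed to carry the
    "projected_cuboid" and "projected_cuboid_centroid" fields named above.
    """
    # Copy the points so the input dict is not modified.
    keypoints = [list(p) for p in obj["projected_cuboid"]]
    # Only 8 corners present: add the centroid as the 9th point.
    if len(keypoints) == 8:
        keypoints.append(list(obj["projected_cuboid_centroid"]))
    return keypoints
```

Guarding on the length keeps the function safe to run on annotations that already contain 9 points, so mixed datasets do not end up with a duplicated centroid.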

NVlabs / Deep_Object_Pose

Generating Belief Maps using train2/train.py #242
