NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

AttributeError: Caught AttributeError in DataLoader worker process 0. #255

Open Charan0502 opened 2 years ago

Charan0502 commented 2 years ago

When I try to run scripts/train2/train.py, I get an AttributeError:

!python -m torch.distributed.launch --nproc_per_node=1 /content/drive/MyDrive/Deep_Object_Pose/scripts/train2/train.py --network dope --epochs 2 --batchsize 1 --outf tmp/ --data /content/output/output_example/

Error:

start: 23:22:53.999588
load data: ['/content/output/output_example/']
load data:
training data: 15 batches
load models
ready to train!
Traceback (most recent call last):
  File "/content/drive/MyDrive/Deep_Object_Pose/scripts/train2/train.py", line 613, in <module>
    _runnetwork(epoch,trainingdata)
  File "/content/drive/MyDrive/Deep_Object_Pose/scripts/train2/train.py", line 429, in _runnetwork
    for batch_idx, targets in enumerate(train_loader):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/drive/MyDrive/Deep_Object_Pose/scripts/train2/utils_dope.py", line 266, in __getitem__
    keypoint_params=A.KeypointParams(format='xy',remove_invisible=False)
AttributeError: module 'albumentations' has no attribute 'KeypointParams'

Killing subprocess 565
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', '/content/drive/MyDrive/Deep_Object_Pose/scripts/train2/train.py', '--local_rank=0', '--network', 'dope', '--epochs', '2', '--batchsize', '1', '--outf', 'tmp/', '--data', '/content/output/output_example/']' returned non-zero exit status 1.

TontonTremblay commented 2 years ago

Not sure what is going on. It looks like an error during data augmentation. Can you share the data that is causing this?

Charan0502 commented 2 years ago

> not sure what is going on. Looks like an error when doing data augmentation, can you share the data you are using that is causing this?

output-20220525T090514Z-001.zip

TontonTremblay commented 2 years ago
Train Epoch: 1 [0/15 (0%)]  Loss: 0.040052685886621
Train Epoch: 2 [0/15 (0%)]  Loss: 0.019999586045742

It worked great on my machine, so I would think your installation is broken. Maybe some of the packages you are using, such as albumentations, need to be updated.
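One way to test that guess is to check which albumentations version the environment actually imports and whether it exposes `KeypointParams`, the attribute named in the traceback. The sketch below is only a diagnostic: the helper name `check_attribute` is hypothetical and not part of the DOPE scripts, and it assumes the failure comes from an outdated or shadowed package rather than from the data.

```python
import importlib


def check_attribute(module_name, attribute):
    """Return (version, has_attribute); (None, False) if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None, False
    # Not every module defines __version__, so fall back to a placeholder.
    version = getattr(module, "__version__", "unknown")
    return version, hasattr(module, attribute)


if __name__ == "__main__":
    version, available = check_attribute("albumentations", "KeypointParams")
    print("albumentations version:", version)
    print("KeypointParams available:", available)
    # If this prints False, upgrading the package (pip install -U albumentations)
    # and restarting the runtime is a reasonable next step to try.
```

Run it in the same interpreter that launches train.py, since a notebook runtime and a local install can resolve to different package versions.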

Charan0502 commented 2 years ago

Thank you, sir, my problem is fixed. Previously I was training in Colab; now that I have switched to my local system, it works.

Also, why is the loss so low in the first epoch?

TontonTremblay commented 2 years ago

Maybe because of the batch size; I am not sure.

Charan0502 commented 2 years ago

Sir, I created 2480 batches with a batch size of 10:

training data: 2480 batches
load models
ready to train!
Train Epoch: 1 [0/24800 (0%)] Loss: 0.105312615633011
Train Epoch: 1 [1000/24800 (4%)] Loss: 0.059226501733065

But I am still getting a very low loss value, and I don't know whether the results will be accurate after training. What would be the solution for this, sir?

TontonTremblay commented 2 years ago

Have you looked at TensorBoard?
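Alongside TensorBoard, the console output itself can be checked for a trend. The following is a minimal sketch that assumes the "Train Epoch" line format shown in this thread; `LOG_PATTERN` and `parse_losses` are hypothetical names, not part of the DOPE training scripts.

```python
import re

# Matches lines such as: Train Epoch: 1 [1000/24800 (4%)] Loss: 0.059226501733065
LOG_PATTERN = re.compile(
    r"Train Epoch:\s*(\d+)\s*\[(\d+)/(\d+)\s*\((\d+)%\)\]\s*Loss:\s*([0-9.eE+-]+)"
)


def parse_losses(log_text):
    """Return a list of (epoch, samples_seen, samples_total, loss) tuples."""
    records = []
    for epoch, seen, total, _percent, loss in LOG_PATTERN.findall(log_text):
        records.append((int(epoch), int(seen), int(total), float(loss)))
    return records


if __name__ == "__main__":
    sample = (
        "Train Epoch: 1 [0/24800 (0%)] Loss: 0.105312615633011\n"
        "Train Epoch: 1 [1000/24800 (4%)] Loss: 0.059226501733065\n"
    )
    for epoch, seen, total, loss in parse_losses(sample):
        print(f"epoch {epoch}: {seen}/{total} samples, loss {loss:.6f}")
```

A small absolute loss does not say much on its own; whether it keeps decreasing over many logged steps is the more useful signal.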


Zhang123qian commented 2 years ago

How did you solve this problem? I have a similar problem.