Open Charan0502 opened 2 years ago
Not sure what is going on. It looks like an error during data augmentation. Can you share the data that is causing this?
output-20220525T090514Z-001.zip
Train Epoch: 1 [0/15 (0%)] Loss: 0.040052685886621
Train Epoch: 2 [0/15 (0%)] Loss: 0.019999586045742
It worked fine on my machine, so I suspect your installation is broken. Some of the packages you are using, such as albumentations, may need updating.
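For anyone hitting an AttributeError of the form "module 'albumentations' has no attribute 'KeypointParams'", a quick check is to see which version is installed and whether it exposes that name. Below is a minimal sketch; `has_attribute` is a hypothetical helper written for this illustration, not part of DOPE or albumentations, and the upgrade hint assumes a pip-managed environment.

```python
import importlib


def has_attribute(module_name, attribute):
    """Return (ok, detail): whether `module_name` imports and exposes `attribute`.

    Hypothetical helper for illustration only.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return False, f"{module_name} is not importable: {exc}"
    # Not every module defines __version__, so fall back to "unknown".
    version = getattr(module, "__version__", "unknown")
    if hasattr(module, attribute):
        return True, f"{module_name} {version} provides {attribute}"
    return False, f"{module_name} {version} has no attribute {attribute}"


if __name__ == "__main__":
    ok, detail = has_attribute("albumentations", "KeypointParams")
    print(detail)
    if not ok:
        # Assumption: the package is managed with pip in this environment.
        print("Consider upgrading, e.g.: pip install -U albumentations")
```

Running this in the same environment that launches train.py (for example a Colab cell) should show whether the installed albumentations is the likely culprit.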
Thank you, sir, my problem is fixed. Previously I was training in Colab; after I switched to my local system, it works.
Also, why is the loss so low in the first epoch?
Maybe because of the batch size; I am not sure.
Sir, I created 2480 batches with batch size 10.
training data: 2480 batches
load models
ready to train!
Train Epoch: 1 [0/24800 (0%)] Loss: 0.105312615633011
Train Epoch: 1 [1000/24800 (4%)] Loss: 0.059226501733065
But I am still getting a very low loss value, and I don't know whether the result will be accurate after training. What would be the solution for this, sir?
Have you looked at TensorBoard?
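If TensorBoard is not at hand, a rough alternative is to parse the printed log lines and look at the trend rather than the absolute value, since the scale of a loss depends on how it is defined. The sketch below assumes the log format shown in this thread ("Train Epoch: N [seen/total (P%)] Loss: X"); `parse_losses` and `relative_change` are hypothetical helpers, not part of train.py.

```python
import re

# Matches lines such as: Train Epoch: 1 [1000/24800 (4%)] Loss: 0.0592...
LOG_PATTERN = re.compile(
    r"Train Epoch:\s*(\d+)\s*\[(\d+)/(\d+)\s*\((\d+)%\)\]\s*Loss:\s*([0-9.eE+-]+)"
)


def parse_losses(log_text):
    """Extract (epoch, samples_seen, loss) tuples from printed training output."""
    return [
        (int(m.group(1)), int(m.group(2)), float(m.group(5)))
        for m in LOG_PATTERN.finditer(log_text)
    ]


def relative_change(records):
    """Fractional change from the first to the last logged loss (negative = decreasing)."""
    first, last = records[0][2], records[-1][2]
    return (last - first) / first


# The two log entries quoted above in this thread.
log = (
    "Train Epoch: 1 [0/24800 (0%)] Loss: 0.105312615633011 "
    "Train Epoch: 1 [1000/24800 (4%)] Loss: 0.059226501733065"
)
records = parse_losses(log)
print(records)
print(f"relative change: {relative_change(records):+.1%}")
```

A loss that keeps dropping, as in these two entries, suggests training is progressing; whether the final poses are accurate still has to be checked by running inference on held-out images.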
When I try to run scripts/train2/train.py, I get an AttributeError.
!python -m torch.distributed.launch --nproc_per_node=1 /content/drive/MyDrive/Deep_Object_Pose/scripts/train2/train.py --network dope --epochs 2 --batchsize 1 --outf tmp/ --data /content/output/output_example/
Error:
start: 23:22:53.999588
load data: ['/content/output/output_example/']
load data:
training data: 15 batches
load models
ready to train!
Traceback (most recent call last):
  File "/content/drive/MyDrive/Deep_Object_Pose/scripts/train2/train.py", line 613, in <module>
    _runnetwork(epoch,trainingdata)
  File "/content/drive/MyDrive/Deep_Object_Pose/scripts/train2/train.py", line 429, in _runnetwork
    for batch_idx, targets in enumerate(train_loader):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/drive/MyDrive/Deep_Object_Pose/scripts/train2/utils_dope.py", line 266, in __getitem__
    keypoint_params=A.KeypointParams(format='xy',remove_invisible=False)
AttributeError: module 'albumentations' has no attribute 'KeypointParams'

Killing subprocess 565
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', '/content/drive/MyDrive/Deep_Object_Pose/scripts/train2/train.py', '--local_rank=0', '--network', 'dope', '--epochs', '2', '--batchsize', '1', '--outf', 'tmp/', '--data', '/content/output/output_example/']' returned non-zero exit status 1.
How did you solve this problem? I had a similar problem.