zbarry opened this issue 4 months ago
Also running out of memory with yogo test
- I guess this is a batch size thing? (though it's not possible to specify one with the test command). Curious that it was able to train just fine with the given configuration, though!
loading dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.02s/it]
loading test dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8.32it/s]
Traceback (most recent call last):
File "/opt/conda/bin/yogo", line 8, in <module>
sys.exit(main())
File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
do_model_test(args)
File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
test_model(args)
File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
test_metrics = Trainer.test(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zachary/yogo/yogo/train.py", line 493, in test
test_metrics.update(outputs.detach(), labels.detach())
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zachary/yogo/yogo/metrics.py", line 157, in update
self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
m.update(*args, **m_kwargs)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
raise err
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
update(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
state = _multiclass_precision_recall_curve_update(
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
return update_fn(preds, target, num_classes, thresholds)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 425.94 GiB (GPU 0; 14.58 GiB total capacity; 309.89 MiB already allocated; 12.92 GiB free; 1.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Interestingly, when I lower the batch size in the test_model.test_model function to 4, it errors slightly differently:
loading dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.02s/it]
loading test dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8.11it/s]
Traceback (most recent call last):
File "/opt/conda/bin/yogo", line 8, in <module>
sys.exit(main())
File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
do_model_test(args)
File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
test_model(args)
File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
test_metrics = Trainer.test(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zachary/yogo/yogo/train.py", line 493, in test
test_metrics.update(outputs.detach(), labels.detach())
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zachary/yogo/yogo/metrics.py", line 157, in update
self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
m.update(*args, **m_kwargs)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
raise err
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
update(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
state = _multiclass_precision_recall_curve_update(
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
return update_fn(preds, target, num_classes, thresholds)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 425.94 GiB (GPU 0; 14.58 GiB total capacity; 309.89 MiB already allocated; 12.92 GiB free; 1.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Created a branch for this, thanks @zbarry!
First two bugs are easy fixes, but I also ran into an error when running yogo test <path to pth> <path to dataset defn>:
File "/home/paul.lebel/Documents/github/yogo/yogo/data/yogo_dataloader.py", line 215, in get_dataloader
rank = torch.distributed.get_rank()
File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1746, in get_rank
default_pg = _get_default_group()
File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
@Axel-Jacobsen, should it be checking for ValueError here instead? The PyTorch docs do say it should be a RuntimeError, though...
Of course, the base problem is that init_process_group is never called in test_model.py?
PyTorch definitely raises ValueError here 🤔
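For what it's worth, a guard that doesn't depend on the exception type at all may be cleaner (a sketch, assuming get_dataloader only needs the rank to decide how to shard):

```python
import torch.distributed as dist

# Only ask for a rank when a process group actually exists;
# single-process runs like `yogo test` fall back to rank 0.
rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
```

That sidesteps the ValueError-vs-RuntimeError question entirely, since the docs and the actual behavior apparently disagree.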
Update on the CUDA OOM error - I lowered my test dataset size to just 10 images, down from ~6k, and I'm still getting OOMs during the torchmetrics calculation:
Traceback (most recent call last):
File "/opt/conda/bin/yogo", line 8, in <module>
sys.exit(main())
File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
do_model_test(args)
File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
test_model(args)
File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
test_metrics = Trainer.test(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zachary/yogo/yogo/train.py", line 493, in test
test_metrics.update(outputs.detach(), labels.detach())
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zachary/yogo/yogo/metrics.py", line 158, in update
self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
m.update(*args, **m_kwargs)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
raise err
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
update(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
state = _multiclass_precision_recall_curve_update(
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
return update_fn(preds, target, num_classes, thresholds)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.02 GiB (GPU 0; 14.58 GiB total capacity; 29.48 MiB already allocated; 14.28 GiB free; 56.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Adding print(fps, fls, fps.shape, fls.shape) just before self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long()) in yogo/metrics.py:
tensor([[ 0.7696, 0.6382, 0.8076, ..., 0.8735, 2.6758, -2.5820],
[ 0.4599, 0.7504, 0.4981, ..., 0.8848, 2.7930, -2.8516],
[ 0.0086, 0.0440, 0.0473, ..., 0.6221, 2.8320, -3.1641],
...,
[ 0.9490, 0.9010, 0.9870, ..., 0.8540, 2.5430, -2.8633],
[ 0.2406, 0.9212, 0.2791, ..., 0.5068, 2.4668, -2.8008],
[ 0.3581, 0.9416, 0.3961, ..., 0.6982, 2.5469, -3.0234]],
device='cuda:0') tensor([[1.0000, 0.8255, 0.0111, 0.8634, 0.0491, 0.0000],
[1.0000, 0.4667, 0.0153, 0.5046, 0.0532, 0.0000],
[1.0000, 0.0097, 0.0431, 0.0477, 0.0810, 0.0000],
...,
[1.0000, 0.9481, 0.9023, 0.9861, 0.9403, 0.0000],
[1.0000, 0.2449, 0.9204, 0.2829, 0.9583, 0.0000],
[1.0000, 0.3593, 0.9398, 0.3972, 0.9778, 0.0000]], device='cuda:0') torch.Size([873, 7]) torch.Size([873, 6])
Back from vacation - addressing this now!
OK, looks like an issue w/ torchmetrics. I've found them to be finicky.
@zbarry, it would be very helpful to know some characteristics of your dataset - in #150, you mention your images are 512x512. Roughly how many objects per image are you expecting?
Also, how many classes do you have?
If you could post a link to download your dataset, if it's public, that would be very helpful 😁
I have a hunch that the multiclass precision recall metrics are just making a tonne of bins, requiring a tonne of memory. From the traceback above,
...
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
return update_fn(preds, target, num_classes, thresholds)
File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
...
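A back-of-envelope check supports this. The mesh in _bincount is a 2-D int64 tensor with one row per entry of the flattened unique_mapping and minlength columns; if unique_mapping has one entry per (sample, class, threshold) triple, the sizes below follow. n_samples=873 and num_classes=2 are read off the shapes printed earlier (fps[:, 5:] leaves 2 class columns); len_t=500 is an assumed threshold count:

```python
# Rough estimate of the mesh allocated in torchmetrics' _bincount fallback.
# n_samples=873 and num_classes=2 come from the shapes printed above;
# len_t=500 is an assumption.
n_samples, num_classes, len_t = 873, 2, 500

rows = n_samples * num_classes * len_t   # flattened unique_mapping
minlength = 4 * num_classes * len_t      # bincount bins
gib = rows * minlength * 8 / 2**30       # int64 elements -> GiB
print(f"{gib:.2f} GiB")                  # ~26.02 GiB
```

That lands exactly on the 26.02 GiB allocation in the last traceback, which is decent evidence the threshold count is the culprit.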
I'm fixing these at 500 in add-multiclass-pr-thresholds-limit. Perhaps that'll fix it?
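For reference, the cap is just the thresholds argument on the torchmetrics classes (a sketch; how yogo actually wires this up in metrics.py may differ, and num_classes=2 is an assumption from the shapes above):

```python
from torchmetrics.classification import MulticlassPrecisionRecallCurve

# thresholds=500 bins the curve at 500 evenly spaced points instead of
# keeping one threshold per unique prediction value.
pr = MulticlassPrecisionRecallCurve(num_classes=2, thresholds=500)
```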
Hey @Axel-Jacobsen - thanks for following up! I will get around to this more completely tomorrow (before I myself head out on vacation, haha), but for now at least some answers:
Sweet! Thank you. I'm waiting for a GPU rental service to approve me 🙄 so I'm a bit delayed w/ reproducing issues. But! Hopefully I'll be able to finally start fixing these issues soon.
Hi! I ended up figuring out what was happening here - there were too many thresholds on MulticlassROC, which was causing a huge increase in memory consumption. I reduced it way down, and that OOM issue went away.
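That tracks with the estimate above: the threshold count appears in both dimensions of the _bincount mesh, so memory grows roughly with its square, and reducing it way down (rather than just capping it) is what shrinks the allocation. A hedged illustration with the same assumed numbers:

```python
# len_t appears in both the rows and columns of the mesh, so memory ~ len_t**2.
for len_t in (500, 100, 50):
    gib = (873 * 2 * len_t) * (4 * 2 * len_t) * 8 / 2**30
    print(len_t, f"{gib:.2f} GiB")  # 26.02, 1.04, 0.26
```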
- Need to manually specify --wandb to avoid an error (this is fixed by force-including --wandb in the run command).
- Models trained without normalization error out during test: I think this line should be replaced with normalize_images=cfg.get("normalize_images", False) (see the sketch below).
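A sketch of that fallback; the checkpoint layout here is hypothetical, but the key name comes from the line above:

```python
import torch

# Hypothetical: pull the training config dict out of the saved checkpoint.
cfg = torch.load("path/to/model.pth", map_location="cpu")["model_config"]

# Checkpoints trained before normalization was added won't carry the key,
# so default to False instead of raising KeyError on cfg["normalize_images"].
normalize_images = cfg.get("normalize_images", False)
```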