czbiohub-sf / yogo

The "you only glance once" object detection model
BSD 3-Clause "New" or "Revised" License

Issues when running `yogo test` #153

Open zbarry opened 4 months ago

zbarry commented 4 months ago

Need to manually specify --wandb to avoid an error:


Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 43, in test_model
    log_to_wandb = args.wandb or len(args.wandb_resume_id) > 0
TypeError: object of type 'NoneType' has no len()

(this can be worked around by explicitly including --wandb in the run command)
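For illustration, a minimal sketch of one possible guard (not necessarily the fix that gets merged), assuming args.wandb_resume_id defaults to None when the flag is omitted:

# Hypothetical sketch of a more defensive version of the failing line in
# test_model.py: treat a missing --wandb-resume-id as falsy instead of
# calling len() on None.
log_to_wandb = bool(args.wandb) or bool(args.wandb_resume_id)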

Models trained without normalization error out during test:

~~wandb snip~~
Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 71, in test_model
    normalize_images=cfg["normalize_images"],
KeyError: 'normalize_images'

I think this line should be replaced with normalize_images=cfg.get("normalize_images", False).
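For reference, a tiny sketch of why .get avoids the KeyError (these config dicts are made up, not the actual checkpoint contents):

# Illustrative only: an older config without the key vs. a newer one with it.
old_cfg = {"learning_rate": 3e-4}
new_cfg = {"learning_rate": 3e-4, "normalize_images": True}
print(old_cfg.get("normalize_images", False))  # False instead of raising KeyError
print(new_cfg.get("normalize_images", False))  # True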

zbarry commented 4 months ago

Also running out of memory with yogo test - I guess this is a batch-size thing? (though it's not possible to specify one with the test command). Curious that it was able to train just fine with the given configuration, though!

loading dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.02s/it]
loading test dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.32it/s]
Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
    test_metrics = Trainer.test(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/train.py", line 493, in test
    test_metrics.update(outputs.detach(), labels.detach())
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/metrics.py", line 157, in update
    self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
    m.update(*args, **m_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
    raise err
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
    update(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
    state = _multiclass_precision_recall_curve_update(
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
    return update_fn(preds, target, num_classes, thresholds)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
    bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
    mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 425.94 GiB (GPU 0; 14.58 GiB total capacity; 309.89 MiB already allocated; 12.92 GiB free; 1.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
zbarry commented 4 months ago

Interestingly, when I lower the batch size in the test_model.test_model function to 4, it errors slightly differently:

loading dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.02s/it]
loading test dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.11it/s]
Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
    test_metrics = Trainer.test(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/train.py", line 493, in test
    test_metrics.update(outputs.detach(), labels.detach())
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/metrics.py", line 157, in update
    self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
    m.update(*args, **m_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
    raise err
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
    update(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
    state = _multiclass_precision_recall_curve_update(
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
    return update_fn(preds, target, num_classes, thresholds)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
    bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
    mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 425.94 GiB (GPU 0; 14.58 GiB total capacity; 309.89 MiB already allocated; 12.92 GiB free; 1.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
paul-lebel commented 4 months ago

Created a branch for this, thanks @zbarry!

The first two bugs are easy fixes, but I also ran into an error when running yogo test <path to pth> <path to dataset defn>:

 File "/home/paul.lebel/Documents/github/yogo/yogo/data/yogo_dataloader.py", line 215, in get_dataloader
    rank = torch.distributed.get_rank()
  File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1746, in get_rank
    default_pg = _get_default_group()
  File "/home/paul.lebel/.conda/envs/yogo_conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

@Axel-Jacobsen, should it be checking for ValueError here instead? The PyTorch docs do say it should be a RuntimeError, though...

Of course, the root problem is that init_process_group is never called in test_model.py in the first place, right?
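One possible defensive pattern for the dataloader (just a sketch, not necessarily how we'd want to fix it): only ask for the rank when a process group actually exists, and fall back to single-process defaults otherwise.

# Sketch only: use rank 0 / world size 1 when torch.distributed has not been
# initialized (e.g. when running yogo test outside of DDP).
import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
else:
    rank = 0
    world_size = 1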

paul-lebel commented 4 months ago

PyTorch definitely raises ValueError here 🤔

zbarry commented 4 months ago

Update on the CUDA OOM error - I lowered my test dataset size to just 10 images down from ~6k, and I'm still getting OOMs during the torchmetrics calculation:

Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 18, in main
    do_model_test(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 116, in do_model_test
    test_model(args)
  File "/home/zachary/yogo/yogo/utils/test_model.py", line 90, in test_model
    test_metrics = Trainer.test(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/train.py", line 493, in test
    test_metrics.update(outputs.detach(), labels.detach())
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/metrics.py", line 158, in update
    self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long())
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/collections.py", line 220, in update
    m.update(*args, **m_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 492, in wrapped_func
    raise err
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/metric.py", line 482, in wrapped_func
    update(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/classification/precision_recall_curve.py", line 368, in update
    state = _multiclass_precision_recall_curve_update(
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
    return update_fn(preds, target, num_classes, thresholds)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
    bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 204, in _bincount
    mesh = torch.arange(minlength, device=x.device).repeat(len(x), 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.02 GiB (GPU 0; 14.58 GiB total capacity; 29.48 MiB already allocated; 14.28 GiB free; 56.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Adding print(fps, fls, fps.shape, fls.shape) just before self.prediction_metrics.update(fps[:, 5:], fls[:, 5:].squeeze().long()) in yogo/metrics.py:

tensor([[ 0.7696,  0.6382,  0.8076,  ...,  0.8735,  2.6758, -2.5820],
        [ 0.4599,  0.7504,  0.4981,  ...,  0.8848,  2.7930, -2.8516],
        [ 0.0086,  0.0440,  0.0473,  ...,  0.6221,  2.8320, -3.1641],
        ...,
        [ 0.9490,  0.9010,  0.9870,  ...,  0.8540,  2.5430, -2.8633],
        [ 0.2406,  0.9212,  0.2791,  ...,  0.5068,  2.4668, -2.8008],
        [ 0.3581,  0.9416,  0.3961,  ...,  0.6982,  2.5469, -3.0234]],
       device='cuda:0') tensor([[1.0000, 0.8255, 0.0111, 0.8634, 0.0491, 0.0000],
        [1.0000, 0.4667, 0.0153, 0.5046, 0.0532, 0.0000],
        [1.0000, 0.0097, 0.0431, 0.0477, 0.0810, 0.0000],
        ...,
        [1.0000, 0.9481, 0.9023, 0.9861, 0.9403, 0.0000],
        [1.0000, 0.2449, 0.9204, 0.2829, 0.9583, 0.0000],
        [1.0000, 0.3593, 0.9398, 0.3972, 0.9778, 0.0000]], device='cuda:0') torch.Size([873, 7]) torch.Size([873, 6])
Axel-Jacobsen commented 4 months ago

Back from vacation - addressing this now!

Axel-Jacobsen commented 4 months ago

OK, looks like an issue w/ torchmetrics. I've found them to be finicky.

@zbarry, it would be very helpful to know some characteristics of your dataset - in #150, you mention your images are 512x512. Roughly how many objects per image are you expecting?

Axel-Jacobsen commented 4 months ago

Also, how many classes do you have?

Axel-Jacobsen commented 4 months ago

If you could post a link to download your dataset, if it's public, that would be very helpful 😁

Axel-Jacobsen commented 4 months ago

I have a hunch that the multiclass precision recall metrics are just making a tonne of bins, requiring a tonne of memory. From the traceback above,

...
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 486, in _multiclass_precision_recall_curve_update
    return update_fn(preds, target, num_classes, thresholds)
  File "/opt/conda/lib/python3.10/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 507, in _multiclass_precision_recall_curve_update_vectorized
    bins = _bincount(unique_mapping.flatten(), minlength=4 * num_classes * len_t)
...

I'm capping these at 500 in add-multiclass-pr-thresholds-limit. Perhaps that'll fix it?
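To get a feel for how quickly that mesh grows, here's a back-of-the-envelope sketch; the numbers are made-up assumptions, not zbarry's dataset, and it assumes unique_mapping has one entry per (prediction, class, threshold) triple as in the vectorized update shown above.

# Illustrative arithmetic only: the mesh built by _bincount is an int64 tensor
# of shape (len(unique_mapping.flatten()), 4 * num_classes * len_t), so its
# size grows quadratically in both the class count and the threshold count.
def mesh_gib(num_preds: int, num_classes: int, len_t: int) -> float:
    flat_len = num_preds * num_classes * len_t   # assumed len(unique_mapping.flatten())
    minlength = 4 * num_classes * len_t
    return flat_len * minlength * 8 / 1024**3    # 8 bytes per int64 element

print(mesh_gib(num_preds=3_600, num_classes=4, len_t=500))  # ~429 GiB
print(mesh_gib(num_preds=3_600, num_classes=4, len_t=50))   # ~4.3 GiB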

zbarry commented 3 months ago

Hey @Axel-Jacobsen - thanks for following up! I will get around to this more completely tomorrow (before I myself head out on vacation, haha), but for now at least some answers:

Axel-Jacobsen commented 3 months ago

Sweet! Thank you. I'm waiting for a GPU rental service to approve me 🙄 so I'm a bit delayed w/ reproducing issues. But! Hopefully I'll be able to finally start fixing these issues soon.

zbarry commented 3 months ago

Hi! I ended up figuring out what was happening here - there were too many thresholds on MulticlassROC, which was causing a huge increase in memory consumption. I reduced the threshold count way down, and the OOM issue went away.
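For anyone hitting the same thing, a minimal sketch of the kind of change I mean (the class and threshold counts are placeholders, not my actual values):

# Illustrative only: a small integer `thresholds` keeps len_t low, which shrinks
# both the binned metric state and the per-update _bincount mesh.
from torchmetrics.classification import MulticlassPrecisionRecallCurve, MulticlassROC

num_classes = 4  # placeholder; use your dataset's class count
roc = MulticlassROC(num_classes=num_classes, thresholds=50)
pr_curve = MulticlassPrecisionRecallCurve(num_classes=num_classes, thresholds=50)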