albumentations-team / autoalbument

AutoML for image augmentation. AutoAlbument uses the Faster AutoAugment algorithm to find optimal augmentation policies. Documentation - https://albumentations.ai/docs/autoalbument/
MIT License

Performance seems to be very low #33

Open scribblepad opened 2 years ago

scribblepad commented 2 years ago

I'm trying to explore the use of AutoAlbument for a semantic segmentation task with the default generated search.yaml. The custom dataset has around 29,000 RGB images and corresponding masks (height x width: 512 x 512). I'm running it on a single A100 GPU with a maximum batch size of 8; I could only fit that much in memory without OOM errors. The GPU appears to be utilized fine, with utilization fluctuating between 35% and 100% (35% -> 75% -> 99% -> 100%).

Issue

The approximate time required to complete autoalbument-search looks to be close to 5 days (for 20 epochs) based on the output below, which seems too high. Is there a more optimized way to obtain the augmentation policies generated by AutoAlbument? Running it for 5 continuous days is too expensive.

Current output of autoalbument-search: [screenshot]
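As a rough sanity check of that figure (the per-iteration time below is an assumption; the actual value would come from the progress bar in the screenshot):

```python
# Back-of-envelope estimate for autoalbument-search on this dataset.
images, batch_size, epochs = 29_000, 8, 20
iters_per_epoch = images // batch_size    # ~3,625 iterations with drop_last
seconds_per_iter = 6.0                    # assumed value, not taken from the screenshot
total_days = iters_per_epoch * epochs * seconds_per_iter / 86_400
print(f"{iters_per_epoch} it/epoch -> ~{total_days:.1f} days for {epochs} epochs")  # ~5.0 days
```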

Segments from search.yaml:

architecture: Unet
encoder_architecture: resnet18
pretrained: true

dataloader:
  _target_: torch.utils.data.DataLoader
  batch_size: 8
  shuffle: true
  num_workers: 16
  pin_memory: true
  drop_last: true
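For reference, this dataloader section is essentially the config-file form of a plain torch.utils.data.DataLoader call; a minimal sketch of the equivalent Python, with a synthetic stand-in dataset (the real one is whatever the search config points at):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset of random 512x512 "images" and "masks" just so the snippet runs;
# in the real search this is the dataset class referenced by search.yaml.
dataset = TensorDataset(
    torch.rand(16, 3, 512, 512),
    torch.zeros(16, 512, 512, dtype=torch.long),
)

loader = DataLoader(
    dataset,
    batch_size=8,      # limited by A100 memory at 512x512
    shuffle=True,
    num_workers=16,
    pin_memory=True,
    drop_last=True,
)
images, masks = next(iter(loader))  # one batch of 8
```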

ihamdi commented 2 years ago

Same here. Trying the cifar10 example and it's taking at least 5 s/iteration; 390 iterations/epoch means about half an hour per epoch. I can only imagine how slow it will be if I try to use it on my x-ray classification task with 6,000 high-res images.

I'm going to wait just to see the result out of curiosity, but otherwise it's not usable. I'll look into Faster AutoAugment, which this is based on, or even the older RandAugment or AutoAugment.

ihamdi commented 2 years ago

Running the same cifar10 example with a batch size of 128 on an RTX2070 (6GB) takes the same amount of time per iteration as a batch size of 640 on a 24 GB RTX3090. I think there's something in their code limiting how quickly the iterations happen.
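One generic way to check whether the data pipeline or the GPU step is the limiting factor is to time the two separately; a minimal sketch with a placeholder dataset and model (not tied to AutoAlbument's internals):

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder data and model standing in for the real search setup.
dataset = TensorDataset(torch.rand(2048, 3, 32, 32), torch.randint(0, 10, (2048,)))
loader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=4, pin_memory=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

data_time = step_time = 0.0
end = time.perf_counter()
for x, y in loader:
    t0 = time.perf_counter()
    data_time += t0 - end                 # time spent waiting on the DataLoader (CPU side)
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    if device == "cuda":
        torch.cuda.synchronize()          # so GPU work is counted in the step time
    end = time.perf_counter()
    step_time += end - t0
print(f"data wait: {data_time:.1f}s, train step: {step_time:.1f}s per epoch")
```

If the data-wait time dominates, the bottleneck is on the CPU/augmentation side rather than the GPU, which would match the observation that batch size and GPU model barely change the iteration time.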

siddagra commented 8 months ago

I think not being able to use AMP (even if you try to use it, you get an error on Albumentations' end) might be hurting performance and batch size. Also, perhaps PyTorch Lightning's manual optimization mode is just a lot slower; that needs testing. It also seems to be doing a generative step and a discriminative step, so perhaps that slows it down a bit, but it really shouldn't slow it down too much, since inference is much faster than training because gradients are not required. Lastly, the model size does not seem to matter; speed is similar regardless, which suggests a CPU bottleneck. I'm not sure why one would get such a bottleneck with pytorch-lightning.
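For reference, this is roughly how AMP is usually enabled in a hand-written PyTorch training step (a generic sketch, not AutoAlbument's code; as noted above, trying this inside AutoAlbument reportedly errors out on Albumentations' end):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, optimizer, images, targets, loss_fn):
    """Mixed-precision forward/backward: roughly halves activation memory and
    can speed up the matmul-heavy parts on an A100 or RTX card."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(images), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```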

I'll similarly explore a bit and see if I can pin down the exact cause; otherwise I'll switch to some of the better, more recent methods (DADA, Adv AA, the official Faster AutoAugment, RandAugment search).

Maybe I can gain some insight on how to train/search policies in general using Albumentations. Until now I had to edit each augmentation by hand to try something like RandAugment (varying the magnitude); it would be nice to have a policy parser so I could generate policies programmatically and vary the magnitude.
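A minimal sketch of that kind of programmatic, RandAugment-style policy builder using standard Albumentations transforms (the transform list and magnitude scaling here are made up for illustration; this is not an existing AutoAlbument or Albumentations API):

```python
import albumentations as A

def rand_augment(num_ops: int = 2, magnitude: float = 0.5) -> A.Compose:
    """Apply `num_ops` randomly chosen transforms per image, with strengths
    scaled by a single `magnitude` in [0, 1], RandAugment-style."""
    m = magnitude
    ops = [
        A.Rotate(limit=int(30 * m), p=1.0),
        A.Affine(translate_percent=0.2 * m, shear=15 * m, p=1.0),
        A.RandomBrightnessContrast(brightness_limit=0.4 * m, contrast_limit=0.4 * m, p=1.0),
        A.Posterize(num_bits=max(1, 8 - int(4 * m)), p=1.0),
        A.Solarize(threshold=int(255 * (1 - m)), p=1.0),
        A.Equalize(p=1.0),
    ]
    return A.Compose([A.SomeOf(ops, n=num_ops, replace=False, p=1.0)])

# Sweep magnitudes instead of hand-editing each transform.
policies = {m: rand_augment(num_ops=2, magnitude=m) for m in (0.1, 0.3, 0.5, 0.7, 0.9)}
```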

saigontrade88 commented 4 months ago

You can reduce the training set size to 4,000 as the authors of Faster AA show in their paper.
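For anyone wanting to try that, a minimal sketch with a plain PyTorch Subset (the function name is just for illustration, not part of AutoAlbument):

```python
import torch
from torch.utils.data import Dataset, Subset

def subsample(dataset: Dataset, size: int = 4000, seed: int = 42) -> Subset:
    """Random subset of `size` samples to feed the policy search,
    following the reduced-training-set setup from the Faster AutoAugment paper."""
    g = torch.Generator().manual_seed(seed)
    indices = torch.randperm(len(dataset), generator=g)[:size].tolist()
    return Subset(dataset, indices)
```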