fizyr / keras-retinanet

Keras implementation of RetinaNet object detection.
Apache License 2.0

Confusion about steps, batch, and epoch calculation for my GPU #1350

Closed kunnareekr closed 4 years ago

kunnareekr commented 4 years ago

I am very new to this. I am using an 11 GB GPU (RTX 2080 Ti). I have 300 images with about 9,100 bounding boxes. Image size = RGB 2000×2000.

I am running RetinaNet with ImageNet pretrained weights and also from scratch for comparison, so there are about 36M trainable parameters. I want to test for 50 epochs with batch sizes 8/16/32. In my research I have to test it with different datasets and batch sizes.

From my understanding: batch size is limited by GPU memory, steps per epoch can be calculated as number of samples / batch size, and the number of epochs can be set freely. (Please kindly tell me if this is wrong.)
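For concreteness, this is my understanding as a quick Python sketch (using my own 300 images; ceil rounds up the final partial batch):

```python
import math

n_images = 300                          # assuming samples = images (see question 1)
for batch_size in (8, 16, 32):
    steps_per_epoch = math.ceil(n_images / batch_size)
    print(batch_size, steps_per_epoch)  # 8 -> 38, 16 -> 19, 32 -> 10
```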

My questions and points of confusion are the following: 1. In RetinaNet and in basic theory, which one should be the sample size: the number of images or the number of bounding boxes?

2. Should the step size come from the equation above, or is there another technique to set this number? E.g. if I have 1000 samples and want to use batch size 8, the step size to feed all images in one epoch is 1000 / 8 = 125. But I found that some people use a step size of 1000 even when they have smaller or bigger sample counts. Which one is correct? This makes me very confused.

3. From my experiments, I found that the number of parameters, steps per epoch, and batch size can make the GPU run out of memory; I think image size matters too.

My GPU (11 GB):

tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.21GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Colab Tesla P100 (16 GB):

tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 712.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Given the huge image size in my training dataset, do you think it is strange or normal that the maximum batch size I can use is 7?

4. Please kindly share your experience: how big is your training dataset and image size, and what batch/steps/epochs did you use?

5. Should I rent a GPU or cloud instance if 11 GB is not enough after I resize the images? By the way, I have no money... I am using a GPU from my lab/university.

6. Is it possible that I made a mistake in the code or configuration, so that I can only use a small batch size? But I didn't adjust anything; I just used the default scripts.

These are my training arguments: --random-transform --weights pretrained_model.h5 --batch-size 16 --steps 568 --epochs 50 --compute-val-loss csv training.csv class.csv val_annotation.csv

In this article, the author noted that "RetinaNet is heavy on computation. It will require at least 7–8 GB of GPU memory for a batch size of 4 (224x224) images"; he used 3,780 images.

So my image size may be why the GPU ran out of memory...

P.S.

Thank you very much.

@hgaiser

kunnareekr commented 4 years ago

OMG, it turned out very long. I am sorry for my long list of questions.

abhishek1222017 commented 4 years ago

I have the same type of question. I have 7,900 images in total, with 235,000 training labels. My GPU supports a maximum batch size of 2. If I calculate the step size, it comes to ~165,000, and one epoch alone will take approximately 23 hours to complete. What shall I do?

kunnareekr commented 4 years ago

@abhishek1222017 What about your image size and GPU memory?

If the images are small and you have a limited GPU, you can try running on Colab, refreshing until you get a Tesla P100 16 GB.

I am also waiting for some expert to clearly answer this.

I will try resizing my images and then update again.

abhishek1222017 commented 4 years ago

Thank you for your suggestion. But I am now working from home, so it is not possible to upload 25 GB of data to Google Drive. If I run with a step size of 10k, will it affect the performance of the model?


pleaseRedo commented 4 years ago

Both image_size and batch_size can cause OOM. I can give you my setup: a GTX 1080 Ti (11 GB RAM) with batch = 4 and image_size = 512.

step_size is used by the generator. An example to better understand it: with 10k samples (number of images) and batch_size = 10, steps = 1000 means each epoch goes through your whole dataset. If you have a smaller step size like steps = 100, then it takes 10 epochs to run through the whole dataset. After each epoch, the generator won't reset; it will keep feeding your network with the leftovers. To be clear, the generator doesn't know the start/end of your dataset; it forms an endless sequence until you want it to stop.
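Roughly like this, as a toy sketch (not the repo's actual generator code, just the behavior I described):

```python
import itertools

def endless_batches(samples, batch_size):
    """Feed batches forever; epoch boundaries don't reset anything."""
    stream = itertools.cycle(samples)       # endless sequence of samples
    while True:
        yield [next(stream) for _ in range(batch_size)]

# 10k samples, batch_size = 10, but steps_per_epoch = 100:
gen = endless_batches(list(range(10_000)), batch_size=10)
for step in range(100):                     # one "epoch"
    batch = next(gen)
print(batch[-1])  # 999: the epoch stopped a tenth of the way through
```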

The benefit of having small steps (personally) is that I can check network performance on the validation set more frequently. It is also a no-brainer setup that I don't really have to think about, as long as I train for a large number of epochs.

kunnareekr commented 4 years ago

@pleaseRedo

Thank you very much for your kind support and answers. Now I am sure that the sample size is the number of images, not the number of annotations. @abhishek1222017 you have to change your step size calculation too.

With this RetinaNet repository, can we use the generator automatically just by setting a small step size and more epochs, or do we have to modify some parts of the code as well? I am not sure whether shuffle, --workers, and --multiprocessing are related or not.

Please kindly suggest and give me some examples.

P.S. I will resize the images in my dataset and test again; right now I am re-checking the bboxes.

pleaseRedo commented 4 years ago

By using this RetinaNet repository, can we use the generator automatically by setting up a small step size and more epochs?

Of course you can; setting those arguments is sufficient to get the generator working. I see many people (including myself) get confused at the beginning because they think num_steps = num_samples / batch_size. The equation itself is correct, but people misinterpret it by assuming num_samples has to be your dataset size, or that each epoch has to run through the whole dataset. In fact, the generator that comes with this RetinaNet implementation lets the amount of data seen per epoch be whatever steps × batch implies, so steps and batch are no longer bound to num_samples and can be any values you want; the traversal of the dataset is handled by the generator internally. So no need to worry.

If I have 1000 samples, I want to use batch 8 so step size to feed all images in one epoch is 1000 / 8 = 125. I found that some people use step size = 1000

[1000 samples, batch 8, 125 steps] as computed by the equation is totally OK as long as GPU memory is enough. If you set steps to 1000, that also works, I guess? Each epoch will then iterate through your dataset 8 times (could be wrong, I haven't verified this). In your case, if you want a batch size of 7, everything still works, but keep in mind that each epoch now uses 875 images, so the following epoch will use images 876-1000 first and then images 1-875 of the second pass.
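A quick back-of-the-envelope check of that wrap-around (keeping the 125 steps from the batch-8 calculation, and assuming the generator simply keeps cycling in order):

```python
n_images, batch, steps = 1000, 7, 125        # 7 * 125 = 875 images per epoch
offset = 0
for epoch in range(3):
    first = offset % n_images
    last = (offset + batch * steps - 1) % n_images
    print(f"epoch {epoch}: images {first}..{last}")
    offset += batch * steps
# epoch 0: images 0..874
# epoch 1: images 875..749   (wraps past the end of the dataset)
# epoch 2: images 750..624
```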

--workers and --multiprocessing are used to speed up training; I found my training was ridiculously slow without these two turned on. They don't have anything to do with how the generator walks through your data.

kunnareekr commented 4 years ago

@pleaseRedo

Thank you very much for the answers. I really appreciate it.

Now I understand how we can set up the parameters, although the choice should still be grounded in the basics.

To evaluate a step size higher than sample size / batch size, I have to test the accuracy to see the difference. For now, I tested with the same dataset and batch size but different step sizes:

  1. 300 images, batch 7, step size 43, 100 epochs, no augmentation
  2. 300 images, batch 7, step size 1000, 50 epochs, no augmentation

Experiment No. 2 got slightly higher AP and mAP and much lower loss and val_loss. But judging from the loss and val_loss in No. 2, it starts overfitting after 20 epochs (val_loss too high and far from the training loss).

But as you mentioned, it could be wrong. I don't know whether re-feeding already-used images is wrong and will mess up the gradient/loss/model fitting or not.

I have to study the network and the basics more. My background is not computer science.

Thanks to you again. (^w^)

abhishek1222017 commented 4 years ago

I am still not clear. If I have 7,961 images with 233,591 labels, what should the step size be? I am literally confused. My GPU memory is 8 GB and it supports a batch size of 2 at most. Please advise.


pleaseRedo commented 4 years ago

@kunnareekr One epoch of your second setup is equivalent to roughly 23 epochs of the first one (1000 steps vs. 43). Overfitting is likely to happen if you have a small dataset and iterate through it many times, as in your case. I would suggest using augmentation with the default params to see if it gives any AP gains. Cutout, Mixup, and CutMix are also worth a try.
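For example, the core of Cutout is just zeroing a random patch; a toy NumPy sketch (made-up patch size, and for detection you would still have to decide how to handle boxes the patch covers):

```python
import numpy as np

def cutout(image, size=64, rng=np.random):
    """Return a copy of `image` with one random square patch zeroed out."""
    h, w = image.shape[:2]
    y = rng.randint(0, max(1, h - size))   # top-left corner of the patch
    x = rng.randint(0, max(1, w - size))
    out = image.copy()
    out[y:y + size, x:x + size] = 0
    return out
```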

pleaseRedo commented 4 years ago

@abhishek1222017 steps × batch_size is the number of samples used per epoch. The step size can be any value and won't hurt model performance by itself. I would use 4k in your case: 4,000 steps × batch 2 ≈ 8,000 samples per epoch, roughly one pass over your 7,961 images.

Resizing images to a smaller size lets you use a larger batch.

kunnareekr commented 4 years ago

@pleaseRedo

Thank you very much for your kind suggestions. I now have a better understanding of both this RetinaNet and the basics.

I did use augmentation, but it had little impact on my accuracy. I may have to add some augmentation related to color and intensity. My 2 classes differ in color and texture in some parts, but they are still very similar. (T__T) How to add new augmentations in the code will be my new issue in the future. I will review it and try first. (Actually I tried, but it didn't work; it is a programming-skill problem... LOL)
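Something like this is the kind of color/intensity jitter I mean, as a plain-NumPy sketch (made-up jitter ranges, not keras-retinanet's own augmentation API):

```python
import numpy as np

def jitter_color(image, rng=np.random):
    """Randomly perturb brightness, contrast, and per-channel gain.

    image: float array in [0, 255] with shape (H, W, 3).
    """
    img = image.astype(np.float32)
    img = img + rng.uniform(-25, 25)                    # brightness shift
    mean = img.mean()
    img = (img - mean) * rng.uniform(0.8, 1.2) + mean   # contrast stretch
    img = img * rng.uniform(0.9, 1.1, size=(1, 1, 3))   # per-channel gain
    return np.clip(img, 0, 255)
```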

SID-SURANGE commented 4 years ago

Hi @kunnareekr, can you share your thoughts on my problem? I have a very small training dataset of 490 samples (grayscale images), and each image has multiple objects. When I create the annotation file, it has around 1,246 data rows due to multiple objects in the same image.

Does the model consider 1,246 or 490 as the total number of input images? I have been using batch size 4, 1,896 steps, and 25 epochs, and I get around 0.64 mAP with the default augmentations. As per the discussion above, should I set num_steps = 490 / 4 (my batch size)? Will this be good and prevent overfitting? How can I improve the model's performance?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.