deephealthproject / use-case-pipelines

Use case pipelines based on EDDL and ECVL libraries. Different tasks (e.g. classification and segmentation) and datasets (e.g. MNIST, ISIC, and PNEUMOTHORAX) are taken into account.

LoadBatch long delays... #10

Closed jflich closed 3 years ago

jflich commented 3 years ago

I have been playing for two days with the skin_cancer_segmentation use case, both training and inference. It works, and the training converges as expected. However, on the inference side I see a huge disparity between the time spent in the LoadBatch method and in the forward pass of the network (ResNet). I am using a Tesla V100 and BeeGFS as the file system. I see the following results:

- batch of 1 image: LoadBatch 0.5-2 s, forward 120 ms
- batch of 12 images: LoadBatch 16-20 s, forward 259 ms
- batch of 32 images: LoadBatch 48 s, forward 451 ms

This ~100x difference is quite strange to me and makes me wonder whether our system configuration is the cause of those delays. I would be grateful if you could tell me whether those delays are expected, or whether I should do something on my side to reduce them to milliseconds.

Notice that the GPU is almost idle, since we spend 99% of the time preparing a batch, even when it is a single image… If you confirm these differences, is there any possible solution for reducing the LoadBatch delay? Anything we could do on our side?
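
For reference, this is roughly how I am measuring the two phases; a minimal sketch, assuming the ECVL `DLDataset::LoadBatch` and EDDL `forward` calls used in the pipeline (exact signatures and header paths may differ in your version):

```cpp
#include <chrono>
#include <iostream>

#include "ecvl/support_eddl.h"  // ecvl::DLDataset (header path may differ per install)
#include "eddl/apis/eddl.h"     // eddl::forward, model

using namespace std::chrono;

// Time one inference step: batch preparation vs. forward pass.
// `d` is the ecvl::DLDataset, `net` the built EDDL model, `x`/`y` the batch tensors.
void TimedInferenceStep(ecvl::DLDataset& d, model net, Tensor*& x, Tensor*& y)
{
    auto t0 = steady_clock::now();
    d.LoadBatch(x, y);                 // load + augment the whole batch
    auto t1 = steady_clock::now();
    eddl::forward(net, { x });         // forward pass only
    auto t2 = steady_clock::now();

    std::cout << "LoadBatch: " << duration_cast<milliseconds>(t1 - t0).count() << " ms, "
              << "forward: "   << duration_cast<milliseconds>(t2 - t1).count() << " ms\n";
}
```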

Thanks

lauracanalini commented 3 years ago

Did you mean SegNet as the network for skin_cancer_segmentation? For me as well, LoadBatch takes 0.4-1.8 s for one image, while the forward is around 70 ms. We can't use a batch size greater than 10 because our GPU has 11 GB of memory; in that case LoadBatch takes 10-15 s and the forward is around 180 ms. I noticed that in the dataset we use for classification the images range from 600x400 to 1024x1024, while in the segmentation dataset they range from 4288x2848 to 6668x4439, so loading them and applying the augmentations (in inference, only ResizeDim) certainly takes more time, also because we have to repeat the same operations twice (once for the sample and once for the ground truth).
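
As a rough illustration of the per-sample cost, the preparation step has to decode and resize both the image and its mask; a minimal ECVL sketch (the helper name is mine, and the actual pipeline code may differ):

```cpp
#include <string>
#include <vector>

#include "ecvl/core.h"  // ecvl::Image, ImRead, ResizeDim (header path may differ)

using namespace ecvl;

// Load one sample and its ground truth and resize both to the network input size.
// With segmentation images up to 6668x4439, decoding and resizing twice per sample
// dominates the batch preparation time.
void PrepareSample(const std::string& img_path, const std::string& gt_path,
                   const std::vector<int>& size, Image& img, Image& gt)
{
    Image tmp;
    ImRead(img_path, tmp);                                // decode full-resolution image
    ResizeDim(tmp, img, size, InterpolationType::linear);
    ImRead(gt_path, tmp);                                 // decode full-resolution mask
    ResizeDim(tmp, gt, size, InterpolationType::nearest); // keep labels discrete
}
```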

To speed this up, we can apply the same parallelization method that is already in use for skin_lesion_classification, which prepares the next batch while the previous one is being processed (see the sketch below). I will do it ASAP.
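
The idea is a simple producer/consumer scheme; this is only a generic sketch of the pattern, not the actual DataGenerator code:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Generic batch-prefetching pattern: N producer threads run the LoadBatch-like
// code and push ready batches; the training/inference loop pops them, so batch
// preparation overlaps with the forward pass running on the GPU.
struct Batch { /* tensors for images and ground truth */ };

class BatchQueue {
public:
    void Push(Batch b) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(b)); }
        cv_.notify_one();
    }
    Batch Pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Batch b = std::move(q_.front());
        q_.pop();
        return b;
    }
private:
    std::queue<Batch> q_;
    std::mutex m_;
    std::condition_variable cv_;
};
```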

jflich commented 3 years ago

Thanks Laura,

anyway, I do not think this parallelization method will solve the issue. You can overlap, but only for the few milliseconds the forward takes...

I think the problem is with the efficiency of the data augmentation process... Are you using OpenMP to parallelize the data augmentation kernels?
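
For example, the per-image work within a single batch could be spread over the cores; a minimal OpenMP sketch of what I mean (LoadAndAugmentSample is a hypothetical helper, not pipeline code; compile with -fopenmp or equivalent):

```cpp
// LoadAndAugmentSample stands in for the per-image work done inside LoadBatch
// (decode, resize, copy into the batch tensor).
void LoadAndAugmentSample(int index);

// Spread the per-image work of one batch across the available cores.
void LoadBatchParallel(int batch_size)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < batch_size; ++i) {
        LoadAndAugmentSample(i);  // each image/mask pair is independent
    }
}
```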

lauracanalini commented 3 years ago

It actually speeds things up a lot, because it prepares in parallel a number of batches equal to the number of "producers" specified in the DataGenerator. I ran some tests, switching from full_mem to low_mem in the EDDL toGPU method so that I could use bigger batches; these are the results:

| batch_size | version | total time | mean LoadBatch | mean forward |
|---|---|---|---|---|
| 1 | old | 785.996 s | 1.195 s | 71.440 ms |
| 1 | parallelized | 164.204 s | 0.125118 s | 76.1899 ms |
| 12 | old | 726.96 s | 13.9563 s | 226.325 ms |
| 12 | parallelized | 158.91 s | 2.34495 s | 223.744 ms |
| 48 | old | 690.207 s | 55.3276 s | 670.4336 ms |
| 48 | parallelized | 167.77 s | 14.0762 s | 693.699 ms |
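
For reference, the memory change is just the consumption argument used when attaching the model to the GPU; roughly like this (the exact toGPU/CS_GPU overload may vary between EDDL versions):

```cpp
// Attach the model to the first GPU with reduced memory consumption,
// so that larger batch sizes fit ("low_mem" instead of the default "full_mem").
eddl::toGPU(net, { 1 }, "low_mem");
```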

I'm going to push these changes to the pipeline.