FrancescoSaverioZuppichini / detector


Data Pipeline #4

Open FrancescoSaverioZuppichini opened 1 year ago

FrancescoSaverioZuppichini commented 1 year ago

I need to develop a fast data pipeline so that training is not bottlenecked by data loading, as happens in almost all models. To achieve this I need to batch everything, store the resulting vectors to a file, and memmap it.
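A minimal sketch of the idea with numpy: preprocess in batches into a file-backed array, then memory-map it later instead of loading it eagerly. The sizes, file name, and random "preprocessing" are placeholders for illustration only.

```python
import numpy as np

# Hypothetical sizes for illustration: 1000 images, 3x32x32 uint8 pixels.
n, c, h, w = 1000, 3, 32, 32
path = "pixels.npy"  # hypothetical output file

# Preallocate a file-backed .npy array and fill it batch by batch,
# so the whole dataset never has to fit in RAM at once.
out = np.lib.format.open_memmap(path, mode="w+", dtype=np.uint8, shape=(n, c, h, w))
batch_size = 256
for start in range(0, n, batch_size):
    stop = min(start + batch_size, n)
    # Stand-in for the real decode/preprocess step on a batch of files.
    out[start:stop] = np.random.randint(0, 256, size=(stop - start, c, h, w), dtype=np.uint8)
out.flush()  # make sure everything is on disk

# Later, memory-map the file instead of reading it all into memory.
pixels = np.load(path, mmap_mode="r")
```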

There are different issues that need to be solved:

I could preprocess the files in Rust by writing out the correct numpy format

FrancescoSaverioZuppichini commented 1 year ago

We will try to use ffcv

FrancescoSaverioZuppichini commented 1 year ago

After reviewing the ffcv performance guide, I understood that:

Since we would like to work directly on pixel_data, we will switch to tensordict

FrancescoSaverioZuppichini commented 1 year ago

If I want to use tensordict, I will have to do two things:

FrancescoSaverioZuppichini commented 1 year ago

tensordict doesn't support loading a memmap array from file; I should make a PR

[EDIT] I was wrong, there is a way in the docs

FrancescoSaverioZuppichini commented 1 year ago

After a lot of experiments, resulting in this benchmark, I have concluded that the best tradeoff between mental sanity and development effort is to first implement normal dataset loading using Dataset, but implement the augmentations as nn.Module, so I can send a uint8 image to the GPU and drastically improve throughput during training
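A sketch of the augmentations-as-`nn.Module` idea: the dataloader yields raw uint8 batches (cheap to copy to the GPU), and the float conversion plus augmentations run as modules on the device. Both module names and the flip augmentation are hypothetical stand-ins, not the repo's actual transforms.

```python
import torch
from torch import nn

class ToFloat(nn.Module):
    # Converts a uint8 batch to float32 in [0, 1] on whatever device
    # the input already lives on (e.g. after a cheap uint8 H2D copy).
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.float().div_(255.0)

class RandomHorizontalFlipBatch(nn.Module):
    # Hypothetical batched augmentation: flips each image with probability p.
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = torch.rand(x.shape[0], device=x.device) < self.p
        x = x.clone()
        x[mask] = x[mask].flip(-1)  # flip width dimension for selected images
        return x

# Augmentations run on uint8, conversion to float happens last.
augment = nn.Sequential(RandomHorizontalFlipBatch(), ToFloat())
batch = torch.randint(0, 256, (8, 3, 32, 32), dtype=torch.uint8)  # would be .to("cuda")
out = augment(batch)
```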

FrancescoSaverioZuppichini commented 1 year ago

The next step to try would be to

read them in the dataset by just using the idx; still not ideal, since we will not be taking a slice, but it will probably avoid page faults
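The per-idx read could look like this minimal Dataset-style wrapper over a numpy memmap: each `__getitem__` materializes a single sample, touching only the pages backing that index. The file name and shapes are hypothetical, and the example creates the file itself so it runs standalone.

```python
import numpy as np

# Hypothetical memmapped file, as produced by the preprocessing step.
path = "pixels.npy"
n, c, h, w = 1000, 3, 32, 32
np.lib.format.open_memmap(path, mode="w+", dtype=np.uint8, shape=(n, c, h, w)).flush()

pixels = np.load(path, mmap_mode="r")

class MemmapDataset:
    # Minimal Dataset-style wrapper: __getitem__ copies one image out of
    # the memmap, so only the pages backing that index are read.
    def __init__(self, arr):
        self.arr = arr

    def __len__(self):
        return len(self.arr)

    def __getitem__(self, idx):
        return np.asarray(self.arr[idx])  # materialize just this sample

ds = MemmapDataset(pixels)
sample = ds[42]
```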