FrancescoSaverioZuppichini / detector


Data Pipeline #4

Open FrancescoSaverioZuppichini opened 1 year ago

FrancescoSaverioZuppichini commented 1 year ago

I need to develop a fast data pipeline so that training is not bottlenecked by data loading, as happens in almost all models. To achieve this I need to batch everything, store the resulting vectors to a file, and memmap it.
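A minimal sketch of the idea with numpy: preprocess in batches into a file-backed array, then memory-map it later instead of loading it eagerly. The sizes, file name, and random "preprocessing" are placeholders for illustration only.

```python
import numpy as np

# Hypothetical sizes for illustration: 1000 images, 3x32x32 uint8 pixels.
n, c, h, w = 1000, 3, 32, 32
path = "pixels.npy"  # hypothetical output file

# Preallocate a file-backed .npy array and fill it batch by batch,
# so the whole dataset never has to fit in RAM at once.
out = np.lib.format.open_memmap(path, mode="w+", dtype=np.uint8, shape=(n, c, h, w))
batch_size = 256
for start in range(0, n, batch_size):
    stop = min(start + batch_size, n)
    # Stand-in for the real decode/preprocess step on a batch of files.
    out[start:stop] = np.random.randint(0, 256, size=(stop - start, c, h, w), dtype=np.uint8)
out.flush()  # make sure everything is on disk

# Later, memory-map the file instead of reading it all into memory.
pixels = np.load(path, mmap_mode="r")
```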

There are different issues that need to be solved:

I could preprocess the files in Rust by writing out the correct numpy format

FrancescoSaverioZuppichini commented 1 year ago

We will try to use ffcv

FrancescoSaverioZuppichini commented 1 year ago

After reviewing the ffcv performance guide, I understood that:

Since we would like to work directly on pixel_data, we will switch to tensordict

FrancescoSaverioZuppichini commented 1 year ago

If I want to use tensordict, I will have to do two things:

FrancescoSaverioZuppichini commented 1 year ago

tensordict doesn't support loading a memmap array from file; I should make a PR

[EDIT] I was wrong, there is a way in the docs

FrancescoSaverioZuppichini commented 1 year ago

After a lot of experiments, resulting in this benchmark, I have concluded that the best tradeoff between mental sanity and development effort is to first implement normal dataset loading using Dataset, but implement the augmentations as nn.Module, so I can send a uint8 image to the GPU and drastically improve throughput during training
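A sketch of the augmentations-as-`nn.Module` idea: the dataloader yields raw uint8 batches (cheap to copy to the GPU), and the float conversion plus augmentations run as modules on the device. Both module names and the flip augmentation are hypothetical stand-ins, not the repo's actual transforms.

```python
import torch
from torch import nn

class ToFloat(nn.Module):
    # Converts a uint8 batch to float32 in [0, 1] on whatever device
    # the input already lives on (e.g. after a cheap uint8 H2D copy).
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.float().div_(255.0)

class RandomHorizontalFlipBatch(nn.Module):
    # Hypothetical batched augmentation: flips each image with probability p.
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = torch.rand(x.shape[0], device=x.device) < self.p
        x = x.clone()
        x[mask] = x[mask].flip(-1)  # flip width dimension for selected images
        return x

# Augmentations run on uint8, conversion to float happens last.
augment = nn.Sequential(RandomHorizontalFlipBatch(), ToFloat())
batch = torch.randint(0, 256, (8, 3, 32, 32), dtype=torch.uint8)  # would be .to("cuda")
out = augment(batch)
```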

FrancescoSaverioZuppichini commented 1 year ago

The next step to try would be to

read them in the dataset by just using the idx; still not ideal, since we will not be taking a slice, but it will probably avoid page faults
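The per-idx read could look like this minimal Dataset-style wrapper over a numpy memmap: each `__getitem__` materializes a single sample, touching only the pages backing that index. The file name and shapes are hypothetical, and the example creates the file itself so it runs standalone.

```python
import numpy as np

# Hypothetical memmapped file, as produced by the preprocessing step.
path = "pixels.npy"
n, c, h, w = 1000, 3, 32, 32
np.lib.format.open_memmap(path, mode="w+", dtype=np.uint8, shape=(n, c, h, w)).flush()

pixels = np.load(path, mmap_mode="r")

class MemmapDataset:
    # Minimal Dataset-style wrapper: __getitem__ copies one image out of
    # the memmap, so only the pages backing that index are read.
    def __init__(self, arr):
        self.arr = arr

    def __len__(self):
        return len(self.arr)

    def __getitem__(self, idx):
        return np.asarray(self.arr[idx])  # materialize just this sample

ds = MemmapDataset(pixels)
sample = ds[42]
```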