OML-Team / open-metric-learning

Metric learning and retrieval pipelines, models and zoo.
https://open-metric-learning.readthedocs.io/en/latest/index.html
Apache License 2.0

Investigate & solve the problem with OOM on SOP dataset #283

Closed: AlekseySh closed this issue 7 months ago

AlekseySh commented 1 year ago

The pairwise_dist function produces a huge memory footprint, which has to be decreased (https://github.com/OML-Team/open-metric-learning/blob/b9d70cd15147a7f937499ac6d2838a71b0f4348b/oml/utils/misc_torch.py#L85).

Context: this function is needed when we calculate the distance matrix between queries and galleries during validation. For Stanford Online Products, we need to calculate the matrix between two tensors with the shape of 60k x 384 each (60k stands for the number of images, and 384 is the dimension of the features). It cannot fit into the memory of a personal computer.

One of the possible solutions may be to calculate the distances in batches.

The solution has to be tested, and the memory footprint on the sizes mentioned above has to be reported.
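A minimal sketch of the batching idea, not the library's actual implementation (the function name and the batch size are illustrative):

```python
import torch


def pairwise_dist_batched(x1: torch.Tensor, x2: torch.Tensor, batch_size: int = 1024) -> torch.Tensor:
    # Build the full (n1 x n2) Euclidean distance matrix, but feed x1 to cdist in chunks
    # so that the intermediate buffers stay proportional to the batch size, not to n1.
    blocks = []
    for start in range(0, x1.shape[0], batch_size):
        blocks.append(torch.cdist(x1[start:start + batch_size], x2))
    return torch.cat(blocks, dim=0)


# For the SOP-like case: two 60k x 384 tensors still produce a ~14.4 GB result,
# but the extra memory on top of the result is bounded by the batch size.
# distances = pairwise_dist_batched(torch.randn(60_000, 384), torch.randn(60_000, 384))
```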

Natyren commented 1 year ago

Hi @AlekseySh, I'm starting to work on it.

AlekseySh commented 1 year ago

@Natyren , great, good luck!

Natyren commented 1 year ago

@AlekseySh, I found that the memory bottleneck is not in the calculation step but in the storage step (the CPU needs to keep the whole [60k, 60k] tensor in memory). I can write a generator based on pairwise computation; in that case only one batch is kept in memory at a time. Is that an okay solution?

[Screenshot attached: memory usage figures]
AlekseySh commented 1 year ago

I'm sorry, I'm not sure if I understand you. Let's clarify.

The memory required for keeping the resulting matrix in memory is 60,000 x 60,000 float32 values x 4 bytes ≈ 14.4 GB (which is close to the figures on your screenshot). This value cannot be decreased; fortunately, it's not huge.

The problem appears in the intermediate calculations, so the memory peak is much higher than 14 GB.
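The same estimate as a one-liner (just the arithmetic, assuming float32 storage):

```python
n = 60_000
print(n * n * 4 / 1e9)  # ~14.4 GB for a float32 n x n matrix
```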

Natyren commented 1 year ago

@AlekseySh, yes, I understand you. Apparently I cannot reproduce it locally: I can compute a 60k x 60k pairwise matrix between two 60k x 384 matrices (initialised randomly) on my machine (I have 16 GB of RAM), so I'll keep investigating where the issue comes from.

AlekseySh commented 1 year ago

Hmm, weird.

Then you can try to run the whole script for which the problem appeared.

See validate_sop.py (it requires downloading the dataset): https://github.com/OML-Team/open-metric-learning/tree/main/examples/sop

Natyren commented 1 year ago

@AlekseySh I tried to reproduce the memory problem with pairwise_dist on Colab (there is only 12 GB of RAM), and it didn't crash when I used 40k x 384 and 60k x 384 matrices. It looks like the problem is inside https://github.com/OML-Team/open-metric-learning/blob/6f9434eec71abddebd66f4174d4f62e468316533/oml/lightning/entrypoints/validate.py#L23 Maybe it's related to hydra (version specification or something else), maybe it's somewhere else. I tried to run validate_sop.py in a Colab instance, but it crashed a few times without any log history. I will try again, but if you already have an error log, could you provide it?

AlekseySh commented 1 year ago

Got it, thx!

Could you run it locally? I did it once on a CPU-only machine; all the features for validation were extracted in 1 hour on the CPU.

@DaloroAT do you have an error log for SOP? I remember it crashed on your local machine recently because of OOM

AlekseySh commented 1 year ago

Another option is to run the vanilla Python example in Colab and continue investigating there (see the validation example here: https://open-metric-learning.readthedocs.io/en/latest/examples/python.html). It works with a tiny dummy dataset, so it needs to be replaced with SOP.

Natyren commented 1 year ago

Trying to reproduce it locally; now I'm stuck at this error, although all module versions correspond to the requirements.

[Screenshot attached: pickling error traceback]

As far as I know, pickle requires classes to be importable from the module they were defined in (see https://stackoverflow.com/questions/52185507/pickle-and-decorated-classes-picklingerror-not-the-same-object), so I'm trying to find out where the issue occurs. It might also be caused by my CPU architecture (ARM), but I'm not sure.

AlekseySh commented 1 year ago

@Natyren could you try to set cache_size=0 in your datasets?

Natyren commented 1 year ago

Thank you @AlekseySh, that helped me reproduce the killed-process issue locally. Now I'm investigating its causes.

AlekseySh commented 1 year ago

@Natyren That's great! I think you can also run this script on a small dataset like CUB or CARS to make sure that everything works for them and that the only reason for the failure is the size of the SOP dataset.

DaloroAT commented 1 year ago

@DaloroAT do you have an error log for SOP? I remember it crashed on your local machine recently because of OOM

It just says Killed. OOM.

Natyren commented 1 year ago

I ran the script on CUB and ran both the torch and lightning profilers, but I still haven't found a bottleneck on the code side. It looks like it is at the end of the computation (see the profiler trace of the whole pipeline in TensorBoard): [Screenshot attached: profiler trace]. The layers corresponding to the last peak are not the same ones that are active while the pairwise distance is computed. Probably the issue on SOP appears because RAM is occupied simultaneously by the model and the pairwise distance matrix. I will try to chunk the computation of the latter and see what happens with RAM.

AlekseySh commented 1 year ago

Thank you for working on that.

because RAM occupied simultaneously with model and pairwise distance matrix

This one can be easily tested if you create a "distances" matrix of random values, 2 sets of embeddings, and a model simultaneously (without running any real pipelines). But I guess it's not the issue, since the peaks in RAM that I saw were way higher than the 14 GB needed to store the matrix.
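A standalone sketch of that test; the resnet50 is only a stand-in for the real extractor, and the shapes are the SOP-like ones from above:

```python
import torch
import torchvision

# Allocate everything that would coexist in RAM during validation,
# without running any real pipeline, and watch the memory of this process.
model = torchvision.models.resnet50()     # stand-in for the extractor
queries = torch.randn(60_000, 384)
galleries = torch.randn(60_000, 384)
distances = torch.rand(60_000, 60_000)    # ~14.4 GB in float32

input("Check the RSS of this process now, then press Enter to exit.")
```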

AlekseySh commented 1 year ago

UPD: if you think that there is no problem in the distances calculation, you can add some prints after this line: https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/metrics/embeddings.py#L183
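For the prints, a hedged sketch of a helper that logs the process memory (it assumes psutil is available; it is not part of the library):

```python
import os

import psutil


def log_ram(tag: str) -> None:
    # Print the resident memory of the current process in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"[{tag}] RSS: {rss_gb:.1f} GB")


# Call log_ram("before distances") and log_ram("after distances")
# around the suspected lines to see where the jump happens.
```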

AlekseySh commented 1 year ago

Another candidate for the cause of the problem is this function: https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/functional/metrics.py#L16 It runs after the distances calculation.

Specifically, we had problems with RAM when sorting the distances matrix by rows, but we've since updated the implementation to use top_k: https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/functional/metrics.py#L89

But this function may still contain suboptimal code.
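A small sketch of why top_k is gentler on memory than a full sort (stand-in shapes, not the library code):

```python
import torch

distances = torch.rand(1_000, 2_000)  # stand-in for the (n_queries, n_galleries) matrix
k = 5

# A full sort materializes a sorted copy plus an int64 index matrix of the same shape,
# roughly tripling the memory next to the original float32 distances.
_, ids_sorted = torch.sort(distances, dim=1)

# topk only allocates (n_queries, k) values and indices.
_, ids_top = torch.topk(distances, k=k, dim=1, largest=False)
```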

Natyren commented 1 year ago

https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/metrics/embeddings.py#L183 works correctly; the process gets killed at the https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/functional/metrics.py#L16 line, so I will try to find where exactly it happens and debug it.

AlekseySh commented 1 year ago

Got it. I suspect topk now :)

Natyren commented 1 year ago

Thank you, I will inspect it.

Natyren commented 1 year ago

Looks like I finally found it. The issue happens in this code: https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/functional/metrics.py#L107 After I deleted it from here https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/examples/sop/configs/val_sop.yaml#L30 everything works fine. So what is the best solution here: just reconfigure the config yaml, or try to reduce its computation in the metrics calculation?

AlekseySh commented 1 year ago

@Natyren Oh, I got it. It's a known issue (see the link: https://github.com/OML-Team/open-metric-learning/issues/251): this metric requires additional computational cost in terms of memory. We did not find any easy approach to optimize that metric. For the rest of the datasets it's okay, but for SOP a laptop's memory may not be enough.

To address the issue, I've turned off this metric in the big datasets' train configs: https://github.com/OML-Team/open-metric-learning/blob/main/examples/sop/configs/train_sop.yaml#L44 https://github.com/OML-Team/open-metric-learning/blob/main/examples/inshop/configs/train_inshop.yaml#L44

The problem is that I forgot to do that in the validation configs as well. I guess the more robust approach is to turn off that metric at the level of the calculator's default arguments: https://github.com/OML-Team/open-metric-learning/blob/main/oml/metrics/embeddings.py#L70

Could you create a PR?

After turning off this metric, please check if everything is fine.
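A simplified sketch of what the change could look like (this is not the real class body; it only assumes the calculator exposes an fmr_vals-style argument controlling that metric, as the linked configs suggest):

```python
from typing import Tuple


class EmbeddingMetricsSketch:
    # Simplified stand-in for the calculator in oml/metrics/embeddings.py.
    def __init__(self, cmc_top_k: Tuple[int, ...] = (1,), fmr_vals: Tuple[int, ...] = ()):
        # An empty fmr_vals by default means the memory-hungry metric is skipped
        # unless the user opts in explicitly, e.g. fmr_vals=(1,).
        self.cmc_top_k = cmc_top_k
        self.fmr_vals = fmr_vals
```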

Natyren commented 1 year ago

Got it. After deleting this parameter from the args, I will make a PR. Do I understand correctly that this solution is temporary, and that once a contribution decreasing the FMR memory footprint appears, this arg will be turned back on?

DaloroAT commented 1 year ago

Hey @Natyren, which version of PyTorch did you test? We have no fixed dependency, just a lower bound.

Natyren commented 1 year ago

@DaloroAT Hello, I used torch 1.13.0 (the newest version allowed by the requirements), which was installed automatically during setup (oml in dev mode).

AlekseySh commented 1 year ago

@Natyren just to make sure: we don't remove the parameter, we just set an empty default argument for it to avoid calculating that metric.

If we optimize the metric, we can put it back (but it's not a priority for sure).

DaloroAT commented 1 year ago

What operating system are you using?

Natyren commented 1 year ago

@DaloroAT macOS Monterey 12.5

Natyren commented 1 year ago

@AlekseySh understood, I will do it this way.

Natyren commented 1 year ago

Created pull request #290: changed the default value to None and deleted the line about this metric in the SOP config.yaml.

AlekseySh commented 1 year ago

@Natyren, as far as I know, @DaloroAT still has the problem, and it appeared even before getting into the metrics calculation: it crashed on the distance matrix calculation. Could you provide more details, please?

Natyren commented 1 year ago

@DaloroAT could you provide information about your RAM capacity, please?

Natyren commented 1 year ago

Also, please provide the specification of your system, your CPU, and the installed packages.

DaloroAT commented 1 year ago

I will provide detailed info later, after work. But as a starting point:

The scripts failed during SOP validation at the calc_distance_matrix step with OOM. If I set num_workers<=2, then the calculation is possible, but this step requires ~20 GB.

DaloroAT commented 1 year ago

Probably the Linux/macOS or CPU/GPU builds have different implementations of cdist.
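One thing that may be worth checking in this direction is torch.cdist's compute_mode, since the matmul-based Euclidean path trades extra intermediate memory for speed and its choice could differ between setups (a sketch; run the two modes in separate processes to compare peak RAM):

```python
import torch


def run(mode: str) -> None:
    x = torch.randn(60_000, 384)
    y = torch.randn(60_000, 384)
    d = torch.cdist(x, y, compute_mode=mode)
    print(mode, d.shape)


# run("use_mm_for_euclid_dist_if_necessary")  # default: may use the matmul-based path
# run("donot_use_mm_for_euclid_dist")         # slower, but without the matmul intermediates
```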

Natyren commented 1 year ago

Sorry for the previous interruption from work. When you have time, I think downgrading torch to version 1.3.0 is a good idea (though most probably it's not the issue here): https://github.com/OML-Team/open-metric-learning/blob/a0f24c39bb31bdf627eda269763ecfab294b289a/ci/requirements.txt#L2 Also, please provide information about the tensors you pass to pairwise_distance.

DaloroAT commented 1 year ago

Hey @Natyren I rechecked my setup.

PC:

Python packages:

I have tried several approaches to run training on the SOP dataset. I always ran train_sop.py with cache_size=0, replaced the category_balance sampler with balance to fit into GPU memory, and added the parameter limit_train_batches=1 to the PL trainer here. The last step was done because the training part works well, but validation fails with OOM.

RAM usage before starting is ~4.3 GB. I ran with the following parameters:
1) num_workers=6: inference on GPU is ok, but the process gets killed during calc_distance_matrix in the metrics.
2) num_workers=0: inference on GPU is ok, the distances are calculated (this step consumed ~20 GB), but it failed on calc_retrieval_metrics with OOM, even when I calculated only cmc/1.

I also tried to run this snippet:

```python
import torch

from oml.functional.metrics import calc_distance_matrix

num_samples = 61000
feat_dim = 384

embeddings = torch.randn((num_samples, feat_dim))
is_query = torch.ones(num_samples)
is_gallery = torch.ones(num_samples)

distances = calc_distance_matrix(embeddings, is_query, is_gallery)
```

It consumes less than 16 GB.
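Since the numbers being compared here are peaks rather than steady-state usage, it may help to report ru_maxrss from the same process (a sketch; note that ru_maxrss is in kilobytes on Linux and bytes on macOS):

```python
import resource

# Print the peak resident set size of the current process so far.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak}")
```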

Have you tried to launch training on your Mac?

Natyren commented 1 year ago

Hi @DaloroAT, I understand your issue. No, I haven't tried to train on my Mac, but as far as I know Kaggle notebooks have similar characteristics, so I will probably try it there.

DaloroAT commented 1 year ago

You don't actually need to train a full epoch for my issue. Just do 1 step of training and a full validation.

Let me know what you get in Kaggle.

AlekseySh commented 1 year ago

This issue is available again for anyone who wants to work on it :)

AlekseySh commented 7 months ago

related: https://github.com/OML-Team/open-metric-learning/issues/506

AlekseySh commented 7 months ago

there is a more up-to-date issue: https://github.com/OML-Team/open-metric-learning/issues/506