Hi @AlekseySh, starting to work on it
@Natyren , great, good luck!
@AlekseySh, I found that the memory bottleneck is not in the calculation step but in the storage step (when the CPU needs to keep the whole [60k, 60k] tensor in memory). I can write a generator based on pairwise computation; in that case memory would hold only one batch at a time. Is that an acceptable solution?
I'm sorry, I'm not sure if I understand you. Let's clarify.
The memory required for keeping the resulting matrix in memory is 60,000 × 60,000 × 4 bytes (float32) ≈ 14.4 GB
(which is close to the figures in your screenshot). This value cannot be decreased; fortunately, it's not huge.
The problem appears in the intermediate calculations, so the memory peak is much higher than 14 GB.
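For a quick sanity check of that figure (plain arithmetic, nothing OML-specific):

```python
n = 60_000                      # number of images in SOP validation
bytes_total = n * n * 4         # float32 distance matrix, 4 bytes per element
print(bytes_total / 1e9, "GB")  # -> 14.4 GB
```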
@AlekseySh, yes, I understand you. Apparently I cannot reproduce it locally: I can calculate a 60k×60k pairwise matrix between two 60k×384 matrices (initialized randomly) on my machine (I have 16 GB RAM), so I'll keep trying to find out where the issue comes from.
Hmm, weird.
Then you can try to run the whole script for which the problem appeared.
See validate_sop.py (it requires downloading the dataset):
https://github.com/OML-Team/open-metric-learning/tree/main/examples/sop
@AlekseySh I tried to reproduce the memory problem with pairwise_dist on Colab (there is only 12 GB RAM), and it didn't fail when I used 40k×384 and 60k×384 matrices. Looks like the problem is inside https://github.com/OML-Team/open-metric-learning/blob/6f9434eec71abddebd66f4174d4f62e468316533/oml/lightning/entrypoints/validate.py#L23 Maybe it's hydra (version pinning or something else), maybe somewhere else. I'm trying to run validate_sop.py in a Colab instance, but it crashed a few times without any log history. I will try again, but if you already have an error log, could you share it?
Got it, thx!
Could you run it locally? I did it once on a CPU-only machine; all the features for validation were extracted in 1 hour on a CPU.
@DaloroAT do you have an error log for SOP? I remember it crashed on your local machine recently because of OOM
Another option is to run a vanilla Python example in Colab and continue investigating there (see the validation example here: https://open-metric-learning.readthedocs.io/en/latest/examples/python.html). It works with a tiny dummy dataset, so it needs to be replaced with SOP.
Trying to reproduce it locally; I'm now stuck at this error, although all module versions correspond to the requirements.
As far as I know, pickle requires classes to be importable as exactly the same objects (see https://stackoverflow.com/questions/52185507/pickle-and-decorated-classes-picklingerror-not-the-same-object), so I'm trying to find out where the issue occurs. It could possibly be caused by my CPU architecture (ARM), but I'm not sure.
@Natyren could you try to set cache_size=0 in your datasets?
Thank you @AlekseySh, it helped me reproduce the killed-process issue locally. Now investigating its cause.
@Natyren That's great! I think you can also run this script on a small dataset like CUB or CARS, to make sure that everything works for them and that the only reason for the failure is the size of the SOP dataset.
> @DaloroAT do you have an error log for SOP? I remember it crashed on your local machine recently because of OOM
It just says Killed. OOM
I ran the script on CUB and ran both the torch and lightning profilers, and I still can't find a bottleneck on the code side. Still trying to find it, but it looks like it is at the end of the computation here (a profile of the whole pipeline in TensorBoard), and the layers corresponding to the last peak are not the ones active while the pairwise distance is computed. Probably the issue on SOP appears because RAM is occupied simultaneously by the model and the pairwise distance matrix. I will try to chunk the computation of the latter and see what happens with RAM.
Thank you for working on that.
> because RAM is occupied simultaneously by the model and the pairwise distance matrix
This one can be easily tested if you create a "distances" matrix of random values, 2 sets of embeddings, and a model simultaneously (without running any real pipelines). But I guess it's not the issue, since the peaks in RAM that I saw were way higher than the 14 GB needed to store the matrix.
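A minimal sketch of that test (shapes are taken from this thread; the model is just a stand-in, not the actual extractor):

```python
import torch

# Allocate everything that would coexist in RAM during validation,
# without running any real pipeline.
num_samples, feat_dim = 60_000, 384

embeddings_1 = torch.randn((num_samples, feat_dim))  # ~92 MB
embeddings_2 = torch.randn((num_samples, feat_dim))  # ~92 MB
distances = torch.randn((num_samples, num_samples))  # ~14.4 GB in float32

# Stand-in model; replace it with the actual extractor if needed.
model = torch.nn.Sequential(torch.nn.Linear(feat_dim, feat_dim), torch.nn.ReLU())

print(distances.element_size() * distances.nelement() / 1e9, "GB")
```

If this script survives on your machine, the mere coexistence of the model and the matrix is not the problem.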
UPD: if you think that there is no problem in the distances calculation, you can add some prints after this line: https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/metrics/embeddings.py#L183
Another candidate for the cause of the problem is this function: https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/functional/metrics.py#L16 It runs after the distances calculation.
Specifically, we had problems with RAM when sorting the distances matrix by rows, but we have since updated the implementation to use top_k: https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/functional/metrics.py#L89
But this function may still contain non-optimal code.
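For intuition, a toy comparison (a sketch, not the actual OML code): a full row-wise sort materializes values and int64 indices for every element, while topk keeps only k per row.

```python
import torch

distances = torch.randn(1_000, 60_000)  # a block of query rows
k = 5

# Full sort: two [1000, 60000] outputs (float32 values + int64 indices),
# i.e. ~0.72 GB of extra memory just for this block.
vals_sorted, ids_sorted = torch.sort(distances, dim=1)

# topk: two [1000, 5] outputs, a negligible footprint.
top_vals, top_ids = torch.topk(distances, k=k, dim=1, largest=False)

# Both give the ids of the k nearest items (up to ties).
print(torch.equal(ids_sorted[:, :k], top_ids))
```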
https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/metrics/embeddings.py#L183 works correctly; the process gets killed in https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/functional/metrics.py#L16, so I will try to narrow down where it happens and debug.
Got it. I suspect topk now :)
Thank you, will inspect it
Looks like I finally found it: the issue happens in this code https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/oml/functional/metrics.py#L107 After I deleted it from here https://github.com/OML-Team/open-metric-learning/blob/c6004e4d2f43de43ca5c480cc69dbbef06599e69/examples/sop/configs/val_sop.yaml#L30 everything works fine. So what is the best solution here: just reconfigure the config yaml, or try to reduce its memory consumption in the metrics calculation?
@Natyren Oh, I got it. It's a known issue (see the link: https://github.com/OML-Team/open-metric-learning/issues/251): this metric requires additional computational cost in terms of memory. We did not find any easy way to optimize that metric. For the rest of the datasets it's okay, but for SOP a laptop's memory may not be enough.
To address the issue, I've turned off this metric in big datasets' train configs: https://github.com/OML-Team/open-metric-learning/blob/main/examples/sop/configs/train_sop.yaml#L44 https://github.com/OML-Team/open-metric-learning/blob/main/examples/inshop/configs/train_inshop.yaml#L44
The problem is that I forgot to do that in the validation configs as well. I guess the more robust approach is to turn off that metric at the level of the calculator's default arguments: https://github.com/OML-Team/open-metric-learning/blob/main/oml/metrics/embeddings.py#L70
Could you create a PR?
After turning off this metric, please check that everything is fine.
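To illustrate the idea (a sketch only; fmr_vals is the parameter name discussed later in this thread, and the real signature in embeddings.py may differ):

```python
from typing import Dict, Tuple

import torch


def calc_retrieval_metrics_sketch(
    distances: torch.Tensor,
    cmc_top_k: Tuple[int, ...] = (1,),
    fmr_vals: Tuple[int, ...] = (),  # empty by default: fnmr@fmr is skipped
) -> Dict[str, torch.Tensor]:
    metrics: Dict[str, torch.Tensor] = {}
    # ... cmc and the other cheap metrics would be computed here ...
    if fmr_vals:
        # fnmr@fmr runs only when a config explicitly requests it,
        # since it is memory-hungry on 60k x 60k distance matrices.
        pass
    return metrics
```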
Got it. After deleting this parameter from the args, I will make a PR. Do I understand correctly that this solution is temporary, and once a contribution decreasing the fmr memory footprint appears, this arg will be turned back on?
Hey @Natyren, which version of PyTorch did you test with? We have no fixed dependency, just a lower bound.
@DaloroAT Hello, I used the upper end of the required versions, torch 1.13.0, which was installed automatically with the setup (oml in dev mode)
@Natyren just to make sure: we don't remove the parameter, we just set an empty default argument for it to avoid calculating that metric
if we optimize the metric, we can put it back (but it's not a priority for sure)
What operating system are you using?
@DaloroAT macOS Monterey 12.5
@AlekseySh understood, I will do it that way
Created pull request #290: changed the default value to None and deleted the line about this metric in the SOP config.yaml
@Natyren, As far as I know, @DaloroAT still has the problem, and it appeared even before reaching the metrics calculation: it crashed on the distance matrix calculation. Could you provide more details, please?
@DaloroAT could you provide information about your RAM capacity, please?
Also, please provide the specification of your system, CPU, and installed packages.
I will provide detailed info later, after work. But as a starting point:
The scripts failed during SOP validation at the calc_distance_matrix step with OOM. If I set num_workers<=2, then it is possible to calculate, but this step requires 20 GB.
Probably the Linux/macOS or CPU/GPU versions have different implementations of cdist.
Sorry for the previous interruption, I was at work. When you have time, I think downgrading torch to version 1.3.0 is worth a try (though more probably that's not the issue here): https://github.com/OML-Team/open-metric-learning/blob/a0f24c39bb31bdf627eda269763ecfab294b289a/ci/requirements.txt#L2 Also, please provide information about the tensors you pass to pairwise_distance.
Hey @Natyren I rechecked my setup.
PC:
Python packages:
I have tried several approaches to run training on the SOP dataset. I always run train_sop.py with cache_size=0, replaced the category_balance sampler with balance to fit GPU memory, and added the parameter limit_train_batches=1 in the PL trainer here. The last step was done because the training part works well, but validation fails with OOM.
RAM usage before starting is ~4.3 GB. I ran with the following parameters:
1) num_workers=6. Inference on GPU is ok, but the process gets killed during calc_distance_matrix in the metrics.
2) num_workers=0. Inference on GPU is ok, distances is calculated (the step consumed ~20 GB), but it failed in calc_retrieval_metrics with OOM, even when I calculated only cmc/1.
I also tried to run the snippet:
```python
import torch

from oml.functional.metrics import calc_distance_matrix

# Mimic SOP validation: every sample acts as both query and gallery.
num_samples = 61000
feat_dim = 384

embeddings = torch.randn((num_samples, feat_dim))
is_query = torch.ones(num_samples)
is_gallery = torch.ones(num_samples)

distances = calc_distance_matrix(embeddings, is_query, is_gallery)
```
It consumes <16 GB.
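For reference, the peak memory of such a snippet can be checked like this (a sketch using the standard library; resource is Unix-only, and ru_maxrss units differ across OSes):

```python
import resource
import sys

# Run this right after the snippet above to see its peak memory usage.
# ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
scale = 1e9 if sys.platform == "darwin" else 1e6
print(f"peak RSS: {peak / scale:.1f} GB")
```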
Have you tried to launch training on your Mac?
Hi @DaloroAT, I understand your issue. No, I haven't tried to train on my Mac, but as far as I know Kaggle notebooks have similar characteristics, so I will probably try it there.
You don't actually need to train a full epoch for my issue. Just do 1 step of training and the full validation.
Let me know what you get in Kaggle.
This issue is available again for anyone who wants to work on it :)
there is a more up-to-date issue: https://github.com/OML-Team/open-metric-learning/issues/506
The pairwise_dist function produces a huge memory footprint, which has to be decreased (https://github.com/OML-Team/open-metric-learning/blob/b9d70cd15147a7f937499ac6d2838a71b0f4348b/oml/utils/misc_torch.py#L85).
Context: this function is needed when we calculate the distance matrix between queries and galleries during validation. For Stanford Online Products, we need to calculate the matrix between two tensors with a shape of 60k x 384 each (60k stands for the number of images, and 384 is the dimension of the features). It cannot fit into the memory of a personal computer.
One possible solution may be to calculate the distances in batches, as sketched below.
The solution has to be tested, and the memory footprint reported for the sizes mentioned above.
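A minimal sketch of the batched approach (an illustration only, not the current pairwise_dist implementation; the signature is an assumption):

```python
import torch


def pairwise_dist_batched(x1: torch.Tensor, x2: torch.Tensor, batch_size: int = 512) -> torch.Tensor:
    """Euclidean distance matrix computed in row chunks.

    Only a [batch_size, len(x2)] block is materialized per step on top of
    the preallocated output, which caps the intermediate memory peak.
    """
    out = torch.empty(len(x1), len(x2), dtype=x1.dtype)
    for i in range(0, len(x1), batch_size):
        out[i : i + batch_size] = torch.cdist(x1[i : i + batch_size], x2)
    return out


if __name__ == "__main__":
    # Sizes from this issue: ~60k x 384 embeddings for queries and galleries.
    q = torch.randn(60_000, 384)
    g = torch.randn(60_000, 384)
    d = pairwise_dist_batched(q, g)  # the output alone is ~14.4 GB in float32
    print(d.shape)
```

Note that the [60k, 60k] float32 output still takes ~14.4 GB on its own; batching only caps the intermediate peak.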