facebookresearch / fastMRI

A large-scale dataset of both raw MRI measurements and clinical MRI images.
https://fastmri.org
MIT License

Memory leak with `h5py` from `pip` and conversion to `torch.Tensor` #215

Open Breeze-Zero opened 2 years ago

Breeze-Zero commented 2 years ago

I recently tried to run some experiments with my model on the multi-coil fastMRI brain data. Because I need flexibility (and also don't have the spare time to learn PyTorch Lightning), I didn't use PyTorch Lightning directly. Instead, I used plain PyTorch, but during iteration, even with only num_workers=2, my memory footprint was quite large at the beginning. As the number of iterations increased, an error occurred: RuntimeError: DataLoader worker (PID 522908) is killed by signal: Killed. I checked the other parts of the training code, but no obvious memory accumulation was found. Therefore, I thought there was a good chance the problem was in SliceDataset. I simply looped over the DataLoader with "pass" as the body and found that the memory usage kept rising.

mmuckley commented 2 years ago

Hello @834799106, thanks for putting an issue here.

Based on your error, I doubt the SliceDataset class is the issue. For one, your program is not being terminated due to memory; it is being terminated because some process killed the overall program. Also, if you look at the __getitem__ function, you can see that there are no side effects: everything the function creates is either returned to the calling function or destroyed.

In order to verify a memory leak, we will need a reproducible example for your case, since you're not using the PyTorch Lightning modules. Also, please let us know which version of PyTorch you are using and any information you have on the memory usage throughout an epoch. Note: high memory at the start might be expected, as you have your model in memory. There is also some metadata about the dataset that is precomputed and stored in memory.
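
For reference, here is a minimal sketch of how per-batch memory could be logged with psutil (psutil itself, the placeholder data path, and the helper name are assumptions, not part of the fastMRI code):

import os

import psutil  # assumed to be installed separately
from torch.utils.data import DataLoader
from tqdm import tqdm

from fastmri.data import SliceDataset
from fastmri.data.transforms import UnetDataTransform


def log_rss_per_batch(data_path="/path/multicoil_train", log_every=100):
    """Iterate the dataset and print the main process RSS every few batches."""
    dataset = SliceDataset(
        root=data_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil"),  # no mask function, for simplicity
    )
    loader = DataLoader(dataset, batch_size=1, num_workers=2)
    proc = psutil.Process(os.getpid())

    for i, batch in enumerate(tqdm(loader)):
        if i % log_every == 0:
            # RSS of the main process only; worker processes hold their own memory.
            print(f"batch {i}: RSS = {proc.memory_info().rss / 2**30:.2f} GiB")

Numbers like these across an epoch would make it much easier to distinguish a leak from normal start-up overhead.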

soumickmj commented 2 years ago

Hi @mmuckley, I was about to file an issue for a memory leak, though I'm not sure it's the same issue as @834799106's. I have created a small piece of code to reproduce it.

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

mask_func = create_mask_for_mask_type(
    mask_type_str="random", center_fractions=[0.08], accelerations=[8]
)

root_gt = "/data/project/fastMRI/Brain/multicoil_train"

sd = SliceDataset(
    root=root_gt,
    challenge="multicoil",
    transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    use_dataset_cache=True,
    dataset_cache_file=f"{os.path.dirname(root_gt)}/datasetcache{os.path.basename(root_gt)}.pkl",
)
dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

for e in tqdm(dl):
    del e
    pass

I'm currently using the latest git pull of fastMRI.

While running this code, I was monitoring the memory usage: even though I'm deleting the variable, the memory usage still increases constantly. Originally, this was part of another pipeline of mine where I only use the SliceDataset and not the whole Lightning module. If you would like to have a look, this is the code: https://github.com/soumickmj/NCC1701/blob/main/Engineering/datasets/fastMRI.py

I originally thought that maybe my own code was creating the leak, but the other dataset modes (different code for reading other datasets) of the same NCC1701 pipeline of mine did not leak. So I wrote the small script above to see whether the leak is still there when my pipeline is not involved.

soumickmj commented 2 years ago

I also got a similar behaviour while using the Data Module.

from fastmri.data.transforms import UnetDataTransform
from fastmri.pl_modules import FastMriDataModule
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

mask_func = create_mask_for_mask_type(
    mask_type_str="random", center_fractions=[0.08], accelerations=[8]
)

root_gt = "/data/project/fastMRI/Brain"

data_module = FastMriDataModule(
    data_path=root_gt,
    challenge="multicoil",
    train_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    val_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    test_transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    batch_size=1,
    num_workers=10,
)
dl = data_module.train_dataloader()

for e in tqdm(dl):
    del e
    pass

mmuckley commented 2 years ago

Hello @soumickmj, I ran your script on the knee validation data with memory-profiler, and memory usage peaked pretty early at a little less than 5 GB (see attached), then stayed flat for the rest of the dataset (which does not suggest a leak).

Screen Shot 2022-02-15 at 10 25 41 PM

Perhaps you could try running on your system to verify with PyTorch 1.10?

This is the code:

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

@profile
def main():
    val_path = "/path/multicoil_val"
    mask_func = create_mask_for_mask_type(
        mask_type_str="random", center_fractions=[0.08], accelerations=[8]
    )

    sd = SliceDataset(
        root=val_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    )
    dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

    for e in tqdm(dl):
        del e
        pass

if __name__ == "__main__":
    main()

You can run with mprof run --include-children file.py.

Breeze-Zero commented 2 years ago

Hi @mmuckley, I copied the above code to my machine but modified the batch size. The figure below is the result of mprof run --include-children file.py. I didn't even finish an epoch before it broke off.

image

This is the code:

from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

@profile
def main():
    val_path = "/data2/fastmri/mnt/multicoil_val"
    mask_func = create_mask_for_mask_type(
        mask_type_str="random", center_fractions=[0.08], accelerations=[8]
    )

    sd = SliceDataset(
        root=val_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    )
    dl = DataLoader(sd, batch_size=4, shuffle=False, num_workers=10)

    for e in tqdm(sd):
        del e
        pass

if __name__ == "__main__":
    main()

Maybe it's the PyTorch version that's causing the problem. My PyTorch version is 1.8.1+cu111.

soumickmj commented 2 years ago

Sorry @mmuckley, I also got the same problem after running the memory profiler. I used two different versions of PyTorch. In contrast to @834799106, I am using more recent versions of PyTorch.

With PyTorch 1.10.2 py3.9_cuda11.3_cudnn8.2.0_0 I got:-

fastM11

With PyTorch 1.11.0.dev20220129 py3.9_cuda11.3_cudnn8.2.0_0 (pytorch-nightly), which I usually need for my work due to the features I use:-

fastM10

I did not run it to the very end, as memory was continuously increasing and would have crashed the server (which has 250 GB of RAM) again. So I don't feel it's related to the PyTorch version.

Just to let you know: the OS is Ubuntu 20.04.3 LTS and the Python version is 3.9.7.

soumickmj commented 2 years ago

sd = SliceDataset(
    root=val_path,
    challenge="multicoil",
    transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
)
dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)

for e in tqdm(sd):
    del e
    pass

Hi @mmuckley, in your code I just noticed that you are looping over the dataset directly, without the DataLoader, whereas in my code I'm looping over the DataLoader. Could you please try that as well? In my case, both show similar behaviour. I also tested with 0, 1, and 3 as the number of workers: all the same. I created another conda env with Python 3.7.11 and torch 1.8.2 and got similar behaviour as well.

Am I doing something wrong somewhere?

mmuckley commented 2 years ago

Hello @soumickmj, I copied the wrong code. The paste I showed was with the dataloader, not the dataset. This is what I get with dataset. You can see the max is about 720 MB.

Screen Shot 2022-02-16 at 9 01 48 AM
soumickmj commented 2 years ago

Hello @soumickmj, I copied the wrong code. The paste I showed was with the dataloader, not the dataset. This is what I get with dataset. You can see the max is about 720 MB.

Screen Shot 2022-02-16 at 9 01 48 AM

Ah okay, no problem! But still, in my case I'm getting this constant increase in memory usage, as you can see from the plots. Any suggestions?

mmuckley commented 2 years ago

One thing I notice is that you are both using Python 3.9. I could try Python 3.9 and check with that perhaps.

EDIT: Sorry, I see you also tried Python 3.7. Not sure what to do then...

mmuckley commented 2 years ago

Okay, I tested on Python 3.9 and I'm still getting the same behavior with batch_size=4.

Screen Shot 2022-02-16 at 9 35 09 AM

To help a bit more with this I'm including the complete conda environments that I used for my tests. I'm using a custom Linux distribution based on 5.4.0-81-generic x86_64 with an Intel Xeon E5-2698. @soumickmj @834799106, if either of you could try one of my conda environments, maybe we could figure out if it's one of the packages.

Python 3.8 environment Python 3.9 environment

soumickmj commented 2 years ago

Thanks @mmuckley, I have tested with your 3.9 conda env and it worked without a problem. Your yml was missing fastMRI, so I installed it using pip install git+https://github.com/facebookresearch/fastMRI.git

BS1_newEnv

Then I created a "bare minimum" env without using your env, and this resulted in the old issue again. For this env, I just installed PyTorch 1.10.2 with CUDA 11.3 and then the fastMRI repo directly from git, as for the other one. This conda environment doesn't contain anything that isn't required.

BS1_newEnvMin

I compared the versions. Initially, the version of numpy was different (1.21.2), so I switched to the one you have (1.20.3). Apart from all the extra packages in your env, I couldn't see any difference. Here is the yml file, zipped as GitHub wasn't allowing me to upload a yml here: fMMem.zip

Do you have any idea what might be the reason? Is there some additional package that it is not complaining about but that is still required, and whose absence causes this issue?

Breeze-Zero commented 2 years ago

I was in a similar situation. The problem was solved by installing your Py3.9 environment directly with conda. However, when I create a Py3.9 environment with conda normally and then pip install git+https://github.com/facebookresearch/fastMRI.git plus pandas (it's not in the fastMRI package), the problem remains.

mmuckley commented 2 years ago

So my install process is as follows in a few bash commands:

conda create -n memory_test_py39 python=3.9
conda activate memory_test_py39
conda install anaconda
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install -e .

Where the pip install -e . is in the fastMRI folder.

In that case I will try to reproduce now with your minimal environments.

mmuckley commented 2 years ago

@soumickmj @834799106 I can now reproduce this with the minimal install environment.

Screen Shot 2022-02-16 at 11 23 12 AM

Reproduction environment here: https://gist.github.com/mmuckley/838289a388bc65a7adb23d67908635c9.

soumickmj commented 2 years ago

@mmuckley This is really strange! That bare-minimum environment had almost nothing in it. While running the code, fastMRI did not throw any errors about missing packages. Still, we all got this weird behaviour. Do you have a hunch?

mmuckley commented 2 years ago

I do think it is related to SliceDataset itself as I see similar characteristics with VarNetDataTransform.

mmuckley commented 2 years ago

Actually I have to take that back. With the minimal environment and no transforms, I see no issue.

Screen Shot 2022-02-16 at 12 03 01 PM
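
For completeness, a sketch of the no-transform check (this assumes SliceDataset returns the raw (kspace, mask, target, attrs, fname, slice_num) tuple when transform=None, and iterates the dataset directly so the None entries don't need to be collated):

from tqdm import tqdm

from fastmri.data import SliceDataset


def iterate_raw_slices(data_path="/path/multicoil_val"):
    """Read every slice without any transform: no tensor conversion, no masking."""
    dataset = SliceDataset(root=data_path, challenge="multicoil", transform=None)
    for kspace, mask, target, attrs, fname, slice_num in tqdm(dataset):
        del kspace, mask, target, attrs, fname, slice_num
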
soumickmj commented 2 years ago

Aahahaha, yes, I can confirm that too. I tried with my old work environment, running PyTorch nightly (1.11dev2), and saw the same behaviour.

So the problem is with the data transforms and not with the SliceDataset!

soumickmj commented 2 years ago

I might have found the source: the conversion from a numpy array to a PyTorch tensor.

I did not test with your transforms, though, but with my own transform, which showed similar behaviour.

EDIT: Here's my code. For testing purposes, I returned directly after line 87 in both cases.

Breeze-Zero commented 2 years ago

I tried returning kspace_torch after each line of UnetDataTransform and found that kspace_torch = to_tensor(kspace) doesn't leak memory. After mask_func I started having problems, but when I set mask_func=None the problems disappeared. Then I added a return after image = fastmri.ifft2c(masked_kspace), and the problem appeared again. Therefore, this may not be a problem with a single statement. Due to the time difference, I have to rest and cannot investigate further for the time being.
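
A rough sketch of that bisection idea as a standalone transform (illustrative only: the StagedTransform class is not part of fastMRI, and the masking step is skipped here):

import fastmri
from fastmri.data.transforms import to_tensor


class StagedTransform:
    """Run only the first `stage` steps of a UNet-style transform.

    Profiling separate runs with stage=1, 2, 3 narrows down where memory
    starts to grow. The masking step is intentionally left out.
    """

    def __init__(self, stage):
        self.stage = stage

    def __call__(self, kspace, mask, target, attrs, fname, dataslice):
        kspace_torch = to_tensor(kspace)      # stage 1: numpy -> torch tensor
        if self.stage == 1:
            return kspace_torch
        image = fastmri.ifft2c(kspace_torch)  # stage 2: inverse FFT to image space
        if self.stage == 2:
            return image
        return fastmri.complex_abs(image)     # stage 3: magnitude image

Passing StagedTransform(stage) to the SliceDataset from the earlier scripts, one stage per profiling run, would show whether the growth starts at the tensor conversion or only once later steps are included.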

mmuckley commented 2 years ago

Okay I think I found the issue: it is due to the h5py from pip. It's a relatively recent issue that has been documented here:

https://forum.hdfgroup.org/t/h5py-memory-leak-in-combination-with-pytorch-dataset-with-multiple-workers/9114

If you install the minimal environment, but use the h5py from conda instead of pip then memory stays stable.
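
A quick way to check which build is installed (the HDF5 1.12 cutoff below is only a guess based on this thread; the known-good conda build reports HDF5 1.10.6):

import h5py

# Full build summary, including the HDF5 version the wheel was built against.
print(h5py.version.info)

# Heuristic: treat builds against HDF5 1.10.x as unaffected.
if h5py.version.hdf5_version_tuple >= (1, 12, 0):
    print(
        "h5py built against HDF5", h5py.version.hdf5_version,
        "- if memory grows with multi-worker DataLoaders, try the conda h5py."
    )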

soumickmj commented 2 years ago

Thanks @mmuckley, I also stumbled upon the same root cause for our problem. For me, it got solved by building and installing h5py from the git repo. I will check out your conda remedy now!

mmuckley commented 2 years ago

With the h5py from conda if I print(h5py.version.info) then I get the following:

Summary of the h5py configuration
---------------------------------

h5py    3.6.0
HDF5    1.10.6
Python  3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.21.2
cython (built with) 0.29.24
numpy (built against) 1.16.6
HDF5 (built against) 1.10.6

So the conda h5py was built against HDF5 1.10.6, which predates the issue.

soumickmj commented 2 years ago

Thanks @mmuckley, it works perfectly with the conda h5py. I tried it with my old env with PyTorch nightly!

Thanks for all your help!

PS: Maybe you can put a notice on the fastMRI homepage so people know about this, as the impact can be significant.

mmuckley commented 2 years ago

Okay I opened #217 to do this. Feel free to propose any changes.

For what it's worth, I did some small tests on adding extra copy commands to SliceDataset to get around the leak, but nothing I tried worked, so we may just wait for this to be fixed upstream.

soumickmj commented 2 years ago

Thanks! I will also explore possible ideas! If I find some fix, I will let you know :)

soumickmj commented 2 years ago

Hi @mmuckley, the issue resurfaced after the conda version of h5py got updated as well. This time (I don't know why or what mismatched!) I also had a problem with the git version. One possible workaround would be to use .copy() after every h5py read.

So basically, inside the __getitem__ function of mri_data.py, we need something like this:-

with h5py.File(fname, "r") as hf:
    kspace = hf["kspace"][dataslice].copy()
    mask = np.asarray(hf["mask"].copy()) if "mask" in hf else None
    target = hf[self.recons_key][dataslice].copy() if self.recons_key in hf else None

(Pull request #227)

Maybe it's dirty (not sure if it will have other implications, say in terms of speed), but for me it's working so far. Can you please have a look and let me know your thoughts :)

mmuckley commented 2 years ago

Hello @soumickmj I do not observe HDF5 being updated on conda, at least for Python version 3.8 or 3.9.

Screen Shot 2022-03-10 at 9 45 20 AM
Sarah-2021-scu commented 1 year ago

Hi @mmuckley

I am running the VarNet demo on a small set of brain MRI data, and I am getting the following error after 3-4 iterations of the first epoch:

RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 11.91 GiB total capacity; 10.40 GiB already allocated; 74.81 MiB free; 10.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Following is my h5py version:

h5py    3.7.0
HDF5    1.10.6
Python  3.9.16 (main, Mar 8 2023, 14:00:05)
[GCC 11.2.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.23.5
cython (built with) 0.29.30
numpy (built against) 1.16.6
HDF5 (built against) 1.10.6

I am also attaching my memory profiler plot.

mem_prof_fastmri_brain

Please tell me where I am going wrong. Thank you for your help!

mmuckley commented 1 year ago

Hello @Sarah-2021-scu, your memory usage is good. The error in your case is on the GPU, which is not related to this particular issue.

It looks like your GPU may just be too small. You could try running the model with a lower cascade count.
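
For reference, a sketch of what a smaller model could look like (parameter names as in fastmri.models.VarNet; the numbers are illustrative, not recommended settings):

from fastmri.models import VarNet

# Fewer cascades (and narrower U-Nets) reduce activation memory on the GPU.
model = VarNet(
    num_cascades=6,  # the model class defaults to 12
    chans=18,
    sens_chans=8,
)

If you are using the VarNet demo script, the same setting should be exposed as a model argument on the command line.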

Sarah-2021-scu commented 1 year ago

Thank you @mmuckley for your response. I am using 2 GPUs with 12 GB of memory each. I will lower the cascade count as well. The other options I have are:

  1. 1 GPU with 32 GB memory.
  2. 4 GPUs with 12 GB memory each.

Which would be the best option to choose from the above two?

hujb48 commented 9 months ago

Okay I opened #217 to do this. Feel free to propose any changes.

For what it's worth, I did some small tests on adding extra copy commands to SliceDataset to get around the leak, but nothing I tried worked, so we may just wait for this to be fixed upstream.

I got the same situation while training the U-Net baseline model. I followed issue #217: with pip uninstall h5py and conda install h5py==3.6.0, it now works without any problem. My env is Python 3.8.18 with torch 1.13.0 + CUDA 11.7.