Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

epoch durations significantly increase over time when using cached datasets?! #3116

Open · neuronflow opened this issue 2 years ago

neuronflow commented 2 years ago

Describe the bug
On two servers I observe substantially increasing epoch durations when training with cached datasets. The effect is much more pronounced on server A (64 cores, no swap) than on server B (14 cores, now with 144 GB of swap memory).

The trainings start smoothly and then slow down after a while. Unfortunately, at the moment I can only access server B to provide logs; this should change in the next few days.

I tried several recent PyTorch and MONAI versions, as well as building PyTorch from source. I also tried calling torch.cuda.empty_cache() after every epoch.

Furthermore, I tried different numbers of workers, from 0 to max(cpu_cores) * 2, on each machine.
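
The relevant part of the loop looks roughly like this (a minimal sketch with dummy data, not my actual training script):

import torch
from monai.data import DataLoader, Dataset

# dummy stand-in data; the real training uses the cached MONAI datasets discussed below
train_ds = Dataset([{"images": torch.rand(1, 64, 64, 64), "label": torch.rand(1, 64, 64, 64)} for _ in range(8)])
train_loader = DataLoader(train_ds, batch_size=2, num_workers=4)  # num_workers swept from 0 to 2 * cpu_cores

for epoch in range(3):
    for batch in train_loader:
        pass  # forward / backward / optimizer step omitted
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # clearing the CUDA cache after each epoch did not help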

On both servers I notice more "red" kernel bars in htop (time spent waiting on system/kernel processes) when the servers slow down. Server A:

Screenshot 2021-10-12 at 16 59 52

server B:

Screenshot 2021-10-12 at 16 01 31

Also attached is a screenshot from server B showing the increased epoch durations (the spike every few epochs comes from an extended validation epoch). One can see the epoch times go up from ~350 s in the beginning to over 2000 s. Note that this effect is far more pronounced on server A (by orders of magnitude); I hope I can present some timings from training on server A, without the special validation epochs, soon.

Screenshot 2021-10-12 at 16 44 39

Shortly before epoch 80 I increased the swap file from 16 to 144 GB, which seems to have helped a bit.

Interestingly, this effect only seems to occur for big datasets. For smaller datasets everything seems fine; one can even see slightly decreasing epoch times, which I would attribute to the caching.

Screenshot 2021-10-12 at 16 43 33

Notably, nobody else was using the servers for my tests.

To Reproduce
For example, run https://github.com/Project-MONAI/tutorials/blob/master/3d_segmentation/challenge_baseline/run_net.py

Other cached datasets I tried: PersistentDataset, CacheDataset, SmartCacheDataset.

Expected behavior
Epoch times should remain constant, as they do for the training with the small dataset.

Environment
Ensuring you use the relevant python executable, please paste the output of:

python -c 'import monai; monai.config.print_debug_info()'
================================
Printing MONAI config...
================================
MONAI version: 0.6.0
Numpy version: 1.21.2
Pytorch version: 1.9.1+cu111
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 0ad9e73639e30f4f1af5a1f4a45da9cb09930179

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: 0.18.3
Pillow version: 8.3.1
Tensorboard version: 2.6.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.10.1+cu111
ITK version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.62.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.8.0
pandas version: 1.3.3
einops version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 20.04.3 LTS
Platform: Linux-5.4.0-88-generic-x86_64-with-glibc2.31
Processor: x86_64
Machine: x86_64
Python version: 3.9.7
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: [popenfile(path='/home/florian/.vscode-server/bin/ee8c7def80afc00dd6e593ef12f37756d8f504ea/vscode-remote-lock.florian.ee8c7def80afc00dd6e593ef12f37756d8f504ea', fd=99, position=0, mode='w', flags=32769)]
Num physical CPUs: 14
Num logical CPUs: 28
Num usable CPUs: 28
CPU usage (%): [11.5, 60.4, 100.0, 9.8, 100.0, 24.6, 100.0, 29.0, 8.8, 100.0, 100.0, 100.0, 100.0, 100.0, 11.5, 10.5, 8.7, 9.7, 7.8, 7.8, 8.7, 8.7, 6.9, 8.7, 8.7, 7.8, 8.7, 8.7]
CPU freq. (MHz): 2874
Load avg. in last 1, 5, 15 mins (%): [30.8, 27.3, 27.1]
Disk usage (%): 52.6
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 125.5
Available memory (GB): 14.4
Used memory (GB): 44.3

================================
Printing GPU config...
================================
Num GPUs: 2
Has CUDA: True
CUDA version: 11.1
cuDNN enabled: True
cuDNN version: 8005
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: Quadro RTX 8000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 72
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 7.5
GPU 1 Name: Quadro RTX 8000
GPU 1 Is integrated: False
GPU 1 Is multi GPU board: False
GPU 1 Multi processor count: 72
GPU 1 Total memory (GB): 47.5
GPU 1 CUDA capability (maj.min): 7.5
rijobro commented 2 years ago

cache_rate is 1.0 by default. Have you experimented with lower fractions to see if that has an impact? I wonder if the total memory of your loaded dataset is so big that it's causing problems.
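
Lowering it would look something like this (a sketch; train_files and transforms stand in for whatever you currently pass):

from monai.data import CacheDataset

# cache only 50% of the items in RAM (the default cache_rate=1.0 caches everything);
# the uncached items go through the full transform chain on every epoch
train_ds = CacheDataset(data=train_files, transform=transforms, cache_rate=0.5)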

neuronflow commented 2 years ago

Thanks.

I am working on the LiTS data at the moment: https://competitions.codalab.org/competitions/15595

The total training set is currently ~41 GB. I can experiment with lower values of cache_rate; which values should I try, and why should they matter?

I also thought about using https://docs.monai.io/en/latest/data.html#cachentransdataset to cache only the NIfTI reading, but the random part of the augmentation starts shortly afterwards anyway.
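
If I understand the linked docs correctly, that would look roughly like this (unverified sketch; the cache_dir path is arbitrary):

from monai.data import CacheNTransDataset

# cache only the result of the first transform (the NIfTI loading) to disk,
# so the random augmentations afterwards still run on every epoch
train_ds = CacheNTransDataset(
    data=train_files,
    transform=transforms,
    cache_n_trans=1,
    cache_dir="./cache",
)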

rijobro commented 2 years ago

~41 GB on disk or with the transforms applied? From the htop screenshot of server B, I wondered if you were running out of memory. Perhaps try a rather low value (e.g., 0.1); although it'll be slow, at least the epoch times should be constant. After that you can maybe increase it until you find the cut-off.

I'm surprised you were seeing the same problem with the PersistentDataset, though.

neuronflow commented 2 years ago

41 GB on disk, without transforms applied.

The training here just uses scaling (setting the CT window) and random flips for augmentation.

Here is a training with the plain Dataset class (red) vs. the SmartCacheDataset from above (blue):

Screenshot 2021-10-13 at 12 48 31

As mentioned before, the spikes every 10 epochs are my extended validation epochs.

I also experimented with the pin_memory setting in the data loader with inconclusive results.

The increased epoch durations around epoch 50 are probably due to a small side task running on the server (which did not happen during the blue training).

I am a bit puzzled; apparently caching adds no value at all for this training?

rijobro commented 2 years ago

Let's not complicate things with SmartCacheDataset just yet. I'm not sure how you've parameterised your SmartCacheDataset, but for example if replace_rate=1, all data is going to be refreshed every time, probably with extra overhead compared to Dataset. What happens if you use CacheDataset with cache_rate=0.1?
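
For reference, the two configurations differ roughly like this (a sketch; data and transforms as in your script):

from monai.data import CacheDataset, SmartCacheDataset

# replace_rate=1.0 means the whole cache is re-computed every epoch,
# which can cost more than just using a plain Dataset
smart_ds = SmartCacheDataset(data=data, transform=transforms, replace_rate=1.0)

# the simpler experiment: CacheDataset holding only 10% of the items in RAM
cache_ds = CacheDataset(data=data, transform=transforms, cache_rate=0.1)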

neuronflow commented 2 years ago

Thanks. I used replace_rate=0.25; I will run your suggested experiment and report back.

neuronflow commented 2 years ago

Now with cache_rate=0.1: I have only trained for a few epochs so far, but one can already observe a doubling of the epoch duration.

Screenshot 2021-10-17 at 19 13 32
neuronflow commented 2 years ago

training continued:

Screenshot 2021-10-18 at 00 51 04
neuronflow commented 2 years ago

This sounds potentially related: https://github.com/Project-MONAI/MONAI/issues/1589

Nic-Ma commented 2 years ago

Hi @neuronflow ,

That's an interesting issue and it seems hard to analyze. Is there a way to print the memory / CPU / GPU utilization curves in parallel? Then maybe we can see what's happening when your epoch time goes up. @ericspod do you have any related experience?

Thanks in advance.

rijobro commented 2 years ago

We should be able to rule the GPU out fairly easily by modifying the code to not copy to the GPU. i.e., just:

from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose

transforms = Compose(some_transforms)    # start with deterministic transforms only
ds = CacheDataset(big_data, transforms)  # big_data: the list of data dicts
dl = DataLoader(ds)
max_epoch = 1000
for i in range(max_epoch):
    for data in dl:
        pass                             # no model, no GPU copy

Initially, some_transforms could be all deterministic. If we don't see the slowdown, we could then modify it to be a mixture of deterministic and random transforms.

If we still don't see the slowdown, we can gradually increase the complexity further, e.g. by copying to the GPU.
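
For instance, the second step could swap some_transforms for a mixture like this (a sketch; RandFlipd is just one example of a random transform):

from monai.transforms import AddChanneld, LoadImaged, RandFlipd, ToTensord

some_transforms = [
    LoadImaged(keys=["image", "label"]),           # deterministic
    AddChanneld(keys=["image", "label"]),          # deterministic
    RandFlipd(keys=["image", "label"], prob=0.5),  # first random transform
    ToTensord(keys=["image", "label"]),
]

# a later step, only if the slowdown still doesn't appear, could copy each
# batch to the GPU inside the loop, e.g. data["image"].to("cuda")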

ericspod commented 2 years ago

@rijobro has suggested the right experiment: cut down on variables such as random transforms and the GPU. It'll be slow running on the CPU only, but that's fine. One other issue is that the OS is managing memory and may be making poor choices about paging or other factors that would impact memory usage, both main memory and GPU memory. The CUDA driver and the PyTorch layer may also be adding confounding factors, so narrowing it down to CPU only would help us observe what's going on.

neuronflow commented 2 years ago

> That's an interesting issue and seems hard to analyze. Is there a way to print the memory / CPU / GPU utilization curve in parallel? Then maybe we can get what's happening when your epoch time goes up?

I agree it's hard to narrow down. I am currently working on a shared server and there is some traffic. I might be able to provide this once there is no traffic again (as was the case when I reported the initial issue, based on an isolated workstation and the then-idle shared server).

> memory / CPU / GPU utilization curve in parallel?

Is there a preferred way of doing this, e.g. via TensorBoard?
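
One option would be something like this (a sketch, assuming psutil and TensorBoard, which are both installed; the tags and logging frequency are arbitrary):

import psutil
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./runs/resource_usage")

def log_resources(step: int) -> None:
    # system-wide CPU and RAM utilisation
    writer.add_scalar("cpu_percent", psutil.cpu_percent(), step)
    writer.add_scalar("ram_percent", psutil.virtual_memory().percent, step)
    if torch.cuda.is_available():
        # memory currently held by tensors on the default GPU, in GB
        writer.add_scalar("gpu_mem_gb", torch.cuda.memory_allocated() / 1024**3, step)

# call log_resources(step) once per batch or per epoch inside the training loop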

Meanwhile, I experimented a bit and commented out all the model-training-related code, just looping through the data loader:

for counter, data in enumerate(tqdm(train_loader), 0):
    inputs, labels = (
        data["images"].float().to(device),
        data["label"].float().to(device),
    )

with the transforms reduced to:

transforms = Compose(
    [
        LoadImageD(keys=["images", "label"]),
        AddChanneld(keys=["images"]),
        ToTensord(keys=["images", "label"]),
    ]
)

takes an awfully long time with both the plain Dataset and the cached variants. I believe one can also observe the duration increase for the cached variants (hard to tell with the current traffic).

Again, I am working on the LiTS dataset: https://competitions.codalab.org/competitions/15595

I also experimented with:

# torch.multiprocessing.set_sharing_strategy("file_system")

According to this link, it might lead to memory leaks: https://github.com/PyTorchLightning/pytorch-lightning/issues/2352

How would this setting interact with the cached datasets?

rijobro commented 2 years ago

The following code:

import time

from monai.data import CacheDataset, DataLoader
from tests.utils import create_test_image_3d, make_nifti_image
from monai.transforms import Compose, LoadImaged, AddChanneld, ToTensord

def main():
    im_shape = 100
    num_ims = 500
    image, label = (make_nifti_image(i) for i in create_test_image_3d(im_shape, im_shape, im_shape, noise_max=100))
    data = [{"image": image, "label": label} for _ in range(num_ims)]

    transforms = Compose([
        LoadImaged(keys=["image", "label"]),
        AddChanneld(keys=["image"]),
        ToTensord(keys=["image", "label"]),
    ])

    ds = CacheDataset(data, transforms)
    dl = DataLoader(ds, num_workers=5)
    num_epochs = 20
    times = []
    for i in range(num_epochs):
        t0 = time.time()
        for _ in dl:
            pass
        times.append(time.time() - t0)
        print(f"time for epoch {i} = {times[-1]}")

if __name__ == "__main__":
    main()

gave this output:

time for epoch 0 = 14.29433560371399
time for epoch 1 = 14.047232866287231
time for epoch 2 = 14.039271354675293
time for epoch 3 = 14.382421493530273
time for epoch 4 = 14.286617517471313
time for epoch 5 = 14.642267942428589
time for epoch 6 = 14.51481580734253
time for epoch 7 = 14.36440372467041
time for epoch 8 = 14.417043685913086
time for epoch 9 = 14.654301881790161
time for epoch 10 = 14.619521141052246
time for epoch 11 = 14.621773719787598
time for epoch 12 = 14.666800260543823
time for epoch 13 = 14.868067979812622
time for epoch 14 = 14.978628873825073
time for epoch 15 = 14.582119941711426
time for epoch 16 = 14.34918737411499
time for epoch 17 = 14.598304033279419
time for epoch 18 = 14.281646251678467
time for epoch 19 = 14.260899543762207

Which seems sufficiently consistent to me.

Do you think you'd be able to use it as a starting point and gradually increase the complexity / image sizes until you replicate your problem?

neuronflow commented 2 years ago

Thank you. Unfortunately, there is quite some traffic on the server now, so recording accurate timings is not possible. I will wait for a moment with no traffic and try to run the experiment and report back.

In the meantime, I started computations on another, non-shared machine and also observed a pronounced increase in epoch times. Here I could pinpoint the issue to the data loader's pin_memory parameter:

With pin_memory=True: a massive increase in epoch times; after 1-2 epochs, kernels become locked:

Screenshot 2021-10-25 at 10 22 52

With pin_memory=False, the training runs smoothly:

Screenshot 2021-10-25 at 21 43 18
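
The only change between the two runs is the pin_memory flag in the data loader, i.e. roughly (a sketch; the dataset and the other arguments stand in for whatever the training actually uses):

from monai.data import DataLoader

# pin_memory=True page-locks host RAM for the batches so it cannot be swapped out;
# with a large cached dataset this seems to starve the rest of the system
train_loader = DataLoader(train_ds, batch_size=2, num_workers=8, pin_memory=False)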

After reading this post and the linked blog post a while back, I had naively set this parameter to True for all of my trainings: https://discuss.pytorch.org/t/when-to-set-pin-memory-to-true/19723

Unfortunately, my knowledge of computing in general, and low-level computing in particular, is quite limited. A more knowledgeable friend hypothesized: "the true setting limits the scheduler's freedom in allocating RAM. This way the system becomes increasingly locked and busy with garbage-collection processes."

This theory would also explain why I only observe problems with bigger datasets, as the RAM becomes more and more occupied. Unlike the smaller machine above, the shared server has 512 GB of RAM, so it is much harder to jam it completely.

I then continued to optimize my epoch times by testing cached vs. non-cached datasets for this LiTS segmentation training. Again, in a few short test runs I observed no difference in training duration between the cached and non-cached variants.

Could you please explain the idea behind the cached datasets a bit? Do they depend on pin_memory being True to add value (as they seem to for my trainings on smaller datasets)?

I will experiment with this more thoroughly once my deadlines are over.

Thank you for your assistance so far! I really appreciate the effort.