neuronflow opened this issue 2 years ago
cache_rate is 1.0 by default. Have you experimented with lower fractions to see if that has an impact? I wonder if the total memory of your loaded dataset is so big that it's causing problems.
Thanks.
I am working on the LiTS data at the moment: https://competitions.codalab.org/competitions/15595
The total training set is ~41 GB. I can experiment with lower values of cache_rate; which values should I try, and why should they matter?
I also thought about using https://docs.monai.io/en/latest/data.html#cachentransdataset to only cache the NIfTI reading, but the random part of the augmentation starts shortly afterwards anyway.
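The split CacheNTransDataset makes can be illustrated in plain Python (a toy sketch of the idea, not MONAI code; the helper names below are made up for illustration):

```python
import random

# Toy illustration: the deterministic prefix of the pipeline runs once and is
# cached, while the random suffix still runs every epoch.
def deterministic_load(x):
    # stands in for LoadImaged / AddChanneld (expensive, done once)
    return x * 10

def random_flip(x):
    # stands in for RandFlipd (cheap, done every epoch)
    return -x if random.random() < 0.5 else x

data = [1, 2, 3]
cache = [deterministic_load(x) for x in data]   # computed once, reused each epoch
for epoch in range(2):
    batch = [random_flip(x) for x in cache]     # only the random part repeats
print(cache)  # [10, 20, 30]
```

If loading dominates the per-item cost, caching that prefix should still pay off even though the random augmentation follows immediately.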
~41 GB on disk, or with the transforms applied? From the screenshot of your htop on server B, I wondered whether you were running out of memory. Perhaps try a rather low value (e.g., 0.1); although it'll be slow, at least the epoch times should be constant. After that, you can increase it until you find the cut-off.
I'm surprised you were seeing the same problem with PersistentDataset, though.
41 GB on disk, no transforms applied.
The training here just uses intensity scaling (setting a CT window) and random flips for augmentation.
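The "CT window" scaling mentioned above can be sketched in plain Python (a toy version of intensity windowing; the window bounds here are made-up example values, not the ones used in this training):

```python
# Clip Hounsfield units to a window and rescale to [0, 1].
# a_min / a_max are illustrative soft-tissue-ish values, not from this thread.
def ct_window(hu, a_min=-200.0, a_max=250.0):
    clipped = min(max(hu, a_min), a_max)
    return (clipped - a_min) / (a_max - a_min)

print(ct_window(-1000))  # 0.0 (air, clipped to the bottom of the window)
print(ct_window(250))    # 1.0 (top of the window)
```

In MONAI this corresponds roughly to a deterministic intensity transform, so it is cacheable; only the random flips need to rerun each epoch.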
Here is a training with the normal Dataset class (red) vs. the SmartCacheDataset (blue) from above:
As mentioned before, the spikes every 10 epochs are my extended validation epochs.
I also experimented with the pin_memory setting in the data loader, with inconclusive results.
The increased epoch durations around epoch 50 are probably due to a small side task running on the server (which did not happen during the blue training).
I am a bit puzzled, apparently caching brings no added value at all for this training?
Let's not complicate things with SmartCacheDataset just yet. I'm not sure how you've parameterised your SmartCacheDataset, but, for example, if replace_rate=1, all data is going to be refreshed every time, probably with extra overhead compared to Dataset. What happens if you use CacheDataset with cache_rate=0.1?
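For reference, a pure-Python sketch of how cache_rate maps to a number of cached items (a simplification of what CacheDataset does internally, not the actual MONAI implementation; the 131-volume count is the usual LiTS training set size and is an assumption here):

```python
# Simplified mapping from cache_rate to the number of items held in RAM,
# mirroring a min(len * rate, len) rule.
def cached_items(dataset_len: int, cache_rate: float) -> int:
    return min(int(dataset_len * cache_rate), dataset_len)

print(cached_items(131, 1.0))   # 131 -> the whole dataset preloaded into RAM
print(cached_items(131, 0.1))   # 13  -> ~10% cached, the rest loaded on the fly
```

So cache_rate=0.1 keeps memory pressure low at the cost of re-loading most items every epoch, which is why it is a useful diagnostic setting.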
Thanks, I used replace_rate=0.25; I will do your suggested experiment and report back.
Now with cache_rate=0.1: I have only trained a few epochs so far, but one can already observe a doubling in epoch duration.
training continued:
This sounds potentially related: https://github.com/Project-MONAI/MONAI/issues/1589
Hi @neuronflow ,
That's an interesting issue and seems hard to analyze. Is there a way to print the memory / CPU / GPU utilization curve in parallel? Then maybe we can get what's happening when your epoch time goes up? @ericspod do you have any related experience?
Thanks in advance.
We should be able to rule the GPU out fairly easily by modifying the code not to copy to the GPU, i.e., just:

from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose

transforms = Compose(some_transforms)
ds = CacheDataset(big_data, transforms)
dl = DataLoader(ds)

max_epoch = 1000
for i in range(max_epoch):
    for data in dl:
        pass
Initially, some_transforms could be all deterministic. If we don't see the slowdown, we could then modify it to be a mixture of deterministic and random transforms.
If we still don't see the slowdown, then we can gradually increase the complexity, e.g., by copying to the GPU.
@rijobro has suggested the right experiment to do: cut down on variables such as random transforms and the GPU. It'll be slow under CPU only, but that's fine. One other issue is that the OS is managing memory and may be making poor choices about paging or other factors that would impact memory usage, both main memory and GPU memory. The CUDA drivers and the PyTorch layer may also be adding confounding factors, so narrowing it down to CPU only would help observe what's going on.
That's an interesting issue and seems hard to analyze.
Is there a way to print the memory / CPU / GPU utilization curve in parallel? Then maybe we can get what's happening when your epoch time goes up?
I agree it's hard to narrow it down. I am currently working on a shared server and there is some traffic. I might be able to provide this once there is no traffic again (like when I reported the initial issue that was based on an isolated workstation and the shared server with no traffic).
memory / CPU / GPU utilization curve in parallel?
Any preferred way of doing this, e.g., via TensorBoard?
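One lightweight stdlib option (a sketch, not a MONAI utility): log peak resident memory next to each epoch time and compare the two curves afterwards.

```python
import resource
import time

def log_epoch_stats(epoch: int, t0: float) -> float:
    """Print epoch duration plus peak RSS so memory growth shows up in the log.

    Note: ru_maxrss is reported in kilobytes on Linux and bytes on macOS.
    """
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    duration = time.time() - t0
    print(f"epoch {epoch}: {duration:.1f}s, peak RSS {peak_rss}")
    return duration
```

Called at the end of every epoch loop, this would show whether peak RSS climbs in step with the epoch times; GPU utilization would still need nvidia-smi or similar alongside.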
Meanwhile, I experimented a bit and commented out everything related to model training. Just looping through the dataloader:

for counter, data in enumerate(tqdm(train_loader, 0)):
    inputs, labels = (
        data["images"].float().to(device),
        data["label"].float().to(device),
    )
with the transforms reduced to:

transforms = Compose(
    [
        LoadImageD(keys=["images", "label"]),
        AddChanneld(keys=["images"]),
        ToTensord(keys=["images", "label"]),
    ]
)
takes an awfully long time with both the normal Dataset and the cached variants. I believe one can also observe the duration increase for the cached variants (hard to tell with the current traffic).
Again, I am working on the LiTS dataset: https://competitions.codalab.org/competitions/15595
I also experimented with:
# torch.multiprocessing.set_sharing_strategy("file_system")
According to this link, it might lead to memory leaks: https://github.com/PyTorchLightning/pytorch-lightning/issues/2352
How would this parameter interact with the cached datasets?
The following code:

import time

from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, LoadImaged, AddChanneld, ToTensord
from tests.utils import create_test_image_3d, make_nifti_image


def main():
    im_shape = 100
    num_ims = 500
    image, label = (make_nifti_image(i) for i in create_test_image_3d(im_shape, im_shape, im_shape, noise_max=100))
    data = [{"image": image, "label": label} for _ in range(num_ims)]
    transforms = Compose([
        LoadImaged(keys=["image", "label"]),
        AddChanneld(keys=["image"]),
        ToTensord(keys=["image", "label"]),
    ])
    ds = CacheDataset(data, transforms)
    dl = DataLoader(ds, num_workers=5)

    num_epochs = 20
    times = []
    for i in range(num_epochs):
        t0 = time.time()
        for _ in dl:
            pass
        times.append(time.time() - t0)
        print(f"time for epoch {i} = {times[-1]}")


if __name__ == "__main__":
    main()
Gave this as output:
time for epoch 0 = 14.29433560371399
time for epoch 1 = 14.047232866287231
time for epoch 2 = 14.039271354675293
time for epoch 3 = 14.382421493530273
time for epoch 4 = 14.286617517471313
time for epoch 5 = 14.642267942428589
time for epoch 6 = 14.51481580734253
time for epoch 7 = 14.36440372467041
time for epoch 8 = 14.417043685913086
time for epoch 9 = 14.654301881790161
time for epoch 10 = 14.619521141052246
time for epoch 11 = 14.621773719787598
time for epoch 12 = 14.666800260543823
time for epoch 13 = 14.868067979812622
time for epoch 14 = 14.978628873825073
time for epoch 15 = 14.582119941711426
time for epoch 16 = 14.34918737411499
time for epoch 17 = 14.598304033279419
time for epoch 18 = 14.281646251678467
time for epoch 19 = 14.260899543762207
Which seems sufficiently consistent to me.
Do you think you'd be able to use it as a starting point and gradually increase complexity/image sizes until you replicate your problem?
Thank you. Unfortunately, there is quite some traffic on the server now, so recording accurate timings is not possible. I will wait for a moment with no traffic and try to run the experiment and report back.
In the meantime, I started computations on another non-shared machine and also observed a pronounced increase in epoch times. Here I could pinpoint the issue to the data loader's pin_memory parameter:
With the True setting, there is a massive increase in epoch times; after 1-2 epochs, kernels become locked.
With the False setting, the training runs smoothly.
After reading this post and the linked blog post a while back, I naively set this parameter to True for all of my trainings: https://discuss.pytorch.org/t/when-to-set-pin-memory-to-true/19723
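For anyone reproducing this, the toggle itself is just a DataLoader flag; a minimal plain-PyTorch sketch with tiny synthetic tensors (no MONAI needed):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# pin_memory=True makes the loader copy batches into page-locked host RAM so
# host-to-device transfers can run asynchronously; that pinned memory cannot
# be paged out, which is the suspected problem with large cached datasets.
ds = TensorDataset(torch.randn(8, 3))
for pin in (False, True):
    dl = DataLoader(ds, batch_size=4, pin_memory=pin)
    batch = next(iter(dl))[0]
    print(f"pin_memory={pin}, batch shape {tuple(batch.shape)}")
```

Without a CUDA device, PyTorch only warns and skips the pinning, so the flag is harmless on CPU-only machines.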
Unfortunately, my knowledge of computing in general, and low-level computing in particular, is quite limited. A more knowledgeable friend hypothesized: "the True setting limits the scheduler's freedom in allocating RAM. This way the system becomes increasingly locked and busy with garbage collection processes."
This theory would also explain why I only observe problems with bigger datasets, as the whole RAM becomes more and more occupied. Unlike the smaller machine above, the shared server has 512 GB of RAM, so it is much harder to jam it completely.
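The theory is easy to sanity-check with back-of-the-envelope numbers (only the 41 GB figure is from this thread; the 2x float32 blow-up and the 64 GB RAM for the smaller machine are guesses):

```python
disk_gb = 41                 # reported LiTS training set size on disk
float32_blowup = 2           # guess: int16 NIfTI voxels become float32 tensors
cached_gb = disk_gb * float32_blowup

for ram_gb, name in [(64, "smaller machine (assumed 64 GB RAM)"),
                     (512, "shared server (512 GB RAM)")]:
    frac = cached_gb / ram_gb
    print(f"{name}: fully cached dataset would take ~{frac:.0%} of RAM")
```

Under these assumptions the cache would dominate the smaller machine's RAM but occupy only a small fraction of the 512 GB server, consistent with the observed behaviour.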
I then continued to optimize my epoch times by testing cached vs. non-cached data loaders for this LiTS segmentation training. I again observed no difference in training duration between cached and non-cached data loaders in a few short test runs.
Could you please explain the idea behind the cached data loaders a bit? Do they depend on the pin_memory parameter being True to create value (like they do for my trainings on smaller datasets)?
I will experiment on this more thoroughly once my deadlines are over.
Thank you for your assistance so far! I really appreciate the effort.
Describe the bug On two servers I observe substantially increasing epoch durations when training with cached datasets. The effect is much more pronounced on server A (64 cores) with no swap than on server B (14 cores), which now has 144 GB of swap memory.
The trainings start smoothly and then slow down after a while. Unfortunately, at the moment I can only access server B to provide logs. This should change in the next few days.
I tried several recent PyTorch and MONAI versions, as well as building PyTorch from source. Furthermore, I tried using:

torch.cuda.empty_cache()

after every epoch. I also tried different values for the number of workers, from 0 to max(cpu_cores) * 2, for each machine.
On both servers, when they slow down, I notice more "red kernels" in htop, indicating time spent waiting on system processes: server A:
server B:
Also attached is a screenshot from server B with the increased epoch durations (the spike every few epochs is from an extended validation epoch). One can see the epoch times go up from ~350s in the beginning to over 2000s. Note that this effect is much more pronounced on server A (orders of magnitude); I hope I can present some timings from training on server A without special validation epochs soon.
Shortly before epoch 80, I increased the swap file from 16 to 144 GB, which seems to have helped a bit.
Interestingly, this effect seems to occur only for big datasets. For smaller datasets everything seems okay; one can even see slightly decreasing epoch times, which I attribute to the caching.
Notably, nobody else was using the servers for my tests.
To Reproduce for example run: https://github.com/Project-MONAI/tutorials/blob/master/3d_segmentation/challenge_baseline/run_net.py
Other cached datasets I tried: PersistentDataset, CacheDataset, SmartCacheDataset
Expected behavior Epoch times should remain constant, as they do for the training with the small dataset.
Environment Ensuring you use the relevant python executable, please paste the output of: