Closed: chrisrapson closed this issue 1 year ago
Please use persistent dataset instead of cache or smart cache... while training..
Thanks for the suggestion @SachidanandAlle, but it didn't work for me. Is radiology/main.py
the right place to make this change? I tried it like this, but training still used more than 100GB of RAM/swap.
# Train
app.train(
    request={
        "model": args.model,
        "max_epochs": 10,
        "dataset": "PersistentDataset",  # Dataset, PersistentDataset, CacheDataset
        "train_batch_size": 1,
        "val_batch_size": 1,
        "multi_gpu": False,
        "val_split": 0.1,
    },
)
I ask because making a change in main.py
means it will apply to all datasets, right? I would have thought it would be better to make this a dataset-specific setting, either in my segmentation_custom.py
or as a command line parameter?
And... while I was testing that, I also tried changing roi_size
to 32x32x32, but I still get the same GPU OOM error.
ok.. first thing, you have only 8GB of memory on each GPU. following are the questions.
It's definitely not about the ROI, as you have already tried smaller patches.. something is failing on the GPU before that step.. most likely one of the pre-transforms
Ah right, I probably should have thought to look more closely at the error messages. One line above the OOM message was this line:
.../venv/lib/python3.8/site-packages/monai/inferers/utils.py", line 219, in sliding_window_inference
output_image_list.append(torch.zeros(output_shape, dtype=compute_dtype, device=device))
I managed to follow that back to the inferer()
method in radiology/lib/infers/segmentation.py, and added a device argument to the call: SlidingWindowInferer(roi_size=self.roi_size, device=torch.device('cpu')).
Unfortunately, that only postponed the OOM error until the post transforms. Again following your advice, I read through the stack trace and discovered that it was the EnsureTyped
transform that was trying to load the full image back into GPU memory. I was able to modify that line, and now it runs :-D
EnsureTyped(keys="pred", device=torch.device('cpu') if data else None),
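For reference, sketched from memory (the exact method names and signatures here may differ between MONAI Label versions), the two edits look roughly like this:

import torch
from monai.inferers import SlidingWindowInferer
from monai.transforms import EnsureTyped

# in radiology/lib/infers/segmentation.py
def inferer(self, data=None):
    # stitch the sliding-window patches on the CPU so the full-size output
    # volume never has to fit into GPU memory
    return SlidingWindowInferer(roi_size=self.roi_size, device=torch.device("cpu"))

def post_transforms(self, data=None):
    return [
        # keep the aggregated prediction on the CPU instead of moving it back onto the GPU
        EnsureTyped(keys="pred", device=torch.device("cpu") if data else None),
        # ... remaining post transforms unchanged ...
    ]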
Thanks for your help!! This lets me run inference, and with my swap workaround I can run training too. Is it worth considering turning this into a feature request for more defensive programming? Perhaps using torch.device('cpu')
for these operations unless the user explicitly enables GPU, or if the image size is guaranteed to fit in GPU memory? For nnunet, there is a command line argument --all-in-gpu
that serves this purpose. Having it disabled by default removes one potential source of problems.
This is a good idea @chrisrapson. Would you mind creating a PR or opening an issue proposing this?
The objective of making all-in-GPU the default is to demonstrate the power of GPUs.. if you have fairly good GPUs, you can see the training cost is cut down a lot.. in many situations people want to see that happen by default, especially when they have bought great GPUs on AWS/cloud and want to see how fast they can achieve the performance. That's the main reason these configs default to running everything on the GPU.
And, more obviously, we like to promote GPU computation as part of the AI paradigm as much as possible. That is one of the main objectives too. I still run some basic things on my Windows laptop with a 6GB GPU, but only for small images/models etc.. otherwise life is more complicated if I try to fit everything into a small GPU.
Ok, I understand the trade-off between showcasing the speed of the GPU vs making the software more robust to larger images and larger datasets. That's a judgement call, so I'll leave it up to you. At a minimum, I'd suggest updating the documentation to highlight this potential issue and how to work around it.
Still, I'll list some arguments in favour of making all-in-gpu
an opt-in setting:
- If I had a big enough GPU, I would have chosen to run the deepedit model. I switched to the segmentation model specifically because I read that it works on patches and doesn't load the full image into GPU RAM. Assuming that I understood that correctly, would it make sense to keep the deepedit model all-in-gpu by default, and have the segmentation model only do the inference for each patch on the GPU?
- I don't think our images or our dataset are particularly large, so assume I'm not the only one who will hit this problem. If I was in your position as a dev, I would try to avoid being asked these questions repeatedly, so robustness on all kinds of hardware would be a priority for me, as well as providing instructions for how to get more performance on an opt-in basis.
- There are images out there (e.g. high resolution multi-modal MRIs) that will exceed the capacity of even SOTA GPUs, and/or I might achieve better accuracy with some other hyperparameter setting that uses lots of GPU RAM, e.g. increasing the patch size or using a transformer model. Increased accuracy will save me more time overall because I won't have to make as many manual changes.
Finally, I wonder if it would be worth considering a try-catch approach, where you try doing everything in GPU and fall back to CPU if necessary? Something like the sketch below.
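A minimal illustration of what I mean (purely a sketch, not existing MONAI Label API; the helper name and the error-string check are just for demonstration):

import torch

def run_with_cpu_fallback(fn, *args, **kwargs):
    # try the operation on the GPU first; if it runs out of memory,
    # free the partial allocations and retry the same call on the CPU
    try:
        return fn(*args, device=torch.device("cuda"), **kwargs)
    except RuntimeError as e:
        if "out of memory" not in str(e).lower():
            raise
        torch.cuda.empty_cache()
        return fn(*args, device=torch.device("cpu"), **kwargs)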
P.S. I still have the open question as to why changing to a PersistentDataset didn't reduce my RAM usage during training.
you can check the dump size of the PersistentDataset.. you will only notice a difference if the dump is a reasonable size.. the persistent cache is saved into your model/xyz/train_xy folder.. cache or .cache..
and you can also dig into the details to see how much is loaded into GPU vs CPU.. if your pre-transform output is cached after the data has been loaded onto the GPU, the corresponding tensor gets saved to disk.. some memory profilers can help you find out a bit more.
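e.g. a small standalone sketch (plain MONAI, outside MONAI Label; the file name and cache_dir are just examples) to see where the persistent cache lands and how big it is:

from pathlib import Path
from monai.data import PersistentDataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

transforms = Compose([LoadImaged(keys="image"), EnsureChannelFirstd(keys="image")])
ds = PersistentDataset(
    data=[{"image": "image_001.nii.gz"}],  # example image path
    transform=transforms,
    cache_dir="./persistent_cache",        # example cache location
)

_ = ds[0]  # first access runs the deterministic pre-transforms and writes the result to disk

cache_bytes = sum(p.stat().st_size for p in Path("./persistent_cache").rglob("*") if p.is_file())
print(f"persistent cache size: {cache_bytes / 1e6:.1f} MB")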
i understand the point about supporting gpu vs non-gpu enforcement in some of the examples. it can be a good config.. and for your segmentation_xxx model, you can do something like this..
def train_pre_transforms(self, context: Context):
    t = [
        LoadImaged(keys=("image", "label")),
        EnsureChannelFirstd(keys=("image", "label")),
        Spacingd(
            keys=("image", "label"),
            pixdim=(1.0, 1.0, 1.0),
            mode=("bilinear", "nearest"),
        ),
        ScaleIntensityRanged(keys="image", a_min=-57, a_max=164, b_min=0.0, b_max=1.0, clip=True),
        CropForegroundd(keys=("image", "label"), source_key="image"),
    ]
    # only move the cached tensors onto the GPU when the request asks for it
    if context.request.get("all-in-gpu", False):
        t.append(EnsureTyped(keys=("image", "label"), device=context.device))
    t.extend([
        RandCropByPosNegLabeld(
            keys=("image", "label"),
            label_key="label",
            spatial_size=(96, 96, 96),
            pos=1,
            neg=1,
            num_samples=4,
            image_key="image",
            image_threshold=0,
        ),
        RandShiftIntensityd(keys="image", offsets=0.1, prob=0.5),
        SelectItemsd(keys=("image", "label")),
    ])
    return t
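and then the same flag can be passed through the train request, so the pre-transforms above can read it from context.request (the "all-in-gpu" key is just the name used in the snippet above, not an existing option):

app.train(
    request={
        "model": args.model,
        "max_epochs": 10,
        "dataset": "PersistentDataset",
        "train_batch_size": 1,
        "val_batch_size": 1,
        "multi_gpu": False,
        "val_split": 0.1,
        "all-in-gpu": False,  # keep pre-transform tensors on the CPU
    },
)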
if you have suggestions to support this config for an example app, please feel free to create a PR so that users of both types can enjoy the benefit. maybe by default we keep it enabled, and if one wants to run on smaller GPUs, they can use an additional conf option (which will not push some pre-transforms etc. onto the GPU)
I am closing the issue as there has been no further update on this thread. feel free to reopen
Describe the bug
I have encountered two different situations where MONAI Label is using far more memory than I would expect. Are these user errors, or related to my dataset? Has MONAI Label been designed with scalability in mind?
In case it is relevant, my dataset has 12 foreground labels.
My workaround is to use swap, but obviously that's not ideal.
Run gives me another OOM error. I tried decreasing the roi_size for my model, but even at 64x64x64 I'm still exceeding the 8GB of GPU VRAM available:
For 128x128x128
For 96x96x96
For 64x64x64
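As a side note, a small generic PyTorch sketch (not MONAI Label specific; the helper name is illustrative) of how peak GPU memory can be measured around an inference call:

import torch

def report_peak_gpu_memory(run_inference, *args, **kwargs):
    # run_inference is any callable that executes the model on the GPU;
    # this wrapper reports the peak GPU memory the call allocated
    torch.cuda.reset_peak_memory_stats()
    result = run_inference(*args, **kwargs)
    peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print(f"peak GPU memory allocated: {peak_gb:.2f} GB")
    return result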
It might be expected behaviour for deepedit models to cause an OOM, since they run on the full image. However, I expected that a segmentation model would scale to arbitrarily sized images, because it analyses the image in patches. Have I misunderstood something? Or is the stitching of the patches also carried out in the GPU?
To Reproduce
Steps to reproduce the behavior:
1. Copy the radiology/lib/config/segmentation.py file (e.g. to segmentation_custom.py) and modify the foreground classes and roi_size.
2. Train.
3. Next Sample to get an unlabelled image and then Run to automatically generate labels.
Expected behavior
I expected to be able to train a network and run inference on a dataset with an arbitrary number of arbitrarily sized images.
I've used 128x128x128 patches with nnunet, and been able to run inference on GPUs with only 4GB of VRAM. I'm surprised that an 8GB GPU gets an OOM when trying to run the segmentation network with 64x64x64 patches.
8GB of GPU memory was enough to train the network, so I assumed it would also be enough to run inference.
Screenshots
N/A
Environment
Ensuring you use the relevant python executable, please paste the output of:
Additional context
N/A