MIC-DKFZ / nnUNet

Apache License 2.0
5.34k stars · 1.63k forks

Low GPU Utilization on RTX 4090 #2085

Open Num13er-XIII opened 2 months ago

Num13er-XIII commented 2 months ago

Hi Fabian,

I'm currently training nnU-Net on a public dataset of coronary artery CTAs, but I'm encountering an issue with GPU utilization. My system has an RTX 4090 GPU, a 12700K CPU, 64 GB of RAM, and a Gen4 SSD, and runs under WSL2.

Previously, I successfully trained nnU-Net on smaller datasets with full GPU utilization. However, with this larger dataset, my GPU shows 0% usage most of the time, spiking only to 100% briefly. Each epoch is taking about 800 seconds to complete, which seems excessively long.

Could this be related to WSL2 or some configuration within nnU-Net? I've attached screenshots of my GPU utilization graph and the training plan file. What might be causing these utilization issues, and how can I improve the training performance?

Thank you for your insights!

Configuration name: `3d_fullres`

```
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [96, 160, 160], 'median_image_size_in_voxels': [275.0, 509.0, 512.0], 'spacing': [0.5, 0.349609375, 0.349609375], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'num_pool_per_axis': [4, 5, 5], 'pool_op_kernel_sizes': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}
```

These are the global plans.json settings:

```
{'dataset_name': 'Dataset003_ImageCAS', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [0.5, 0.349609375, 0.349609375], 'original_median_shape_after_transp': [275, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 1.0, 'mean': 0.39980047941207886, 'median': 0.3513513505458832, 'min': 0.0, 'percentile_00_5': 0.0, 'percentile_99_5': 1.0, 'std': 0.22766844928264618}}}
```

*(Screenshots attached: GPU utilization graph and training plan file.)*
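A symptom like this (GPU idle most of the time, brief 100% spikes) usually points at the data pipeline rather than the model. A minimal sketch to test that hypothesis is to time raw sequential reads of the preprocessed files and compare the resulting throughput against what the training loop needs; the dataset path below is a hypothetical example, not part of nnU-Net:

```python
import os
import time


def measure_read_throughput(folder, limit=None):
    """Time sequential reads of every regular file in `folder` and
    return the achieved throughput in MB/s."""
    paths = [os.path.join(folder, f) for f in sorted(os.listdir(folder))]
    if limit is not None:
        paths = paths[:limit]
    total_bytes = 0
    start = time.perf_counter()
    for p in paths:
        if os.path.isfile(p):
            with open(p, "rb") as fh:
                total_bytes += len(fh.read())
    elapsed = time.perf_counter() - start
    return total_bytes / max(elapsed, 1e-9) / 1e6  # MB/s


if __name__ == "__main__":
    # Hypothetical path: a WSL2 mount of an NTFS drive (/mnt/...) is
    # typically far slower than files on the native ext4 filesystem.
    # print(measure_read_throughput("/mnt/d/nnUNet_preprocessed/Dataset003_ImageCAS"))
    pass
```

If the measured rate is far below the SSD's native speed, the bottleneck is the filesystem layer, not nnU-Net itself.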

Num13er-XIII commented 2 months ago

Also, I can fit a batch size of 5 on the GPU, but that increases the epoch time to 1800 seconds without improving GPU utilization.
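Epoch time scaling roughly linearly with batch size while utilization stays flat is another hint that the loader, not the GPU, sets the pace. nnU-Net v2 reads its background data-augmentation worker count from the `nnUNet_n_proc_DA` environment variable, so one hedged experiment (the value 16 is an assumption to tune for a 12700K, not a recommendation from the maintainers) is to raise it before launching training:

```python
import os
import subprocess

# nnU-Net v2 reads the number of background data-augmentation workers
# from the nnUNet_n_proc_DA environment variable. Raising it only helps
# if the storage can actually sustain the extra read rate.
env = dict(os.environ, nnUNet_n_proc_DA="16")

# Launch training with the modified environment (commented out so this
# sketch stays runnable without an nnU-Net install; dataset id 3 is the
# Dataset003_ImageCAS example from this thread):
# subprocess.run(["nnUNetv2_train", "3", "3d_fullres", "0"], env=env, check=True)
```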

wuhu0210 commented 2 months ago

Exactly the same problem... but it worked well and fast on Linux; maybe the source code has some problem when running on win32.

Num13er-XIII commented 2 months ago

Update:

Same situation on a Linux system with an A6000 Ada, but I should mention that the dataset is stored on an NTFS Windows partition.

Num13er-XIII commented 2 months ago

> Exactly the same problem...but it worked well and fast on linux, maybe source code has some problem working on win32.

I faced the same issue on a Linux system too, but my data is stored on a Windows partition on a local server.
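Since both reports involve data living on an NTFS Windows partition (accessed via drvfs under WSL2 or ntfs-3g under Linux), an obvious experiment is to copy the preprocessed data onto a native filesystem and repoint nnU-Net at it. A minimal sketch, assuming the standard `nnUNet_preprocessed` environment variable that nnU-Net v2 uses to locate preprocessed data (the source and destination paths are hypothetical):

```python
import os
import shutil

SRC = "/mnt/d/nnUNet_preprocessed"                 # hypothetical NTFS mount
DST = os.path.expanduser("~/nnUNet_preprocessed")  # native ext4 location


def migrate_preprocessed(src=SRC, dst=DST):
    """Copy the preprocessed data off the Windows partition, then point
    the nnUNet_preprocessed environment variable at the new location so
    subsequent training runs read from the native filesystem."""
    shutil.copytree(src, dst, dirs_exist_ok=True)
    os.environ["nnUNet_preprocessed"] = dst
    return dst
```

If epochs speed up dramatically after the move, the slowdown was the cross-filesystem I/O path, not nnU-Net.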

wuhu0210 commented 1 month ago

> Exactly the same problem...but it worked well and fast on linux, maybe source code has some problem working on win32.

> I faced same issue on a linux system too, but my data is stored on windows partition in a local server

I also found that it works well on a small dataset (about 8 GB after unpacking) on Windows, but is extremely slow on my whole dataset (about 340 GB after unpacking). I have not tried WSL2, but on a paid Ubuntu platform it works normally; I don't know why...
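One plausible explanation for the small-vs-large difference (an assumption, not something confirmed in this thread): an 8 GB dataset fits entirely in the OS page cache on a 64 GB machine, so after the first epoch every read is served from RAM, while a 340 GB dataset cannot be cached and every batch goes back to the slow disk path. A quick POSIX-only sanity check along those lines:

```python
import os


def dataset_fits_in_page_cache(dataset_bytes, headroom=0.8):
    """Rough check: could the OS page cache hold the whole unpacked
    dataset? Queries total RAM via POSIX sysconf (works on Linux,
    including WSL2), leaving some headroom for the training process."""
    total_ram = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return dataset_bytes < headroom * total_ram
```

On a 64 GB box this returns `True` for 8 GB and `False` for 340 GB, which would match the observed behavior.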