Closed: ZhengguoTan closed this issue 2 years ago
Hi Zhengguo,
This issue occurs due to GPU memory limitations. The data you are trying to fit is very large. There are several things you can do to resolve this issue.
1) Remove the oversampling in your data. Your dataset is oversampled along the kx (readout) direction, and removing readout oversampling is common practice in DL-based MRI reconstruction. In your case, after removing it, the k-space and sensitivity maps will have size nSlices x 320 x 320 x nCoils, and the mask will have size 320 x 320. I expect this to resolve the issue (one way to do the crop is sketched after this list).
2) Reduce the complexity of the network by decreasing the number of unrolled blocks and the number of residual blocks.
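As a reference for option 1, here is a minimal sketch of one common way to remove readout oversampling: transform to image space along kx, keep the central 320 samples, and transform back. This snippet is not part of the SSDU repository; the function name and the assumed array layout (nSlices x nKx x nKy x nCoils) are only illustrative, so adapt the axes to your own files. Sensitivity maps are already in the image domain, so for them a plain center crop along the same axis suffices.

```python
import numpy as np

def remove_readout_oversampling(kspace, target_nx=320):
    """Crop the oversampled readout (kx) axis of k-space down to target_nx points.

    Assumes kspace is complex with shape (nSlices, nKx, nKy, nCoils) and that
    kx is axis 1; adjust the axis if your files are laid out differently.
    """
    # k-space -> image space along the readout axis only
    img = np.fft.fftshift(np.fft.ifft(np.fft.ifftshift(kspace, axes=1), axis=1), axes=1)

    # keep the central target_nx samples (removes the oversampled margins)
    nx = img.shape[1]
    start = (nx - target_nx) // 2
    img = img[:, start:start + target_nx, ...]

    # image space -> k-space along the readout axis
    return np.fft.fftshift(np.fft.fft(np.fft.ifftshift(img, axes=1), axis=1), axes=1)

# Example: k-space of shape (16, 640, 320, 20) becomes (16, 320, 320, 20)
# kspace = remove_readout_oversampling(kspace)
```

For option 2, the block counts are exposed through the repository's argument parser; please check the parser file for the exact option names.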
Dear Dr. Yaman,
I reduced the data to 2 coils and 2 slices. In addition, I allocated 8 GPUs in the sbatch job. Here is the output:
"... +-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 NVIDIA GeForce ... On | 00000000:1A:00.0 Off | N/A | | 30% 33C P8 14W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 1 NVIDIA GeForce ... On | 00000000:1B:00.0 Off | N/A | | 30% 33C P8 19W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 2 NVIDIA GeForce ... On | 00000000:3D:00.0 Off | N/A | | 30% 33C P8 14W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 3 NVIDIA GeForce ... On | 00000000:3E:00.0 Off | N/A | | 30% 32C P8 23W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 4 NVIDIA GeForce ... On | 00000000:B1:00.0 Off | N/A | | 30% 32C P8 22W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 5 NVIDIA GeForce ... On | 00000000:B2:00.0 Off | N/A | | 30% 33C P8 15W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 6 NVIDIA GeForce ... On | 00000000:DA:00.0 Off | N/A | | 30% 34C P8 14W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 7 NVIDIA GeForce ... On | 00000000:DB:00.0 Off | N/A | | 30% 33C P8 9W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+
2022-06-10 21:08:26.287009: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 288 Chunks of size 26214400 totalling 7.03GiB
2022-06-10 21:08:26.287014: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 9.09GiB
2022-06-10 21:08:26.287020: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:        9765666816
InUse:        9764954880
MaxInUse:     9764954880
NumAllocs:    6449
MaxAllocSize: 1888026624

2022-06-10 21:08:26.287116: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****
2022-06-10 21:08:26.287139: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at strided_slice_op.cc:139 : Resource exhausted: OOM when allocating tensor with shape[2,320,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
create a test model for the testing
Test graph is generated and saved at: saved_models/SSDU_toy_100Epochs_Rate4_10Unrolls_GaussianSelection/model_test
.................SSDU Training.....................
Loading toy data, acc rate : 4 , mask type : Gaussian
kspace dir : /home/woody/iwbi/iwbi005h/fastMRI/data/multicoil_train/file_brain_AXFLAIR_200_6002425_unsamp_kdat.h5
coil dir : /home/woody/iwbi/iwbi005h/fastMRI/data/multicoil_train/file_brain_AXFLAIR_200_6002425_coil.h5
mask dir: /home/woody/iwbi/iwbi005h/fastMRI/data/multicoil_train/file_brain_AXFLAIR_200_6002425_unsamp_mask.h5
Normalize the kspace to 0-1 region
size of kspace: (2, 320, 320, 2) , maps: (2, 320, 320, 2) , mask: (320, 320)
create training and loss masks and generate network inputs...
Iteration: 0
Gaussian selection is processing, rho = 0.40, center of kspace: center-kx: 160, center-ky: 160
size of ref kspace: (2, 2, 320, 320, 2) , nw_input: (2, 320, 320, 2) , maps: (2, 2, 320, 320) , mask: (2, 320, 320)
SSDU Parameters: Epochs: 100 , Batch Size: 1 , Number of trainable parameters: 592129
Training...
Traceback (most recent call last):
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2,320,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node SSDUModel/Weights/mapCG_8/while/CGloop/CGIters/EhE/strided_slice_3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node gradients/b_count_54}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 140, in
[[node gradients/b_count_54 (defined at train.py:120) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'SSDUModel/Weights/mapCG_8/while/CGloop/CGIters/EhE/strided_slice_3', defined at:
File "train.py", line 113, in
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,320,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node SSDUModel/Weights/mapCG_8/while/CGloop/CGIters/EhE/strided_slice_3 (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py:125) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[node gradients/b_count_54 (defined at train.py:120) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
..."
I am a bit lost: I am not sure what the error is or how I should debug the TensorFlow code.
Best Regards, Zhengguo
Hi Zhengguo,
I think the issue is related to your GPUs. Note that the provided code is written for a single GPU; if you would like to use a multi-GPU setting, you need to incorporate the relevant TensorFlow functions yourself. For now, you can train on your dataset with a single GPU. To do so, please use one GPU with enough memory. The training file, train.py, tries to use the first GPU on your machine (os.environ["CUDA_VISIBLE_DEVICES"] = "0"); you can adjust this to the GPU you are using. We performed our experiments with V100s, which have 32 GB of memory, whereas each of your GPUs appears to have around 10 GB. Could you please try a single GPU with sufficient memory? If you don't have a GPU with that much memory, you may need to decrease the number of unrolled blocks and residual blocks. A short illustration follows.
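As an illustration only (the GPU index "3" is just an example, and the RunOptions part simply follows the hint printed in your OOM log; it is optional), pinning the process to one GPU looks like this:

```python
import os

# Make only one physical GPU visible to this process; do this before
# TensorFlow is imported/initialized. Replace "3" with the index of the
# GPU you actually want to use (check nvidia-smi).
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

import tensorflow as tf

# Optional: follow the hint in the OOM message so TF 1.x reports the live
# tensor allocations when an out-of-memory error occurs.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
# Inside the training loop in train.py you would then pass it to sess.run, e.g.:
# tmp, _, _ = sess.run([loss, update_ops, optimizer], options=run_options)
```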
Dear Dr. Yaman,
I switched to a V100 GPU, which solved my problem.
Thanks for your help!
Best Regards, Zhengguo
Dear Dr. Yaman,
I managed to set up kspace_dir, coil_dir, and mask_dir, and started to run train.py.
However, it seems a problem occurs at https://github.com/byaman14/SSDU/blob/main/train.py#L136
Here is the terminal output I got:
"...
Normalize the kspace to 0-1 region
size of kspace: (16, 640, 320, 20) , maps: (16, 640, 320, 20) , mask: (640, 320)
create training and loss masks and generate network inputs...
Iteration: 0
Gaussian selection is processing, rho = 0.40, center of kspace: center-kx: 319, center-ky: 160
size of ref kspace: (16, 20, 640, 320, 2) , nw_input: (16, 640, 320, 2) , maps: (16, 20, 640, 320) , mask: (16, 640, 320)
SSDU Parameters: Epochs: 100 , Batch Size: 1 , Number of trainable parameters: 592129
Training...
Traceback (most recent call last):
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[20,640,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node SSDUModel/Weights/mapCG_1/while/CGloop/CGIters/EhE/strided_slice_3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "train.py", line 140, in <module>
    tmp, _, _ = sess.run([loss, update_ops, optimizer])
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[20,640,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node SSDUModel/Weights/mapCG_1/while/CGloop/CGIters/EhE/strided_slice_3 (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py:125) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'SSDUModel/Weights/mapCG_1/while/CGloop/CGIters/EhE/strided_slice_3', defined at:
  File "train.py", line 113, in <module>
    nw_output_img, nw_output_kspace, *_ = UnrollNet.UnrolledNet(nw_input_tensor, sens_maps_tensor, trn_mask_tensor, loss_mask_tensor).model
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/UnrollNet.py", line 43, in __init__
self.model = self.Unrolled_SSDU()
File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/UnrollNet.py", line 61, in Unrolled_SSDU
x = ssdu_dc.dc_block(rhs, self.sens_maps, self.trn_mask, mu)
File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 121, in dc_block
dc_block_output = tf.map_fn(cg_map_func, (rhs, sens_maps, mask), dtype=tf.float32, name='mapCG')
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 497, in map_fn
maximum_iterations=n)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
return_same_structure)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
body_result = body(packed_vars_for_body)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in
body = lambda i, lv: (i + 1, orig_body( lv))
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 486, in compute
packed_fn_values = fn(packed_values)
File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 117, in cg_map_func
cg_output = conj_grad(input_elems, mu)
File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 106, in conj_grad
cg_out = tf.while_loop(cond, body, loop_vars, name='CGloop', parallel_iterations=1)[2]
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
return_same_structure)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
body_result = body(packed_vars_for_body)
File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 91, in body
Ap = Encoder.EhE_Op(p, mu)
File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 30, in EhE_Op
kspace = tf_utils.tf_fftshift(tf.fft2d(tf_utils.tf_ifftshift(coil_imgs))) / self.scalar
File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py", line 154, in tf_ifftshift
return tf_ifftshift_flip2D(tf_ifftshift_flip2D(input_x, axes=1), axes=2)
File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py", line 125, in tf_ifftshift_flip2D
second_half = tf.identity(input_data[:, :, ny:])
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 654, in _slice_helper
name=name)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 820, in strided_slice
shrink_axis_mask=shrink_axis_mask)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 9356, in strided_slice
shrink_axis_mask=shrink_axis_mask, name=name)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in init
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[20,640,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node SSDUModel/Weights/mapCG_1/while/CGloop/CGIters/EhE/strided_slice_3 (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py:125) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. ... "
This reads quite complicated to me. Would you know what the problem might be here?
Thank you in advance for your help!
Best Regards, Zhengguo