byaman14 / SSDU


running train.py #3

Closed · ZhengguoTan closed 2 years ago

ZhengguoTan commented 2 years ago

Dear Dr. Yaman,

I managed to set up kspace_dir, coil_dir, and mask_dir, and started running train.py.

However, a problem seems to occur at https://github.com/byaman14/SSDU/blob/main/train.py#L136

Here is the terminal output I got:

```
...
Normalize the kspace to 0-1 region

size of kspace: (16, 640, 320, 20) , maps: (16, 640, 320, 20) , mask: (640, 320)

create training and loss masks and generate network inputs...

Iteration: 0

Gaussian selection is processing, rho = 0.40, center of kspace: center-kx: 319, center-ky: 160

size of ref kspace: (16, 20, 640, 320, 2) , nw_input: (16, 640, 320, 2) , maps: (16, 20, 640, 320) , mask: (16, 640, 320)
SSDU Parameters: Epochs: 100 , Batch Size: 1 , Number of trainable parameters: 592129
Training...
Traceback (most recent call last):
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[20,640,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node SSDUModel/Weights/mapCG_1/while/CGloop/CGIters/EhE/strided_slice_3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[{{node SSDUModel/Weights/mapCG_4/while/LoopCond}}]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 140, in <module>
    tmp, _, _ = sess.run([loss, update_ops, optimizer])
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[20,640,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node SSDUModel/Weights/mapCG_1/while/CGloop/CGIters/EhE/strided_slice_3 (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py:125) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[node SSDUModel/Weights/mapCG_4/while/LoopCond (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py:121) ]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'SSDUModel/Weights/mapCG_1/while/CGloop/CGIters/EhE/strided_slice_3', defined at:
  File "train.py", line 113, in <module>
    nw_output_img, nw_output_kspace, *_ = UnrollNet.UnrolledNet(nw_input_tensor, sens_maps_tensor, trn_mask_tensor, loss_mask_tensor).model
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/UnrollNet.py", line 43, in __init__
    self.model = self.Unrolled_SSDU()
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/UnrollNet.py", line 61, in Unrolled_SSDU
    x = ssdu_dc.dc_block(rhs, self.sens_maps, self.trn_mask, mu)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 121, in dc_block
    dc_block_output = tf.map_fn(cg_map_func, (rhs, sens_maps, mask), dtype=tf.float32, name='mapCG')
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 497, in map_fn
    maximum_iterations=n)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 486, in compute
    packed_fn_values = fn(packed_values)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 117, in cg_map_func
    cg_output = conj_grad(input_elems, mu)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 106, in conj_grad
    cg_out = tf.while_loop(cond, body, loop_vars, name='CGloop', parallel_iterations=1)[2]
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 91, in body
    Ap = Encoder.EhE_Op(p, mu)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 30, in EhE_Op
    kspace = tf_utils.tf_fftshift(tf.fft2d(tf_utils.tf_ifftshift(coil_imgs))) / self.scalar
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py", line 154, in tf_ifftshift
    return tf_ifftshift_flip2D(tf_ifftshift_flip2D(input_x, axes=1), axes=2)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py", line 125, in tf_ifftshift_flip2D
    second_half = tf.identity(input_data[:, :, ny:])
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 654, in _slice_helper
    name=name)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 820, in strided_slice
    shrink_axis_mask=shrink_axis_mask)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 9356, in strided_slice
    shrink_axis_mask=shrink_axis_mask, name=name)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[20,640,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node SSDUModel/Weights/mapCG_1/while/CGloop/CGIters/EhE/strided_slice_3 (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py:125) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[node SSDUModel/Weights/mapCG_4/while/LoopCond (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py:121) ]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
...
```

This reads quite complicated to me. Would you know what the problem might be here?

Thank you in advance for your help!

Best Regards, Zhengguo

byaman14 commented 2 years ago

Hi Zhengguo,

This issue occurs due to GPU memory limitations; the data you are trying to fit is very large. There are a couple of things you can do to resolve it:

1) Remove the oversampling in your data. Your dataset is oversampled along the kx direction, and removing oversampling is common practice in DL-based MRI reconstruction. In your case, after removing the oversampling, the k-space and sensitivity maps will have size nSlices x 320 x 320 x nCoils, and the mask will have size 320 x 320. I expect this to resolve the issue; see the sketch below.
2) Reduce the complexity of the network by decreasing the number of unrolled blocks as well as the number of residual blocks.
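For reference, here is a minimal sketch of item 1, removing the 2x readout oversampling before training. This is an illustration rather than code from the SSDU repo; the function name and the 640 -> 320 crop are assumptions based on the shapes printed in the log above:

```python
import numpy as np

def remove_readout_oversampling(kspace):
    """Remove 2x oversampling along the readout (kx) axis of k-space.

    kspace: complex array of shape (nSlices, nKx, nKy, nCoils),
            e.g. (16, 640, 320, 20) as in the log above.
    Returns k-space of shape (nSlices, nKx // 2, nKy, nCoils).
    """
    nx = kspace.shape[1]
    # Inverse FFT along the oversampled readout axis to get to image space.
    img = np.fft.fftshift(np.fft.ifft(np.fft.ifftshift(kspace, axes=1), axis=1), axes=1)
    # Keep the central half of the field of view along x (e.g. 640 -> 320).
    img = img[:, nx // 4:nx // 4 + nx // 2, ...]
    # FFT back to k-space along the same axis.
    return np.fft.fftshift(np.fft.fft(np.fft.ifftshift(img, axes=1), axis=1), axes=1)
```

The sensitivity maps are already in the image domain, so they can simply be cropped to the central 320 readout lines (e.g. `maps[:, 160:480, ...]`), and the 640 x 320 mask trimmed the same way along kx.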

ZhengguoTan commented 2 years ago

Dear Dr. Yaman,

I cut the data down to 2 coils and 2 slices. In addition, I allocated 8 GPUs in the sbatch job. Here is the output:

"... +-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 NVIDIA GeForce ... On | 00000000:1A:00.0 Off | N/A | | 30% 33C P8 14W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 1 NVIDIA GeForce ... On | 00000000:1B:00.0 Off | N/A | | 30% 33C P8 19W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 2 NVIDIA GeForce ... On | 00000000:3D:00.0 Off | N/A | | 30% 33C P8 14W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 3 NVIDIA GeForce ... On | 00000000:3E:00.0 Off | N/A | | 30% 32C P8 23W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 4 NVIDIA GeForce ... On | 00000000:B1:00.0 Off | N/A | | 30% 32C P8 22W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 5 NVIDIA GeForce ... On | 00000000:B2:00.0 Off | N/A | | 30% 33C P8 15W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 6 NVIDIA GeForce ... On | 00000000:DA:00.0 Off | N/A | | 30% 34C P8 14W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 7 NVIDIA GeForce ... On | 00000000:DB:00.0 Off | N/A | | 30% 33C P8 9W / 300W | 1MiB / 10240MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+

Finished TaskPrologue

2022-06-10 21:08:26.287009: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 288 Chunks of size 26214400 totalling 7.03GiB
2022-06-10 21:08:26.287014: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 9.09GiB
2022-06-10 21:08:26.287020: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:                  9765666816
InUse:                  9764954880
MaxInUse:               9764954880
NumAllocs:                    6449
MaxAllocSize:           1888026624

2022-06-10 21:08:26.287116: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****
2022-06-10 21:08:26.287139: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at strided_slice_op.cc:139 : Resource exhausted: OOM when allocating tensor with shape[2,320,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

create a test model for the testing

Test graph is generated and saved at: saved_models/SSDU_toy_100Epochs_Rate4_10Unrolls_GaussianSelection/model_test
.................SSDU Training.....................

Loading toy data, acc rate : 4 , mask type : Gaussian

kspace dir : /home/woody/iwbi/iwbi005h/fastMRI/data/multicoil_train/file_brain_AXFLAIR_200_6002425_unsamp_kdat.h5

coil dir : /home/woody/iwbi/iwbi005h/fastMRI/data/multicoil_train/file_brain_AXFLAIR_200_6002425_coil.h5

mask dir: /home/woody/iwbi/iwbi005h/fastMRI/data/multicoil_train/file_brain_AXFLAIR_200_6002425_unsamp_mask.h5

Normalize the kspace to 0-1 region

size of kspace: (2, 320, 320, 2) , maps: (2, 320, 320, 2) , mask: (320, 320)

create training and loss masks and generate network inputs...

Iteration: 0

Gaussian selection is processing, rho = 0.40, center of kspace: center-kx: 160, center-ky: 160

size of ref kspace: (2, 2, 320, 320, 2) , nw_input: (2, 320, 320, 2) , maps: (2, 2, 320, 320) , mask: (2, 320, 320)
SSDU Parameters: Epochs: 100 , Batch Size: 1 , Number of trainable parameters: 592129
Training...
Traceback (most recent call last):
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2,320,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node SSDUModel/Weights/mapCG_8/while/CGloop/CGIters/EhE/strided_slice_3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[{{node gradients/b_count_54}}]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 140, in <module>
    tmp, _, _ = sess.run([loss, update_ops, optimizer])
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2,320,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node SSDUModel/Weights/mapCG_8/while/CGloop/CGIters/EhE/strided_slice_3 (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py:125) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[node gradients/b_count_54 (defined at train.py:120) ]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'SSDUModel/Weights/mapCG_8/while/CGloop/CGIters/EhE/strided_slice_3', defined at:
  File "train.py", line 113, in <module>
    nw_output_img, nw_output_kspace, *_ = UnrollNet.UnrolledNet(nw_input_tensor, sens_maps_tensor, trn_mask_tensor, loss_mask_tensor).model
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/UnrollNet.py", line 43, in __init__
    self.model = self.Unrolled_SSDU()
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/UnrollNet.py", line 61, in Unrolled_SSDU
    x = ssdu_dc.dc_block(rhs, self.sens_maps, self.trn_mask, mu)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 121, in dc_block
    dc_block_output = tf.map_fn(cg_map_func, (rhs, sens_maps, mask), dtype=tf.float32, name='mapCG')
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 497, in map_fn
    maximum_iterations=n)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 486, in compute
    packed_fn_values = fn(packed_values)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 117, in cg_map_func
    cg_output = conj_grad(input_elems, mu)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 106, in conj_grad
    cg_out = tf.while_loop(cond, body, loop_vars, name='CGloop', parallel_iterations=1)[2]
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 91, in body
    Ap = Encoder.EhE_Op(p, mu)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/data_consistency.py", line 30, in EhE_Op
    kspace = tf_utils.tf_fftshift(tf.fft2d(tf_utils.tf_ifftshift(coil_imgs))) / self.scalar
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py", line 154, in tf_ifftshift
    return tf_ifftshift_flip2D(tf_ifftshift_flip2D(input_x, axes=1), axes=2)
  File "/home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py", line 125, in tf_ifftshift_flip2D
    second_half = tf.identity(input_data[:, :, ny:])
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 654, in _slice_helper
    name=name)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 820, in strided_slice
    shrink_axis_mask=shrink_axis_mask)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 9356, in strided_slice
    shrink_axis_mask=shrink_axis_mask, name=name)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/hpc/iwbi/iwbi005h/.conda/envs/ssdu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,320,160] and type complex64 on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node SSDUModel/Weights/mapCG_8/while/CGloop/CGIters/EhE/strided_slice_3 (defined at /home/hpc/iwbi/iwbi005h/Softwares/SSDU/tf_utils.py:125) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[node gradients/b_count_54 (defined at train.py:120) ]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

..."

I am a bit lost; I am not sure what the error is or how I should debug the TensorFlow code.

Best Regards, Zhengguo

byaman14 commented 2 years ago

Hi Zhengguo,

I think the issue is related to your GPUs. Note that the provided code is for a single GPU; if you would like to use a multi-GPU setting, you need to incorporate the relevant TensorFlow functions. For now, though, you can train on your dataset with a single GPU. To do so, please use one GPU that has enough memory. The training file, train.py, tries to use the first GPU on your machine (os.environ["CUDA_VISIBLE_DEVICES"] = "0"); you can adjust this to the GPU you are using. We performed our experiments on V100s with 32 GB of memory, whereas each of your GPUs appears to have around 10 GB. Could you please try a single GPU with enough memory? If you don't have GPUs of that size, you may need to decrease the number of unrolled blocks and residual blocks; a sketch follows below.
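For illustration, a minimal sketch of the single-GPU setup and of the debugging hint printed in the OOM message. The CUDA_VISIBLE_DEVICES line mirrors the one in train.py; tf.RunOptions(report_tensor_allocations_upon_oom=True) is standard TF 1.x, but exactly where you thread it into the sess.run call is your choice:

```python
import os

# Make only one (large-memory) GPU visible to TensorFlow. This must be set
# before TensorFlow is imported; train.py already sets it to "0" -- change
# the index to point at your V100 or other largest card.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# Optional: follow the hint in the OOM message so TensorFlow reports which
# tensors were live when the allocation failed.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
# In train.py this would become, e.g.:
#   tmp, _, _ = sess.run([loss, update_ops, optimizer], options=run_options)
```

If a large-memory GPU is not available, reducing the number of unrolled blocks and residual blocks shrinks the graph roughly proportionally, since each unroll replicates the CG data-consistency block visible in the traceback (these counts are set via the repo's parser_ops.py; please check the exact argument names in your copy).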

ZhengguoTan commented 2 years ago

Dear Dr. Yaman,

I switched to a V100 GPU, which solved my problem.

Thanks for your help!

Best Regards, Zhengguo