NKI-AI / direct

Deep learning framework for MRI reconstruction
https://docs.aiforoncology.nl/direct
Apache License 2.0

How to debug code in this direct framework? #259

Closed string-ellipses closed 7 months ago

string-ellipses commented 7 months ago

I'm always frustrated when debugging code within this framework. For example, when I insert `import pdb; pdb.set_trace()` into recurrentvarnet.py to inspect some variables, I get the following error, which seems to come from a conflict between pdb and torch's multiprocessing module. I would like to know how to debug this multi-process code effectively. Here are the relevant error messages:


```
> /home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/nn/recurrentvarnet/recurrentvarnet.py(404)forward()
-> kspace_error = torch.where(
(Pdb) Traceback (most recent call last):
  File "/home/miniconda3/envs/score_SDE/bin/direct", line 33, in <module>
    sys.exit(load_entry_point('direct==1.0.5.dev0', 'console_scripts', 'direct')())
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/cli/__init__.py", line 33, in main
    args.subcommand(args)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/train.py", line 302, in train_from_argparse
    launch(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/launch.py", line 218, in launch
    launch_distributed(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/launch.py", line 89, in launch_distributed
    mp.spawn(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/launch.py", line 174, in _distributed_worker
    main_func(*args)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/train.py", line 270, in setup_train
    env.engine.train(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/engine.py", line 645, in train
    self.training_loop(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/engine.py", line 300, in training_loop
    iteration_output = self._do_iteration(data, loss_fns, regularizer_fns=regularizer_fns)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/nn/mri_models.py", line 129, in _do_iteration
    output_image, output_kspace = self.forward_function(data)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/nn/recurrentvarnet/recurrentvarnet_engine.py", line 39, in forward_function
    output_kspace = self.model(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/nn/recurrentvarnet/recurrentvarnet.py", line 300, in forward
    kspace_prediction, previous_state = block(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/nn/recurrentvarnet/recurrentvarnet.py", line 404, in forward
    kspace_error = torch.where(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/site-packages/direct-1.0.5.dev0-py3.9-linux-x86_64.egg/direct/nn/recurrentvarnet/recurrentvarnet.py", line 404, in forward
    kspace_error = torch.where(
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/miniconda3/envs/score_SDE/lib/python3.9/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
```


Looking forward to and appreciating your response!

string-ellipses commented 7 months ago

I just used remote-pdb instead of pdb, and the problem was solved.
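For reference, a sketch of the remote-pdb pattern (the `debug_port` helper and the `base + rank` port scheme are illustrative choices, not something this thread specifies). Each spawned worker process serves the debugger over its own TCP socket, so the breakpoint no longer depends on the child's stdin, which is exactly what breaks plain pdb under mp.spawn:

```python
# Sketch of the remote-pdb approach (pip install remote-pdb).
# Spawned worker processes have no usable stdin, so plain pdb dies with
# BdbQuit; remote-pdb serves the debugger over a TCP socket instead.

def debug_port(rank, base=4444):
    """Illustrative per-rank port scheme so that breakpoints hit in
    different worker processes do not collide on a single socket."""
    return base + rank

# Inside recurrentvarnet.py, instead of `import pdb; pdb.set_trace()`:
#
#     from remote_pdb import RemotePdb
#     RemotePdb("127.0.0.1", debug_port(rank)).set_trace()
#
# Then attach from another terminal, for example:
#
#     telnet 127.0.0.1 4444
```

With this pattern, each rank prints (or logs) which port it is listening on, and you attach only to the process you want to inspect.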

jonasteuwen commented 7 months ago

Hi @hannah-zhangzz,

Great that you have found the solution. When debugging, it is often also a good idea to set the number of workers to 0, and using a single GPU while you try to isolate a problem is probably also helpful.

In PyCharm or Visual Studio you can do remote debugging.
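A minimal sketch of why the single-process advice helps (illustrative only, not direct's actual launch.py): with one process you can call the worker function directly instead of going through mp.spawn, so pdb keeps the terminal's stdin and works as usual.

```python
# Illustrative sketch, not direct's actual launch code: bypass mp.spawn
# when there is only one process, so `import pdb; pdb.set_trace()` inside
# the worker talks to the terminal's stdin/stdout as normal.

def launch_for_debugging(worker_fn, nprocs):
    if nprocs == 1:
        # Run in the main process: breakpoints, prints, and exceptions
        # behave exactly as in ordinary single-process code.
        worker_fn(0)
    else:
        # Multi-process path: spawned children have no usable stdin,
        # which is why plain pdb raises BdbQuit there.
        import torch.multiprocessing as mp
        mp.spawn(worker_fn, nprocs=nprocs)
```

The same idea applies to data loading: with `num_workers=0` the dataset code runs in the main process, so a breakpoint inside `__getitem__` also works.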

string-ellipses commented 7 months ago


Thanks a lot for your advice. I will give it a try!