Closed: billzhonggz closed this issue 2 years ago.
Hi, have you found a solution to your problem? I am facing similar issues related to multiprocessing: I am getting a Windows fatal exception: access violation.
I bypassed the issue by using a GPU with more VRAM. I tried to fix it before, but it was too much work.
Alright, if possible, can you share your data loading modules? My program loads the training data, but while precomputing the masks it throws an access violation error related to multiprocessing.
Here is the error I am getting
Hi, here is my error message:
2022-10-20 09:32:17,513 - Formatting dataset dicts takes 0.05 seconds
2022-10-20 09:32:17,514 - Dropped 0 scans. 86 scans remaining
2022-10-20 09:32:17,515 - Dropped references for 0/86 scans. 86 scans with reference remaining
2022-10-20 09:32:18,152 - Loading D:/files_recon_calib-24/annotations\val.json takes 0.00 seconds
2022-10-20 09:32:18,193 - Formatting dataset dicts takes 0.04 seconds
2022-10-20 09:32:18,193 - Dropped 0 scans. 33 scans remaining
2022-10-20 09:32:18,194 - Dropped references for 0/33 scans. 33 scans with reference remaining
Precomputing masks: 0%| | 0/1 [00:00<?, ?it/s]
Windows fatal exception: access violation | 1/12 [00:00<00:07, 1.45it/s]
Thread 0x00002bf4 (most recent call first):
File "C:\Users\Alou\anaconda3\envs\research\lib\threading.py", line 300 in wait
File "C:\Users\Alou\anaconda3\envs\research\lib\threading.py", line 552 in wait
File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\tqdm\_monitor.py", line 60 in run
File "C:\Users\Alou\anaconda3\envs\research\lib\threading.py", line 926 in _bootstrap_inner
File "C:\Users\Alou\anaconda3\envs\research\lib\threading.py", line 890 in _bootstrap
Current thread 0x00006714 (most recent call first):
File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\sigpy\mri\samp.py", line 66 in poisson
File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\meddlr\data\transforms\subsample.py", line 176 in call
File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\skm_tea\data\transform.py", line 527 in _precompute_mask
File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\skm_tea\data\transform.py", line 153 in precompute_masks
File "C:\Users\Alou\Downloads\MoDL\Data\data_module.py", line 173 in _make_eval_datasets
File "C:\Users\Alou\Downloads\MoDL\Data\data_module.py", line 68 in setup
File "C:\Users\Alou\anaconda3\envs\research\lib\site-packages\pytorch_lightning\core\datamodule.py", line 92 in wrapped_fn
File "data.py", line 44 in init
File "data.py", line 69 in
Thanks for the question @billzhonggz - apologies for the delay. I am working on a fix in both meddlr and skm-tea that will support multi-gpu training. More information in PR #22
I'll provide an updated command once that PR is merged in.
Thanks for the update! Let me close the issue for now and test it later. I will re-open the issue if I have any follow-up.
In case it's helpful - adding some pointers below:
# Update meddlr
pip install --upgrade meddlr
# Train with 2 gpus, 4 workers per process for data loading, training/test batch size of 2
python tools/train_net.py --debug --config-file <your-config-file> --num-gpus=2 DATALOADER.NUM_WORKERS 4 SOLVER.TRAIN_BATCH_SIZE 2 SOLVER.TEST_BATCH_SIZE 2
Some tips for smaller GPUs (a sketch applying these settings follows the list):
- Make sure your machine has enough RAM to support your dataloader
- If you run into dataloader issues, try changing the number of workers
- If the GPU is running out of space during validation/testing, set cfg.TEST.FLUSH_PERIOD = -1
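For reference, a minimal sketch of applying the same settings programmatically instead of via command-line overrides. It assumes meddlr exposes a detectron2/yacs-style config through get_cfg(); the import path and function name are assumptions, so check them against your installed meddlr version.
# Hedged sketch: programmatic equivalent of the command-line overrides above.
# Assumption: meddlr provides a detectron2/yacs-style get_cfg(); verify before use.
from meddlr.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("<your-config-file>")  # same file passed to --config-file
cfg.DATALOADER.NUM_WORKERS = 4             # lower this if dataloader workers crash
cfg.SOLVER.TRAIN_BATCH_SIZE = 2
cfg.SOLVER.TEST_BATCH_SIZE = 2
cfg.TEST.FLUSH_PERIOD = -1                 # flush eval outputs to avoid exhausting GPU memory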
Thanks, I'll check it out.
I tried to run the code with the tools/train_net.py script, using the config configs/baselines-neurips/dicom-track/seg/unet.yaml. Since I don't have a single GPU with 24GB+ VRAM, I tried to run the code on 2 GPUs with 12GB VRAM each. I set the argument --num-gpus=2 when starting the program, but I got an error about "process group not initialized" when the program reached build_sampler:
https://github.com/StanfordMIMI/skm-tea/blob/58ec1454c989c838a25956900252ea4b6eb383dd/skm_tea/data/data_module.py#L78
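For reference, a minimal standalone sketch of what this error means, using plain PyTorch only (this is not the actual skm-tea or meddlr launcher code): a DistributedSampler can only be constructed after torch.distributed.init_process_group() has been called in each process, which the launcher normally does before the data module builds its samplers.
import os
import torch
import torch.distributed as dist
from torch.utils.data import DistributedSampler, TensorDataset

# Hypothetical single-node defaults; in a real run the launcher (torchrun,
# PyTorch Lightning, etc.) sets these, they are not hand-coded like this.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
if not dist.is_initialized():
    dist.init_process_group(
        backend="gloo",  # gloo also works on Windows; nccl is for Linux + CUDA
        rank=int(os.environ.get("RANK", 0)),
        world_size=int(os.environ.get("WORLD_SIZE", 1)),
    )

dataset = TensorDataset(torch.arange(16).float())
# Without the init_process_group() call above, constructing this sampler fails with
# "Default process group has not been initialized".
sampler = DistributedSampler(dataset, shuffle=True)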
Here is the trace stack.
To be clear, I changed several lines of the source code to skip some exceptions that I believe come from API changes; they are about logging and profiling. I think those changes should not be related to multiprocessing.
My environment and configuration dump is,
And the full config from the log file is,
[09/19 15:07:13] skm_tea INFO: Running with full config:
I did some study of the source code of this repository and meddlr. To my understanding, the process group should be initialized in the skm_tea/engine/modules/base.py file, but I saw many TODOs in that file about multiprocessing. I am trying to fix the bug by adding init_process_group() to the file mentioned above. Would you also investigate the issue, or suggest an environment setup that is guaranteed to work?
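A hedged sketch of the kind of guard such a fix usually boils down to (build_sampler_for and the RandomSampler fallback below are illustrative, not the actual skm-tea or meddlr code): only use a distributed sampler when a process group actually exists, and fall back to a single-process sampler otherwise.
import torch.distributed as dist
from torch.utils.data import Dataset, DistributedSampler, RandomSampler, Sampler

def build_sampler_for(dataset: Dataset) -> Sampler:
    # Illustrative guard: shard the dataset across processes only if
    # torch.distributed has been initialized; otherwise sample locally.
    if dist.is_available() and dist.is_initialized():
        return DistributedSampler(dataset, shuffle=True)
    return RandomSampler(dataset)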