Closed rhan93 closed 2 years ago
Hi, @rhan93. Can you please provide the system info, as requested in the issue template?
Hello, I am currently experiencing the same problem regardless of the number of workers. I work on a cluster with a 64-core AMD CPU and an NVIDIA A100 40 GB GPU.
Working with a multimodal network, I have to create a separate queue for each dataset, and the code gets stuck on the last queue every time. That is, if I comment out the two validation queues, it gets stuck on the second training queue instead.
```python
import torch
import torchio as tio

# Create queues to draw patches from
print("Creating queue")
train_patches_queue_ct = tio.Queue(
    train_dataset_ct,
    max_length=40,
    samples_per_volume=5,
    sampler=sampler,
    num_workers=4,
    shuffle_subjects=False,
    shuffle_patches=False,
    verbose=True,
)
print("train_patches_queue_ct - DONE")

train_patches_queue_mr = tio.Queue(
    train_dataset_mr,
    max_length=40,
    samples_per_volume=5,
    sampler=sampler,
    num_workers=4,
    shuffle_subjects=False,
    shuffle_patches=False,
    verbose=True,
)
print("train_patches_queue_mr - DONE")

val_patches_queue_ct = tio.Queue(
    val_dataset_ct,
    max_length=40,
    samples_per_volume=5,
    sampler=sampler,
    num_workers=4,
    shuffle_subjects=False,
    shuffle_patches=False,
)
print("val_patches_queue_ct - DONE")

val_patches_queue_mr = tio.Queue(
    val_dataset_mr,
    max_length=40,
    samples_per_volume=5,
    sampler=sampler,
    num_workers=4,
    shuffle_subjects=False,
    shuffle_patches=False,
)
print("val_patches_queue_mr - DONE")

# Define train and val loaders
print("Define train and val loader")
batch_size = 1
train_loader_ct = torch.utils.data.DataLoader(train_patches_queue_ct, batch_size=batch_size, num_workers=0)
train_loader_mr = torch.utils.data.DataLoader(train_patches_queue_mr, batch_size=batch_size, num_workers=0)
val_loader_ct = torch.utils.data.DataLoader(val_patches_queue_ct, batch_size=batch_size, num_workers=0)
val_loader_mr = torch.utils.data.DataLoader(val_patches_queue_mr, batch_size=batch_size, num_workers=0)
```
This code produces the following output:

```
Creating subjects loader with 4 workers
train_patches_queue_ct - DONE
Creating subjects loader with 4 workers
train_patches_queue_mr - DONE
val_patches_queue_ct - DONE
```

Then the script never stops.
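For what it's worth, hangs like this when worker processes are involved are often tied to how subprocesses are started (fork vs. spawn) and to workers being created at import time without an entry-point guard. Below is a minimal sketch of the usual mitigation, using plain `multiprocessing` as a stand-in for the TorchIO workers; the `load_patch` function and pool size here are illustrative placeholders, not part of the TorchIO API:

```python
import multiprocessing as mp

def load_patch(i):
    # Illustrative stand-in for the per-subject work a queue worker would do.
    return i * i

def main():
    # "spawn" starts fresh interpreters instead of forking, which avoids a
    # class of deadlocks that can occur when forking a process that already
    # holds locks (e.g. from threading or CUDA initialization).
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        return pool.map(load_patch, range(8))

if __name__ == "__main__":
    # Guarding the entry point is required on spawn-based platforms and is
    # good practice whenever worker processes are created.
    print(main())
```

Whether this applies to the hang reported here is an assumption; it may be worth checking if moving the queue creation under an `if __name__ == "__main__":` guard changes the behavior.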
Hi, @NB-UCLouvain. Feel free to report your issue: https://github.com/fepegar/torchio/issues/new?assignees=&labels=&template=not_working.yml
Is there an existing issue for this?
Problem summary
I am running patch-based training using tio.Queue. When I set num_workers=0, everything works fine. However, with num_workers>1, it runs on one machine but fails on another machine with a different CPU: the script gets stuck at the tio.Queue line without stopping or printing an error message. Any ideas what is causing this behavior? Thank you.
I'm using Python 3.6.5, torch 1.10.2, and torchio 0.18.76.
Code for reproduction