fepegar / torchio

Medical imaging toolkit for deep learning
https://torchio.org
Apache License 2.0

tio.Queue() got stuck when num_workers>1 #891

Closed rhan93 closed 2 years ago

rhan93 commented 2 years ago

Is there an existing issue for this?

Problem summary

I am running patch-based training using tio.Queue. When I set num_workers=0, everything works fine. However, with num_workers>1, the script runs on one machine but hangs on another machine with a different CPU: it gets stuck at the tio.Queue line without finishing or printing an error message. Any ideas what is causing this behavior? Thank you.

I'm using Python 3.6.5, torch 1.10.2, and torchio 0.18.76

Code for reproduction

patches_training_set = tio.Queue(
        subjects_dataset=dataset,
        max_length=args.max_queue_length,
        samples_per_volume=args.samples_per_volume,
        sampler=sampler,
        num_workers=args.num_workers,
        shuffle_subjects=True,
        shuffle_patches=True,
    )

loader = torch.utils.data.DataLoader(
        patches_training_set, batch_size=args.batch, num_workers=0)

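As a side note, hangs like this in multiprocessing code are often related to the default "fork" start method on Linux interacting badly with threaded native libraries, or to a missing `if __name__ == "__main__":` guard around the process-spawning code. This is an assumption about the cause, not a confirmed diagnosis; `build_queue` below is a hypothetical placeholder for the `tio.Queue(...)` call above:

```python
import multiprocessing as mp


def build_queue():
    # Hypothetical placeholder: the real tio.Queue(...) construction
    # from the report would go here.
    return "queue built"


if __name__ == "__main__":
    # Forcing "spawn" avoids fork-related deadlocks with threaded native
    # libraries; worker processes are then started from a clean interpreter.
    mp.set_start_method("spawn", force=True)
    print(build_queue())
```

With "spawn", everything reachable at module import time runs again in each worker, which is why the entry-point guard is required.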

### Actual outcome

getting stuck in patches_training_set = tio.Queue()

### Error messages

_No response_

### Expected outcome

N/A

### System info

_No response_
fepegar commented 2 years ago

Hi, @rhan93. Can you please provide the system info, as requested in the issue template?

NB-UCLouvain commented 2 years ago

Hello, I am currently experiencing the same problem regardless of the number of workers. I work on a cluster with a 64-core AMD CPU and an NVIDIA A100 40 GB GPU.

Working with a multimodal network, I have to create several queues, one per dataset, and the code gets stuck on the last queue every time. That is, if I comment out the two validation queues, it gets stuck on the second training queue.

# Creating queue to draw patches from
print("Creating queue")
train_patches_queue_ct = tio.Queue(
     train_dataset_ct,
     max_length=40,
     samples_per_volume=5,
     sampler=sampler,
     num_workers=4,
     shuffle_subjects=False,
     shuffle_patches=False,
     verbose=True
    )
print("train_patches_queue_ct - DONE")

train_patches_queue_mr = tio.Queue(
     train_dataset_mr,
     max_length=40,
     samples_per_volume=5,
     sampler=sampler,
     num_workers=4,
     shuffle_subjects=False,
     shuffle_patches=False,
     verbose=True
    )
print("train_patches_queue_mr - DONE")

val_patches_queue_ct = tio.Queue(
     val_dataset_ct,
     max_length=40,
     samples_per_volume=5,
     sampler=sampler,
     num_workers=4,
     shuffle_subjects=False,
     shuffle_patches=False
    )
print("val_patches_queue_ct - DONE")

val_patches_queue_mr = tio.Queue(
     val_dataset_mr,
     max_length=40,
     samples_per_volume=5,
     sampler=sampler,
     num_workers=4,
     shuffle_subjects=False,
     shuffle_patches=False
    )
print("val_patches_queue_mr - DONE")

# Define train and val loader
print("Define train and val loader")
batch_size = 1
train_loader_ct = torch.utils.data.DataLoader(train_patches_queue_ct, batch_size=batch_size, num_workers=0)
train_loader_mr = torch.utils.data.DataLoader(train_patches_queue_mr, batch_size=batch_size, num_workers=0)
val_loader_ct = torch.utils.data.DataLoader(val_patches_queue_ct, batch_size=batch_size, num_workers=0)
val_loader_mr = torch.utils.data.DataLoader(val_patches_queue_mr, batch_size=batch_size, num_workers=0)

I got from this code:

Creating subjects loader with 4 workers
train_patches_queue_ct - DONE

Creating subjects loader with 4 workers
train_patches_queue_mr - DONE
val_patches_queue_ct - DONE

Then the script never stops.
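When a script hangs silently like this, a stdlib-only way to find out where it blocks is `faulthandler`, which can periodically dump the stack of every thread. This is a generic debugging sketch, not something suggested in the thread; the timeout value is arbitrary:

```python
import faulthandler
import sys

# If the process is still running after 300 seconds, dump every thread's
# stack trace to stderr, and keep dumping every 300 s after that.
faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

# ... the queue-creation code that hangs would go here ...

# Once past the suspect section, cancel the pending dump.
faulthandler.cancel_dump_traceback_later()
```

The dumped traces show which line each worker and the main thread are blocked on, which helps distinguish a deadlocked queue from a worker that died silently.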

fepegar commented 2 years ago

Hi, @NB-UCLouvain. Feel free to report your issue: https://github.com/fepegar/torchio/issues/new?assignees=&labels=&template=not_working.yml