fepegar / torchio

Medical imaging toolkit for deep learning
https://torchio.org
Apache License 2.0

DataLoader is very slow when using SubjectsDataset #941

Open ivezakis opened 2 years ago

ivezakis commented 2 years ago

Is there an existing issue for this?

Problem summary

When using SubjectsDataset with a PyTorch DataLoader, iterating over the DataLoader is incredibly slow, which naturally slows training down as well. Iterating over the SubjectsDataset directly, however, is significantly faster.

In my experience, iteration over the SubjectsDataset starts within a few seconds (<10), while the DataLoader takes more than a minute to yield its first batch.

Code for reproduction

import json

import torch
import torchio as tio

# Each entry in some_data.json has "image" and "label" paths.
with open("some_data.json") as f:
    jdata = json.load(f)

subjects = tio.SubjectsDataset(
    [
        tio.Subject(
            img=tio.ScalarImage(subject["image"]),
            labels=tio.LabelMap(subject["label"]),
        )
        for subject in jdata
    ]
)

# Iterating over the dataset directly: starts within seconds.
for sample in subjects:
    print(sample)
    break

# Iterating over the DataLoader: takes more than a minute to start.
loader = torch.utils.data.DataLoader(subjects, batch_size=8, shuffle=True)
for sample in loader:
    print(sample)
    break

Actual outcome

Iterating over loader is much slower than iterating over subjects.

Error messages

No response

Expected outcome

Performance should be similar.

System info

pytorch 1.12.0
torchio 0.18.83

On a machine with Ubuntu 20.04, NVMe SSD
fepegar commented 2 years ago

Hi, @ivezakis. You are using one process to load 8 images, so it will be about 8 times slower. This is expected. To make it faster, you should use a num_workers value larger than 1.
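
For reference, a minimal sketch of what that looks like, reusing the subjects dataset built in the snippet above (num_workers=4 is just an example value to tune per machine):

import torch
import torchio as tio

# Spread image loading across worker processes instead of the single main process.
# num_workers=4 is an arbitrary example; tune it for your machine.
loader = torch.utils.data.DataLoader(
    subjects,  # the tio.SubjectsDataset from the snippet above
    batch_size=8,
    shuffle=True,
    num_workers=4,
)

for batch in loader:
    # Images are collated under their Subject keys; tio.DATA is the 'data' entry.
    print(batch["img"][tio.DATA].shape)
    break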

ivezakis commented 2 years ago

Hi @fepegar, in fact I am using the maximum number of workers for my machine in the DataLoader, num_workers=12. Sorry, that wasn't reflected in the code I provided.

Please consider re-opening this. The difference is rather large in my experience: for a batch size of 8, it is over 40 times. Picture attached.

[Screenshot attached showing the timing difference between the SubjectsDataset and the DataLoader]

Edit: I also tried it with batch size 1; it's 6.8 seconds vs 3.6 seconds.

QingYunA commented 1 year ago

Yes, I have met the same problem. It is very, very slow (at least 30 times slower than the actual model training time), but I don't have a good way to resolve it. Have you found any good method?

romainVala commented 1 year ago

Hi, could it be that you are asking for too many workers? Setting num_workers equal to the number of cores may be too much (overloading can really decrease performance). Can you try different num_workers values (1/2 or 1/4 of your total core count) and report whether you get the same difference? (Do not forget, as fepegar said, the two are equivalent when time_dataloader = batch_size * time_dataset, because iterating over the dataset is effectively a batch size of 1.)
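
One way to check this, sketched below assuming subjects and the batch size match the original snippet (the worker counts tried are just examples), is to time the first few batches for several num_workers values:

import time

import torch

# Rough benchmark: time the first three batches for different worker counts.
for num_workers in (0, 2, 4, 6, 12):
    loader = torch.utils.data.DataLoader(
        subjects, batch_size=8, shuffle=True, num_workers=num_workers
    )
    start = time.time()
    for i, batch in enumerate(loader):
        if i == 2:  # stop after the first three batches
            break
    print(f"num_workers={num_workers}: {time.time() - start:.1f} s")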

fepegar commented 1 year ago

@ivezakis, @QingYunA

Can you please provide a minimal, reproducible example?

fepegar commented 1 year ago

@romainVala I've also noticed that behavior. For example, on a DGX with 40 cores, my code was fastest using only 12.

QingYunA commented 1 year ago

> Hi, could it be that you are asking for too many workers? [...] Can you try different num_workers values (1/2 or 1/4 of your total core count) and report whether you get the same difference?

Yes, after I increased the num_workers of the Queue (to 16), preparing the dataloader got much faster. By the way, I found that the transforms I use also influence the speed: when I remove RandomAffine(degrees=20), the loading time is reduced by half.
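
For reference, this is roughly how the workers are passed to the patch queue; the patch size, queue length, samples per volume, and num_workers=16 are example settings, and subjects is assumed to be a tio.SubjectsDataset (optionally with transforms such as RandomAffine) as in the original snippet:

import torch
import torchio as tio

patch_size = 64  # example value
sampler = tio.UniformSampler(patch_size)

queue = tio.Queue(
    subjects,               # tio.SubjectsDataset, possibly with transforms attached
    max_length=300,         # example maximum number of patches stored in the queue
    samples_per_volume=10,  # example number of patches extracted per volume
    sampler=sampler,
    num_workers=16,         # workers that load and transform volumes in the background
)

# The queue does the multiprocessing itself, so the DataLoader must keep num_workers=0.
patches_loader = torch.utils.data.DataLoader(queue, batch_size=8, num_workers=0)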