🚀 Feature
DDP splits the work across GPUs, so the data should also be split per GPU. The per-GPU split is already implemented, but the total length reported by the queue should be divided per GPU as well (halved for two GPUs).
Motivation
Halving the data length is not supported in any version, so the iteration is duplicated. The duplication may make training converge faster, but it is not the intended behaviour. Also, when a plain dataloader is used on its own, it already halves the total data length per GPU.
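For reference, a plain DataLoader combined with a DistributedSampler already behaves this way; a minimal sketch with a made-up dataset size and two ranks:
# Minimal sketch (not part of the proposal): a plain DataLoader with a
# DistributedSampler already yields roughly N // K items per rank.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100))  # N = 100 samples
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=1, sampler=sampler)

print(len(sampler))  # 50 -> each rank sees N // K samples
print(len(loader))   # 50 -> the iteration length is halved per rank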
Pitch
Using the subject_sampler, iterations_per_epoch and num_subjects will be computed from the split data, so that the queue iterates without any duplication.
Additional context
The subject_sampler of the queue provides the per-GPU lists of indices into which the data is split (a short sketch of how such lists are produced follows the example below), e.g.:
subject_sampler for GPU 0 = [0, 3, 4, 7, 12, 13, 15, 16, 18, 25, ...]
subject_sampler for GPU 1 = [1, 2, 5, 6, 8, 9, 10, 11, 14, ...]
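For illustration, lists like the ones above are exactly what torch.utils.data.DistributedSampler produces per rank; a minimal sketch (the dataset size and resulting indices are hypothetical):
# Illustration only: producing per-GPU index lists with DistributedSampler.
from torch.utils.data.distributed import DistributedSampler

num_subjects = 26  # hypothetical dataset size
subjects = list(range(num_subjects))  # any object with __len__ works

for rank in (0, 1):
    sampler = DistributedSampler(subjects, num_replicas=2, rank=rank, shuffle=True, seed=42)
    sampler.set_epoch(0)
    print(f'GPU {rank}: {list(sampler)}  (len = {len(sampler)})')

# Each rank receives a disjoint half of the indices, so len(sampler) == 13,
# not 26 -- this is the length the queue should iterate over per rank.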
With the current version, the queue serves data by rank, but with the same total length as the original dataset, e.g.:
subject_sampler for GPU 0 = [0, 3, 4, 7, 12, 13, 15, 16, 18, 25, ..., 0, 3, 4, 7, 12, 13, 15, 16, 18, 25, ...]
subject_sampler for GPU 1 = [1, 2, 5, 6, 8, 9, 10, 11, 14, ..., 1, 2, 5, 6, 8, 9, 10, 11, 14, ...]
Since subject_sampler has already split the data correctly per GPU, this list can be used to build a subjects_dataset view of halved length. The code modification would be as follows:
# the length of subjects will be halved if subject_sampler exists
@property
def num_subjects(self) -> int:
    if self.subject_sampler is not None:  # this if statement is added
        return len(self.subject_sampler)
    return len(self.subjects_dataset)

# the subjects_dataset iteration will be selected using the subject_sampler list passed in
@property
def iterations_per_epoch(self) -> int:
    if self.subject_sampler is not None:  # this subjects_dataset selection is added
        subjects_dataset = [self.subjects_dataset.dry_iter()[i] for i in self.subject_sampler]
    else:
        subjects_dataset = self.subjects_dataset.dry_iter()
    total_num_patches = sum(
        self._get_subject_num_samples(subject)
        for subject in subjects_dataset  # iterate over the local list, not the global dataset
    )
    return total_num_patches
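For context, a rough usage sketch of how a queue with a subject_sampler would be set up and what the two properties would report after the change. The patch size, queue length, and subject count are arbitrary placeholders, and the printed values assume the proposed modification:
# Rough usage sketch (assumes the proposed num_subjects / iterations_per_epoch change).
import torch
import torchio as tio
from torch.utils.data.distributed import DistributedSampler

subjects = [
    tio.Subject(image=tio.ScalarImage(tensor=torch.rand(1, 32, 32, 32)))
    for _ in range(8)
]
subjects_dataset = tio.SubjectsDataset(subjects)

subject_sampler = DistributedSampler(subjects_dataset, num_replicas=2, rank=0, shuffle=True)

queue = tio.Queue(
    subjects_dataset,
    max_length=16,
    samples_per_volume=4,
    sampler=tio.UniformSampler(patch_size=8),
    subject_sampler=subject_sampler,
    shuffle_subjects=False,  # shuffling is delegated to the DistributedSampler
    num_workers=0,
)

# With the proposed change these reflect the per-rank split (8 // 2 = 4 subjects),
# so one epoch no longer visits the rank's subjects twice.
print(queue.num_subjects)          # 4 instead of 8
print(queue.iterations_per_epoch)  # 4 * 4 = 16 instead of 32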
I have tested this myself with TorchIO version 0.19.1, and it halved the total data.
This is the original version:
And below is the modified version:
[EDIT]
I think the explanation was not detailed enough, so I am adding more :D
When a subject_sampler is passed to the queue, a subjects iterable is produced by _get_subjects_iterable(). This iterable is loaded and consumed in _fill(), which keeps going until it reaches the length of self.subjects_dataset.
With DDP, however, the data length on each GPU should be divided by the number of GPUs in use. The data for each GPU is already divided efficiently by the DistributedSampler that is passed as subject_sampler, so the total length used in _fill() has to equal the length of the DistributedSampler.
In other words: the current version of the queue loads N subjects and divides the data across K GPUs, yet the queue on each rank still iterates until len(self.subjects_dataset), which is N.
But the queue with the modified code iterates until len(DistributedSampler), which is N // K.
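To make the arithmetic concrete, a small worked example with made-up numbers:
# Worked example with made-up numbers (not taken from an actual run).
N = 100                  # total number of subjects in the dataset
K = 2                    # number of GPUs / DDP processes
samples_per_volume = 4   # patches drawn from each subject

# Current behaviour: each rank iterates until len(self.subjects_dataset) == N,
# so the rank's N // K subjects are each visited K times per epoch.
current_patches_per_rank = N * samples_per_volume           # 400

# Proposed behaviour: each rank iterates until len(DistributedSampler) == N // K.
proposed_patches_per_rank = (N // K) * samples_per_volume   # 200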
You can check my pull request: https://github.com/fepegar/torchio/pull/1127