chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Fix ml.ExperimentDataPipe for Distributed Training #1119

Closed · JulesGM closed this 3 weeks ago

JulesGM commented 2 months ago

The current form of ml.ExperimentDataPipe breaks in distributed training when the samples aren't split between GPUs in a way that gives each GPU the same number of batches. Since the mainstream ways to train models are synchronous (like DDP), this causes some GPUs to wait at the barrier for a new epoch while others are still finishing the last batch of the previous epoch.

Our interpretation of the source of this problem is that chunks are created over the indices before the data is split between the GPUs. In this PR, the indices are first split between the GPUs, & the chunks are then created from each GPU's share. This reduces the discrepancy in the number of samples between GPUs to at most 1 sample.
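Roughly, the new order of operations looks like this (a minimal sketch with hypothetical names, not the actual code of the PR; obs_joinids, chunk_size, rank & world_size are assumed inputs):

```python
import numpy as np

def partition_then_chunk(obs_joinids: np.ndarray, chunk_size: int, rank: int, world_size: int):
    # Strided split of the global indices: per-rank sample counts differ by at most one.
    rank_joinids = obs_joinids[rank::world_size]
    # Chunks are built only from this rank's indices, so chunk boundaries can no
    # longer straddle the rank split.
    n_chunks = max(1, len(rank_joinids) // chunk_size)
    return np.array_split(rank_joinids, n_chunks)
```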

This would still break with a batch size of 1, however, because some GPUs would have one more batch than others. To compensate for this, we add a drop_last option that drops at most world_size - 1 samples so that each GPU has the same amount of data. If drop_last is false, the GPUs that are missing one data point pad their data with their own first data point. This takes inspiration from https://github.com/pytorch/pytorch/blob/main/torch/utils/data/distributed.py#L83 and https://github.com/huggingface/accelerate/blob/v0.28.0/src/accelerate/data_loader.py#L345 (which do something similar, but for batch sizes).
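The evening-out step, under the same assumed names (again a sketch, not the PR's code): with drop_last, at most world_size - 1 samples are dropped globally; otherwise the short ranks repeat their own first sample.

```python
import numpy as np

def even_out(rank_joinids: np.ndarray, total_samples: int, world_size: int, drop_last: bool) -> np.ndarray:
    if drop_last:
        # Every rank keeps floor(total / world_size) samples, so at most
        # world_size - 1 samples are dropped across all ranks combined.
        return rank_joinids[: total_samples // world_size]
    # Otherwise every rank is brought up to ceil(total / world_size) by repeating
    # its own first sample(s), so all ranks yield the same number of batches.
    target = -(-total_samples // world_size)
    if len(rank_joinids) < target:
        rank_joinids = np.append(rank_joinids, rank_joinids[: target - len(rank_joinids)])
    return rank_joinids
```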

A side effect of this is that the partition computation function becomes independent of the world size & rank, as it just works on its current GPU's ids (which have been computed specifically for that GPU earlier in the code).

This change seems to preserve the behavior of shuffle: the per-GPU chunks are still shuffled, & the per-GPU data is still shuffled, which makes both the composition of the batches & their order fairly random.

JulesGM commented 2 months ago

@atolopko-czi @ebezzi

ebezzi commented 2 months ago

Thanks @JulesGM for submitting this PR! We'll review it as soon as we can.

pablo-gar commented 2 months ago

Hi @JulesGM,

We've discussed this within the team. The overall feature is something we are keen to support in the near future.

Your PR may be sufficient on the implementation side, but we would need to add some testing around it as well. We first want to make sure we understand the changes you are proposing and decide whether this implementation is what we'd like to use; then we'd implement the testing.

If you are already using this code in a fork, something very useful for us would be any data from you indicating that the implementation is working as desired in your hands.

atolopko-czi commented 2 months ago

@JulesGM I believe I have a similar but alternate solution for the imbalanced partition problem: https://github.com/chanzuckerberg/cellxgene-census/pull/1135. Unit tests are included in test_distributed__returns_data_partition_for_rank, which now exercises the imbalanced cases as well. I've also added test_distributed__returns_data_partition_for_rank_globally_shuffled to demonstrate that global shuffling is maintained.

Unlike your PR, however, it does not introduce a drop_last option or support padding. Since the partition sizes can only differ by 1, I'm not sure the training time per partition would differ enough to justify adding this, but please correct me if that's not the case.

As it turns out, the previous implementation's "worst case" imbalances were actually much worse than I expected, but #1135 fixes that as well, of course.

Your PR, if I've interpreted it correctly, won't perform a global shuffling of the soma chunks, as it seems to be shuffling chunks only within each distributed partition. This would have a negative impact on the randomness of the data processed by each partition. In any case, the PR I've submitted does not suffer from this, as global chunk shuffling is maintained.

If you have time to give this a review, and possibly even try it out, your feedback would be greatly appreciated!

JulesGM commented 2 months ago

@atolopko-czi The number of batches cannot differ between GPUs. If it does, it breaks distributed training; it crashes, full stop. So if there's a difference of a single sample and the batch size is one, distributed training breaks. You absolutely need a mechanism to even out, at the very least, the number of batches.
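As a toy illustration with made-up numbers (not taken from the PR): every synchronous training step issues collectives on all ranks, so a rank with an extra batch ends up waiting on a collective the other ranks never join.

```python
# 7 samples, 2 GPUs, batch size 1 -> rank 0 has 4 batches, rank 1 has 3; rank 0's
# 4th collective has no counterpart on rank 1, so training stalls at that step.
samples, world_size, batch_size = 7, 2, 1
per_rank = [samples // world_size + (1 if r < samples % world_size else 0) for r in range(world_size)]
batches = [-(-n // batch_size) for n in per_rank]
print(per_rank, batches)  # [4, 3] [4, 3]
```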

JulesGM commented 2 months ago

With regards to randomness, one has to remember that moving data between GPUs doesn't affect the order in which the data is seen, as the gradients are averaged between the machines. Moving a sample from GPU 1 to GPU 7 doesn't matter; what matters is when the gradient from that data point is seen, e.g. whether it lands in global batch 1 or in global batch 18 (& the composition of the batches). This PR does achieve that.
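As a toy check of this point (made-up per-sample gradients, equal-sized per-rank shares as DDP assumes), the averaged gradient for a global step depends only on which samples are in that global batch, not on which rank holds each one:

```python
import numpy as np

g = {i: np.random.default_rng(i).normal(size=3) for i in range(4)}  # fake per-sample gradients
split_a = [[0, 1], [2, 3]]  # rank 0 holds samples 0 & 1, rank 1 holds 2 & 3
split_b = [[0, 2], [1, 3]]  # same global batch, samples assigned to ranks differently
global_grad = lambda split: np.mean([np.mean([g[i] for i in part], axis=0) for part in split], axis=0)
assert np.allclose(global_grad(split_a), global_grad(split_b))  # identical averaged gradient
```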

atolopko-czi commented 2 months ago

> @atolopko-czi The number of batches cannot differ between GPUs. If it does, it breaks distributed training; it crashes, full stop. So if there's a difference of a single sample and the batch size is one, distributed training breaks. You absolutely need a mechanism to even out, at the very least, the number of batches.

Understood, and thank you! We'll ensure the final solution provides this.

> With regards to randomness, one has to remember that moving data between GPUs doesn't affect the order in which the data is seen, as the gradients are averaged between the machines. Moving a sample from GPU 1 to GPU 7 doesn't matter; what matters is when the gradient from that data point is seen, e.g. whether it lands in global batch 1 or in global batch 18 (& the composition of the batches). This PR does achieve that.

In separate work, our team is addressing some issues with the existing ExperimentDataPipe shuffling algorithm, since it has shown poor results on training loss curves in at least one case. The takeaway from that work is that we believe Census data needs to be shuffled across the global input in order to properly mix rows from the individual datasets that comprise the Census. Consider that the Census has datasets that differ greatly in size (cell count), so if distributed partitioning is done before global shuffling, a single distributed partition could end up representing only one or a few datasets. So how individual rows are assigned to GPUs is a separate issue; would you agree? Let us know if you have further thoughts on this.
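For concreteness, a minimal sketch of the ordering we have in mind (hypothetical helper, not the actual implementation in either PR): shuffle the chunk ids globally with a seed shared by all ranks, and only then assign the shuffled chunks to ranks, so no partition is dominated by one or a few source datasets.

```python
import numpy as np

def globally_shuffled_chunks_for_rank(n_chunks: int, rank: int, world_size: int, seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)   # same seed on every rank -> same global permutation
    order = rng.permutation(n_chunks)   # global shuffle of chunk ids across all datasets
    return order[rank::world_size]      # only then split the shuffled order across ranks
```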

JulesGM commented 2 months ago

> so if distributed partitioning is done before global shuffling, a single distributed partition could end up representing only one or a few datasets

Isn't this more an artifact of the chunk abstraction than of shuffling after the per-GPU partitioning? The chunk abstraction severely limits the shuffling that can happen: it makes data in nearby batches related (from the same dataset).

My point in the previous post is that shuffling in parallel on each GPU's own data is extremely close to global shuffling: even when shuffling only per GPU, a point can end up in the first global batch, the last one, or any global batch in between. & whether a point sits on one GPU or another makes no difference from the point of view of the gradients, as they are averaged; the only thing that matters is which global batch it ends up in, & that still varies. A point is still correlated with its own chunk, but you can't get away from that.

pablo-gar commented 1 month ago

@JulesGM Thanks for all the information; we have been discussing this internally. We will make fixing the distributed-training behavior a priority soon after we fix other issues with the loaders.

We want to spend some cycles defining the best implementation, especially in the context of our shuffling approach. Once we have a desired path forward we'll assess how much we can reuse from this PR, so we will keep this open for now.

Thanks for the guidance.

github-actions[bot] commented 3 weeks ago

This PR has not seen any activity in the past 4 weeks; if no one comments or reviews it in the next 3 days, this PR will be closed.

github-actions[bot] commented 3 weeks ago

This PR was closed because it has been inactive for 31 days, 3 days since being marked as stale. Please re-open if you still need this to be addressed.