a-r-r-o-w / cogvideox-factory

Memory optimized finetuning scripts for CogVideoX using TorchAO and DeepSpeed
Apache License 2.0

Fast Dataloader #77

Open alfredplpl opened 2 weeks ago

alfredplpl commented 2 weeks ago

Feature request / 功能建议

The current Dataloader implementation in this repository underperforms because data preprocessing is not parallelized: the CPU prepares samples sequentially, so the GPU sits idle while it waits for the next batch. This proposal aims to overlap CPU preprocessing with GPU compute so that data loading speed no longer limits overall training efficiency.
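
As a rough illustration of the direction I have in mind (just a sketch, not tied to the actual dataset classes in this repository; the VideoClipDataset name, the placeholder tensor shape, and the worker/prefetch settings are all assumptions to be tuned), moving per-sample preprocessing into DataLoader worker processes lets the CPU prepare upcoming batches while the GPU trains on the current one:

import torch
from torch.utils.data import Dataset, DataLoader

class VideoClipDataset(Dataset):
    """Placeholder dataset: per-sample CPU work (decode, resize, tensorize) would live here."""

    def __init__(self, num_samples=1024, frames=8, height=64, width=64):
        self.num_samples = num_samples
        self.shape = (frames, 3, height, width)  # small placeholder shape

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # A real dataset would decode and preprocess a video clip on the CPU here.
        return torch.randn(self.shape)

loader = DataLoader(
    VideoClipDataset(),
    batch_size=1,
    shuffle=True,
    num_workers=8,            # preprocess samples in parallel worker processes
    pin_memory=True,          # enables faster, asynchronous host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2,        # each worker keeps two batches ready ahead of time
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    batch = batch.to(device, non_blocking=True)  # overlap the copy with compute
    ...  # training step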

Motivation / 动机

The main bottleneck is the Dataloader's sequential data processing on the CPU, which leaves the GPU idle and significantly extends training duration, especially on the large datasets typical of video generation models. Addressing this bottleneck would allow the GPU to be utilized more fully, improving training speed and resource efficiency.

Your contribution / 您的贡献

I am open to contributing to this optimization, either through a pull request or by sharing additional code references and resources. If maintainers support this initiative, I am available to assist in the implementation or testing phase.

a-r-r-o-w commented 2 weeks ago

This is indeed an issue at the moment and leads to a longer duration per step, which I've noticed in the traces too. I'm definitely open to improvements, and your help would be really valuable because we plan to add more training scripts for different models like Mochi, Allegro, and upcoming video models, all of which could benefit from speedups in data loading and from making things more modular for re-use.

Would you like to open a first draft pull request with modifications to the existing data loading scripts? We can start iterating on it together. Ideally, if the required modifications are extensive, the current implementations can stay as they are and we can add new dataloader classes that are parallel and faster. Let us know your thoughts, and thank you for the suggestion!

alfredplpl commented 2 weeks ago

This process might depend on the tokenizer, so a separate preprocessing step for the tokenizer might be necessary.
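
For example (only a sketch of what I mean: the captions list is hypothetical, and I'm assuming the usual diffusers-style layout where the T5 tokenizer used by CogVideoX sits in a "tokenizer" subfolder), the captions could be tokenized once up front and stored, so the per-step data loading only indexes into precomputed tensors instead of re-running the tokenizer:

import torch
from transformers import AutoTokenizer

# Hypothetical captions; in practice these would come from the dataset's caption file.
captions = ["a cat playing piano", "a drone shot of a coastline at sunset"]

# Assuming a diffusers-style model repo with the tokenizer in a "tokenizer" subfolder.
tokenizer = AutoTokenizer.from_pretrained("THUDM/CogVideoX-5b", subfolder="tokenizer")

# Tokenize everything once, before training starts.
encoded = tokenizer(
    captions,
    padding="max_length",
    max_length=226,  # placeholder max text length; adjust to the model's setting
    truncation=True,
    return_tensors="pt",
)

# Save the precomputed IDs so dataloader workers only need to index into tensors.
torch.save(
    {"input_ids": encoded.input_ids, "attention_mask": encoded.attention_mask},
    "precomputed_text_tokens.pt",
)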

alfredplpl commented 2 weeks ago
export OMP_NUM_THREADS=16

This environment variable was extremely helpful for parallelizing the tokenizer. Many people might not be aware of it.
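
If setting it from inside the training script is more convenient than exporting it in the shell, a minimal sketch (assuming the assignment runs before torch and the tokenizer are imported, since OpenMP reads the variable at initialization):

import os

# Must be set before importing torch / transformers, because OpenMP reads it at startup.
os.environ.setdefault("OMP_NUM_THREADS", "16")

import torch

# Intra-op CPU parallelism can also be controlled explicitly.
torch.set_num_threads(16)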