NVIDIA / NeMo-Curator

Scalable data pre-processing and curation toolkit for LLMs
Apache License 2.0

[REVIEW] Switch Models to use Crossfit #58

Closed VibhuJawa closed 4 months ago

VibhuJawa commented 4 months ago

This PR enables using Crossfit.

Todo:

Benchmarks:

| Subset Place | Dataset Size | Batch Size | GPUs | Time Taken | Implementation | Speedup | Note |
|---|---|---|---|---|---|---|---|
| subset_CC-MAIN-2023-14_english | 1.8 GB | 1024** (dynamic) | 16 V100 | 163 | Crossfit | 1 | Batch size is dynamic |
| subset_CC-MAIN-2023-14_english | 1.8 GB | 256 (static) | 16 V100 | 230 | MainLine | 1.41 | Batch size is static, 512,1024 both OOM |
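
The notes above contrast a dynamic batch size with a static one that runs out of memory at 512 and 1024. This thread does not describe how Crossfit chooses its batches, so the following is only a generic sketch of one common approach (length-sorted batching under a padded-token budget), not Crossfit's implementation. The function name `dynamic_batches` and the parameters `max_tokens` and `max_batch_size` are hypothetical.

```python
from typing import List


def dynamic_batches(
    lengths: List[int], max_tokens: int, max_batch_size: int
) -> List[List[int]]:
    """Group sample indices into batches whose padded size stays bounded.

    Hypothetical illustration only: samples are sorted by length so that
    short sequences share large batches and long sequences get small ones.
    """
    # Visit samples from shortest to longest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Because of the ascending order, lengths[idx] is the longest
        # sequence in the batch, so it determines the padded width.
        padded_tokens = (len(current) + 1) * lengths[idx]
        if current and (padded_tokens > max_tokens or len(current) >= max_batch_size):
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


# Four short documents fit in one batch; the two long ones form another.
example = dynamic_batches([10, 500, 12, 480, 8, 11], max_tokens=1024, max_batch_size=1024)
print(example)
```

With a static batch size, every batch must be small enough for the longest sequences, which is one plausible reason a fixed 512 or 1024 could exhaust memory while a length-aware scheme reaches 1024 on short inputs.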

A100 numbers:

| Dataset | Size (Mb) | Resolution | GPU | Time (s) | Model | Speedup | Batch Size |
|---|---|---|---|---|---|---|---|
| subset_CC-MAIN-2023-14_english | 148 | 1024** | 2 A100 | 50.558 | Crossfit | 1 | Batch size is dynamic |
| subset_CC-MAIN-2023-14_english | 148 | 1024** | 2 A100 | 78.487 | Mainline | 1.55 | Batch size is static |
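
For reference, the values in the Speedup column match the ratio of the Mainline time to the Crossfit time, with Crossfit as the baseline of 1. A minimal sketch of that arithmetic, using the times reported above (the helper name `time_ratio` is hypothetical):

```python
def time_ratio(mainline_time: float, crossfit_time: float) -> float:
    """Return how many times longer Mainline takes than Crossfit, rounded to 2 places."""
    return round(mainline_time / crossfit_time, 2)


# Times copied from the benchmark tables in this PR.
v100_ratio = time_ratio(230, 163)        # 16 V100 run
a100_ratio = time_ratio(78.487, 50.558)  # 2 A100 run

print(v100_ratio, a100_ratio)
```

Read this way, Crossfit finishes about 1.41x faster than Mainline on the V100 run and about 1.55x faster on the A100 run.
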
VibhuJawa commented 4 months ago

CC: @sarahyurick for an initial review.

ayushdg commented 4 months ago

Side note: NeMo-Curator requires every commit in a PR to be both signed and signed off. This can be done with git commit -sS .... There's more information in https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md#pull-requests-pr-guidelines.

VibhuJawa commented 4 months ago

Thanks, fixed it.

VibhuJawa commented 4 months ago

@sarahyurick, I have added the quality model and cleaned up the scripts. Can I get a re-review, please?

VibhuJawa commented 4 months ago

Really sorry for the commit noise, guys (@ryantwolf, @sarahyurick). I messed up one of the git rebases (forgot to sign off a commit).

This should be ready for another review now. I have addressed all the review comments and opened follow-up issues for the ones that I could not get to in this PR.

VibhuJawa commented 4 months ago

@ryantwolf, thanks again for all the careful reviews. I appreciate the help. I think I have those resolved now.