flexflow / FlexFlow

FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

SampleIdxs creates large futures #1395

Open suranap opened 1 month ago

Problem

I'm running training on 64 nodes × 8 GPUs with a local batch size of 12, which gives a global batch size of 6144. This exceeds MAX_NUM_SAMPLES, currently set to 4k. If I increase that value to 16k (to handle 128+ nodes), I get an OutOfMemory error from Legion. This code creates futures that might be quite large. I'm told by Legion folks that ~1k is a reasonable max size for futures. If I reduce my batch size to fit within the original 4k sample limit, it works.
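For reference, the numbers above can be checked with a quick sketch. It assumes "4k" and "16k" mean 4096 and 16384, and that each sample index is stored as a 4-byte integer next to a 4-byte count. Those storage details are assumptions for illustration, not something confirmed from the FlexFlow source.

```python
# Values taken from the report above.
NODES = 64
GPUS_PER_NODE = 8
LOCAL_BATCH_SIZE = 12

# Assumption: "4k" means 4096 entries.
MAX_NUM_SAMPLES = 4096

# Assumption: indices and the sample count are 4-byte integers.
BYTES_PER_INT = 4


def index_payload_bytes(max_num_samples: int) -> int:
    """Bytes for a fixed-capacity index array plus one integer count."""
    return BYTES_PER_INT + max_num_samples * BYTES_PER_INT


global_batch_size = NODES * GPUS_PER_NODE * LOCAL_BATCH_SIZE
fits = global_batch_size <= MAX_NUM_SAMPLES

print(global_batch_size, fits)     # 6144 False
print(index_payload_bytes(4096))   # 16388
print(index_payload_bytes(16384))  # 65540
```

Under these assumptions, the payload grows linearly with MAX_NUM_SAMPLES regardless of how many entries a given batch actually uses.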

One open question: how big can a future be? Elliott (at Stanford) said that ~1k is a reasonable max size for futures, but 4k evidently works for FlexFlow.

Proposal

Right now SampleIdxs holds a contiguous range of numbers, so it should be enough to store only the starting index and num_samples to define the range. There is no need to store [1, 2, 3, ..., 6144].
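A minimal sketch of the idea, in Python for brevity. `SampleRange`, `indices`, and `shard` are hypothetical names for illustration, not existing FlexFlow types, and the even split across shards is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRange:
    """Hypothetical compact replacement for an explicit index list."""

    start: int
    num_samples: int

    def indices(self) -> range:
        # Expand lazily; nothing is materialized until iterated.
        return range(self.start, self.start + self.num_samples)

    def shard(self, shard_id: int, num_shards: int) -> "SampleRange":
        # Assumption: the batch divides evenly across shards (GPUs).
        if self.num_samples % num_shards != 0:
            raise ValueError("num_samples must be divisible by num_shards")
        if not 0 <= shard_id < num_shards:
            raise ValueError("shard_id out of range")
        per_shard = self.num_samples // num_shards
        return SampleRange(self.start + shard_id * per_shard, per_shard)


batch = SampleRange(start=0, num_samples=6144)
gpu1 = batch.shard(shard_id=1, num_shards=512)
print(gpu1)  # SampleRange(start=12, num_samples=12)
```

With this layout the payload is two integers no matter how large the global batch is, so MAX_NUM_SAMPLES would no longer bound the batch size.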

If you want to support shuffling, then you'll likely want to store a seed for a deterministic RNG and have each node calculate the samples assigned to it.
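One way the shuffling variant could look, again as a sketch under assumptions: every node rebuilds the same permutation from a shared seed and epoch number, then takes its own slice. The function names and the choice of Python's `random.Random` are illustrative only; a real implementation would use whatever deterministic RNG is available on all nodes.

```python
import random
from typing import List


def shuffled_indices(seed: int, epoch: int, dataset_size: int) -> List[int]:
    """Rebuild the same permutation on every node from (seed, epoch)."""
    # A string seed keeps the derivation explicit and deterministic.
    rng = random.Random(f"{seed}:{epoch}")
    perm = list(range(dataset_size))
    rng.shuffle(perm)
    return perm


def samples_for_shard(
    seed: int,
    epoch: int,
    dataset_size: int,
    batch_start: int,
    local_batch_size: int,
    shard_id: int,
) -> List[int]:
    """Return the sample indices one shard (GPU) should load."""
    perm = shuffled_indices(seed, epoch, dataset_size)
    begin = batch_start + shard_id * local_batch_size
    return perm[begin:begin + local_batch_size]


# Each shard computes its own slice locally; only small scalars are shared.
mine = samples_for_shard(
    seed=42, epoch=0, dataset_size=6144,
    batch_start=0, local_batch_size=12, shard_id=3,
)
print(len(mine))  # 12
```

The trade-off is that each node spends time and memory building the permutation itself, but only the seed, epoch, and batch offset need to travel through a future.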