Problem
I'm running training with 64 nodes × 8 GPUs × a local batch size of 12, for a global batch size of 6144. This exceeds MAX_NUM_SAMPLES, currently set to 4k. If I increase that value to 16k (to handle 128+ nodes), I get an OutOfMemory error from Legion: this code creates futures that can be quite large. If I reduce my batch size to fit within the original 4k sample limit, training works.
One question is: how big can a future be? Elliott (at Stanford) and the Legion folks say that ~1 KB is a reasonable maximum size for a future, but evidently the 4k setting works for FF.
Proposal
Right now SampleIdxs is a contiguous range of numbers. It should be fine to store only the starting index and num_samples to define the range; there's no need to store [1, 2, 3, ..., 6144] explicitly.
If you want to support shuffling, you'll likely want to store a seed for a deterministic RNG and have each node compute the samples assigned to it.