Open yyu22 opened 2 months ago
The original OOM error is due to not properly limiting the number of workers based on memory.
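As a rough illustration of that idea (not Curator's actual logic), the worker count per node can be derived from a memory budget instead of the core count; the 4 GB per-worker peak and 16 GB OS reserve below are assumed figures, not measurements:

```python
def safe_worker_count(total_mem_gb, per_worker_peak_gb, reserve_gb=16, max_cores=96):
    """Cap the number of workers by a memory budget rather than by core count.

    per_worker_peak_gb is an assumed peak; in practice it should be measured
    on a small sample of the data.
    """
    budget = total_mem_gb - reserve_gb
    return max(1, min(max_cores, int(budget // per_worker_peak_gb)))

# A 176 GB node with an assumed 4 GB peak per worker supports ~40 workers,
# well under its 96 cores.
print(safe_worker_count(176, 4))  # → 40
```

The point is that on memory-heavy workloads the core count (96 here) is the wrong ceiling; the memory budget binds first.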
> Memory usage is extremely unbalanced across the nodes

I'm not sure about this one, but if I had to guess, the batch size you've chosen is too small to keep all of the available workers busy.
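To make the "batch too small" guess concrete: if fewer in-flight tasks exist than workers, the surplus workers idle. A back-of-the-envelope sketch (the 960-worker figure simply assumes one worker per core across the 10 reported nodes, which may not match the actual deployment):

```python
def worker_utilization(n_tasks, n_workers):
    """Fraction of workers that receive at least one task in a single wave."""
    return min(1.0, n_tasks / n_workers)

# With batch size 32 on an assumed 10 nodes x 96 cores = 960 workers,
# only ~3% of workers would have work at any moment.
print(worker_utilization(32, 960))
```

If this is the cause, utilization should rise roughly linearly with batch size until tasks outnumber workers.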
> CPU utilization is very low.

Yes, `add_id` generally does not use much CPU and is heavily IO-bound.
> Setting the start-index argument slows down the code

This is expected.
> IO speed decreasing over time

Not sure about this one.
Follow up with @yyu22 and @ryantwolf
**Describe the bug**
Running the `add_id` module of Curator runs into OOMs even with a small batch size, e.g., 32. The dataset being processed is a single snapshot of the Red Pajama V2 dataset, about 4 TB in size. The job was run on 10 CPU nodes, each with 96 cores and 176 GB of memory.

Some observations:
```
$ grep -A 1 'cpu-00082' ./log.txt | grep 'Mem:'
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 144 0 17 160
Mem: 176 14 144 0 17 160
Mem: 176 14 144 0 17 160
Mem: 176 14 144 0 17 160
```