NVIDIA / NeMo-Curator

Scalable data pre-processing and curation toolkit for LLMs
Apache License 2.0

Running into OOM with add id #142

Open yyu22 opened 2 months ago

yyu22 commented 2 months ago

Describe the bug

Running the add_id module of Curator results in OOMs even with a small batch size, e.g., 32. The dataset being processed is a single snapshot of the RedPajama v2 dataset, about 4 TB in size. The job was run on 10 CPU nodes, each with 96 cores and 176 GB of memory.

Jun 25 13:48:18.323459 942129 slurmstepd   0x155552de2d40: error: Detected 7 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00046: task 5: Out Of Memory
srun: Terminating StepId=1127164.0
Jun 25 13:48:19.899767 2590557 slurmstepd   0x155552de2d40: error: Detected 1 oom_kill event in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00017: task 2: Terminated
srun: error: cpu-00038: task 3: Terminated
srun: error: cpu-00050: task 6: Terminated
Jun 25 13:48:20.991455 2567860 slurmstepd   0x155552de2d40: error: Detected 2 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: Force Terminated StepId=1127164.0

Some observations:

- Memory usage is extremely unbalanced across the nodes (per-node memory in GB, from `free -g`):

cpu-00009              total        used        free      shared  buff/cache   available
Mem:            176          14         144           0          18         160
cpu-00042              total        used        free      shared  buff/cache   available
Mem:            176          79          88           0           8          94
cpu-00046              total        used        free      shared  buff/cache   available
Mem:            176         113          61           0           2          61
cpu-00082              total        used        free      shared  buff/cache   available
Mem:            176          13         145           0          17         160
cpu-00050              total        used        free      shared  buff/cache   available
Mem:            176          74          78           0          23          99
cpu-00019              total        used        free      shared  buff/cache   available
Mem:            176          72          38           0          65         101
cpu-00087              total        used        free      shared  buff/cache   available
Mem:            176          55         106           0          15         119
cpu-00086              total        used        free      shared  buff/cache   available
Mem:            176          90          80           0           6          84
cpu-00020              total        used        free      shared  buff/cache   available
Mem:            176          36         101           0          39         138
cpu-00002              total        used        free      shared  buff/cache   available
Mem:            176         156           2           0          17          18

$ grep -A 1 'cpu-00082' ./log.txt | grep 'Mem:'
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         144           0          17         160
Mem:            176          14         144           0          17         160
Mem:            176          14         144           0          17         160
Mem:            176          14         144           0          17         160


- CPU utilization is very low.

- Setting the `start-index` argument slows down the code.

- IO speed decreases over time.

**Steps/Code to reproduce bug**
from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_batched_files

# data_path / id_data_path are the input and output jsonl directories
batch_index = 0
for files in get_batched_files(data_path, id_data_path, "jsonl", batch_size=128):
    dataset = DocumentDataset.read_json(files, add_filename=True)
    print("Done reading dataset")
    add_id = AddId(
        id_field="id",
        id_prefix=f"rpv2-{batch_index}",
    )
    print("Start adding id")
    id_dataset = add_id(dataset)
    print("Done adding id")
    id_dataset.to_json(id_data_path, write_to_filename=True)
    batch_index += 1

ryantwolf commented 1 month ago

The original OOM error is due to the number of workers not being properly limited based on the memory available on each node.
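
A minimal sketch of one way to do that when creating the Dask cluster on a node; the worker count and memory limit below are illustrative assumptions for a 176 GB node, not values taken from this issue:

```python
from dask.distributed import Client, LocalCluster

# Illustrative sizing for a 176 GB node: 8 workers x 20 GB leaves headroom
# for the OS and page cache. Tune both numbers for your own hardware.
cluster = LocalCluster(
    n_workers=8,            # deliberately far fewer than the 96 cores
    threads_per_worker=1,
    memory_limit="20GB",    # Dask spills/pauses/restarts a worker near this limit
)
client = Client(cluster)
print(client)
```

When the workers are launched on Slurm with the `dask worker` CLI instead, the corresponding flags are `--nworkers` and `--memory-limit`.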

> Memory usage is extremely unbalanced across the nodes.

I'm not sure about this one, but if I had to guess, the batch size you've chosen is too small to keep all of the available workers busy.
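
As a rough sanity check (my sketch, not something from the thread), the number of files in a batch should be at least the number of worker processes in the cluster, since each input file only produces a handful of tasks; the scheduler address below is hypothetical:

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

# Worker processes currently registered with the scheduler.
n_workers = len(client.scheduler_info()["workers"])

# Assumption: roughly one partition per input file, so a batch needs at least
# this many files before every worker has something to do.
batch_size = max(128, n_workers)
print(f"{n_workers} workers -> batch_size={batch_size}")
```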

> CPU utilization is very low.

Yes, add_id in general does not use much CPU and is heavily IO-bound.

Setting the start-index argument slows down the code

This is expected.
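
A plausible reason, as an assumption on my part rather than something stated in the thread: producing a contiguous numbering from a fixed starting index requires knowing how many rows precede each partition before ids can be assigned, which adds an extra pass over the data, whereas the default path can build ids from the partition number and the row's position inside it. A sketch of the two calls, assuming the Python keyword is `start_index`:

```python
from nemo_curator import AddId

# Default path: no global row count needed (ids can be built per partition).
add_id_fast = AddId(id_field="id", id_prefix="rpv2-0")

# Contiguous numbering from a fixed index (assumed keyword `start_index`):
# the global offset of each partition has to be computed first.
add_id_contiguous = AddId(id_field="id", id_prefix="rpv2-0", start_index=0)
```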

> IO speed decreases over time.

Not sure about this one.

glam621 commented 3 weeks ago

Following up with @yyu22 and @ryantwolf.