IndexError: list index out of range when running custom.yaml file with custom num_files_train parameter

anrahman4 commented 3 months ago

After recent changes to dlio_benchmark/utils/config.py, I am running into a list index out of range error when running with certain numbers of parallel processes under mpirun. Runs succeed with process counts that evenly divide 9375 (the value I have set for num_files_train), but fail when the count does not divide it evenly.

Command:

mpirun -np 4 dlio_benchmark --config-dir /mnt/mlperf_stor/dlio_benchmark/dlio_benchmark/configs/ workload=custom_workload.yaml

Error Message

Error executing job with overrides: ['workload=custom_workload.yaml']
Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 386, in run_benchmark
    benchmark.run()
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 330, in run
    self.args.reconfigure(epoch)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/utils/config.py", line 382, in reconfigure
    self.train_global_index_map = self.get_global_map_index(self.file_list_train, self.total_samples_train)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/utils/config.py", line 361, in get_global_map_index
    abs_path = os.path.abspath(file_list[file_index])
IndexError: list index out of range

Here is my custom_workload.yaml:

custom_workload.yaml

model: unet3d

framework: pytorch

workflow:
  generate_data: False
  train: True
  checkpoint: False
  profiling: True

dataset:
  data_folder: /mnt/dlio_benchmark/dlio_benchmark/dlio_benchmark/configs/data
  format: npz
  num_files_train: 9375
  num_samples_per_file: 1
  record_length: 146600628
  record_length_stdev: 68341808

reader:
  data_loader: pytorch
  batch_size: 4
  read_threads: 4
  file_shuffle: seed
  sample_shuffle: seed
  shuffle_size: 4

train:
  epochs: 1
  computation_time: 0.323

checkpoint:
  checkpoint_folder: /mnt/dlio_benchmark/dlio_benchmark/dlio_benchmark/configs/checkpoints
  checkpoint_after_epoch: 5
  epochs_between_checkpoints: 2
  model_size: 499153191

metric:
  au: 0.90

I tracked the issue down in the code currently on the main branch of dlio_benchmark. I opened the file the traceback points to; the error comes from the get_global_map_index function:

    @dlp.log
    def get_global_map_index(self, file_list, total_samples):
        process_thread_file_map = {}
        num_files = len(file_list)
        if num_files > 0:
            samples_per_proc = int(math.ceil(total_samples/self.comm_size)) 
            start_sample = self.my_rank * samples_per_proc
            end_sample = (self.my_rank + 1) * samples_per_proc
            for global_sample_index in range(start_sample, end_sample):
                file_index = global_sample_index//self.num_samples_per_file
                abs_path = os.path.abspath(file_list[file_index]) 
                sample_index = global_sample_index % self.num_samples_per_file
                process_thread_file_map[global_sample_index] = (abs_path, sample_index)
            logging.debug(f"{self.my_rank} {process_thread_file_map}")
        return process_thread_file_map

Overall, this code divides the samples among the number of processes you set with mpirun in the initial command, building a map from each global sample index to a file path and a sample index within that file. In my configuration num_samples_per_file is 1, so total_samples equals the value set for num_files_train in the dataset section of custom_workload.yaml.
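To make the mapping concrete, here is a minimal sketch of the same index arithmetic outside of DLIO. The helper name build_index_map and the file names are hypothetical, and computing total_samples as len(file_list) * num_samples_per_file is my assumption about how the value is derived.

```python
import os


def build_index_map(file_list, num_samples_per_file):
    """Map every global sample index to (absolute file path, index within file)."""
    index_map = {}
    # Assumption: every file holds the same number of samples.
    total_samples = len(file_list) * num_samples_per_file
    for global_sample_index in range(total_samples):
        file_index = global_sample_index // num_samples_per_file
        sample_index = global_sample_index % num_samples_per_file
        abs_path = os.path.abspath(file_list[file_index])
        index_map[global_sample_index] = (abs_path, sample_index)
    return index_map


# Hypothetical file names: 3 files with 2 samples each.
files = ["img_0.npz", "img_1.npz", "img_2.npz"]
index_map = build_index_map(files, num_samples_per_file=2)
for key, (path, sample) in index_map.items():
    print(key, os.path.basename(path), sample)
```

The valid keys run from 0 to total_samples - 1, so any loop that asks for index total_samples or higher walks off the end of file_list.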

The problem is that when you specify, say, 9375 files, it does create that many files, but with zero-based indexing the last valid index is 9374. So let's take a look at an example where we run mpirun -np 4:

In this case 4 processes are used, with ranks 0-3. The samples_per_proc variable is calculated as ceiling(9375/4) = 2344 samples per process. The code then defines the start_sample and end_sample range that each rank is responsible for. For 9,375 files, the for loop above breaks down like this:

rank 0: for loop 0, 2344 => samples 0-2343
rank 1: for loop 2344, 4688 => samples 2344-4687
rank 2: for loop 4688, 7032 => samples 4688-7031
rank 3: for loop 7032, 9376 => samples 7032-9375

But for the last rank, in this case rank 3, end_sample is set to 9376, not 9375. The last of the 9375 files sits at index 9374, but the loop also tries index 9375, which does not exist, hence the Python list index out of range.

The reason you don't run into this bug when running with np = 1 is that samples_per_proc ends up being exactly 9375, since self.comm_size = 1. The for loop then runs over indices 0 to 9374, which is correct.
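The arithmetic above can be checked with a short standalone sketch. It only mirrors the samples_per_proc, start_sample and end_sample calculation from get_global_map_index; the helper names rank_ranges and overshoot are hypothetical and not part of DLIO.

```python
import math


def rank_ranges(total_samples, comm_size):
    """Return the (start_sample, end_sample) pair each rank would loop over."""
    samples_per_proc = int(math.ceil(total_samples / comm_size))
    return [(rank * samples_per_proc, (rank + 1) * samples_per_proc)
            for rank in range(comm_size)]


def overshoot(total_samples, comm_size):
    """Return the indices covered by the ranges that lie past the last valid index."""
    last_end = rank_ranges(total_samples, comm_size)[-1][1]
    return list(range(total_samples, last_end))


total = 9375
for comm_size in (1, 3, 4):
    print(comm_size, rank_ranges(total, comm_size), overshoot(total, comm_size))
```

With 1 or 3 processes the ranges end exactly at 9375 and nothing overshoots. With 4 processes the last range ends at 9376, so index 9375 is requested even though the last valid index is 9374.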

I believe this behavior comes from the introduction of the ceiling function in the samples_per_proc calculation: when more than one rank is used and num_files_train is not evenly divisible by the process count set in mpirun, the end of the last rank's range is overestimated. Using the version of this function from MLPerf Storage commit bc693c6 seems to fix the issue initially and allows the first epoch to complete:

commit bc693c6

        process_thread_file_map = {}
        for global_sample_index in range(total_samples):
            file_index = global_sample_index//self.num_samples_per_file
            abs_path = os.path.abspath(file_list[file_index]) 
            sample_index = global_sample_index % self.num_samples_per_file
            process_thread_file_map[global_sample_index] = (abs_path, sample_index)
        return process_thread_file_map

But afterwards, I get this error message instead:

Error executing job with overrides: ['workload=custom_workload.yaml']
Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 386, in run_benchmark
    benchmark.run()
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 338, in run
    steps = self._train(epoch)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 259, in _train
    for batch in dlp.iter(loader.next()):
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 222, in iter
    for v in func:
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 174, in next
    for batch in self._dataset:
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
    return self._process_data(data)
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 84, in __getitem__
    return self.reader.read_index(image_idx, step)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/npz_reader.py", line 57, in read_index
    return super().read_index(image_idx, step)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/reader_handler.py", line 116, in read_index
    filename, sample_index = self.global_index_map[global_sample_idx]
KeyError: 9375

It looks like on the main branch the DataLoader then also hits a KeyError, where it tries to look up key 9375, one past the last valid key, 9374.
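For illustration, here is a minimal sketch of one way to keep the per-rank ranges inside the valid indices by clamping both ends to total_samples. This is an assumption on my part, not the actual change in any commit, and clamped_range is a hypothetical helper. Whatever produces the indices handed to the reader would need the same bound, otherwise the map and the lookups disagree.

```python
import math


def clamped_range(rank, comm_size, total_samples):
    """Per-rank [start, end) range that never goes past total_samples."""
    samples_per_proc = int(math.ceil(total_samples / comm_size))
    start = min(rank * samples_per_proc, total_samples)
    end = min((rank + 1) * samples_per_proc, total_samples)
    return start, end


total = 9375
ranges = [clamped_range(rank, 4, total) for rank in range(4)]
print(ranges)

# Every valid index is covered exactly once and nothing exceeds 9374.
covered = [i for start, end in ranges for i in range(start, end)]
print(len(covered), max(covered))
```

With this split the last rank simply gets one sample fewer than the others (2343 instead of 2344).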

Please confirm if this is truly the issue and fix the relevant files to have the global map create a proper index map. Thank you.

hariharan-devarajan commented 3 months ago

@anrahman4 Thanks for reporting. I will look into this.

hariharan-devarajan commented 3 months ago

@anrahman4 Please check #226 and see if it solves your problem.

anrahman4 commented 3 months ago

@hariharan-devarajan

Pulled the commit and ran the same command:

mpirun -np 4 dlio_benchmark --config-dir /mnt/mlperf_stor/dlio_benchmark/dlio_benchmark/configs/ workload=custom_workload.yaml

Still received the following error in dlio_benchmark/reader/reader_handler.py:

Error executing job with overrides: ['workload=custom_workload.yaml']
Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 386, in run_benchmark
    benchmark.run()
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 338, in run
    steps = self._train(epoch)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 259, in _train
    for batch in dlp.iter(loader.next()):
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 222, in iter
    for v in func:
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 174, in next
    for batch in self._dataset:
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
    return self._process_data(data)
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 84, in __getitem__
    return self.reader.read_index(image_idx, step)
  File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
    x = func(*args, **kwargs)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/npz_reader.py", line 57, in read_index
    return super().read_index(image_idx, step)
  File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/reader_handler.py", line 116, in read_index
    filename, sample_index = self.global_index_map[global_sample_idx]
KeyError: 9375

hariharan-devarajan commented 3 months ago

@anrahman4 Thank you for your testing. It looks like the PyTorch sampler went overboard. :( It should be correct now. Can you retest and confirm? I tried with four epochs, and it works with your configuration.

anrahman4 commented 3 months ago

@hariharan-devarajan That did the trick :)

Was able to run the mpirun command with np as 1, 2, 3, 4, 5, 6, 7, 8

Thank you for the very prompt fix!

argonne-lcf / dlio_benchmark

IndexError: list index out of range when running custom.yaml file with custom num_files_train parameter #225