Closed anrahman4 closed 1 month ago
@anrahman4 Thanks for reporting. I will look into this.
@anrahman4 Please check #226 and see if it solves your problem.
@hariharan-devarajan
Pulled the commit and ran the same command:
mpirun -np 4 dlio_benchmark --config-dir /mnt/mlperf_stor/dlio_benchmark/dlio_benchmark/configs/ workload=custom_workload.yaml
Still received the following error in dlio_benchmark/reader/reader_handler.py:
Error executing job with overrides: ['workload=custom_workload.yaml']
Traceback (most recent call last):
File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 386, in run_benchmark
benchmark.run()
File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
x = func(*args, **kwargs)
File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 338, in run
steps = self._train(epoch)
File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
x = func(*args, **kwargs)
File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/main.py", line 259, in _train
for batch in dlp.iter(loader.next()):
File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 222, in iter
for v in func:
File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 174, in next
for batch in self._dataset:
File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
return self._process_data(data)
File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
data.reraise()
File "/home/labuser/.local/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/labuser/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
x = func(*args, **kwargs)
File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 84, in __getitem__
return self.reader.read_index(image_idx, step)
File "/home/labuser/.local/lib/python3.10/site-packages/dftracer/logger.py", line 199, in wrapper
x = func(*args, **kwargs)
File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/npz_reader.py", line 57, in read_index
return super().read_index(image_idx, step)
File "/home/labuser/.local/lib/python3.10/site-packages/dlio_benchmark/reader/reader_handler.py", line 116, in read_index
filename, sample_index = self.global_index_map[global_sample_idx]
KeyError: 9375
@anrahman4 Thank you for your testing. It looks like the PyTorch sampler went overboard. :( It should be correct now. Can you retest and confirm? I tried with four epochs, and it works with your configuration.
@hariharan-devarajan That did the trick :)
Was able to to run the mpirun command with np as 1, 2, 3, 4, 5, 6, 7, 8
Thank you for the very prompt fix!
IndexError: list index out of range when running custom.yaml file with custom num_files_train parameter
After recent changes that were done to dlio_benchmark/utils/config.py , I am running into issues with a list index out of range when running certain numbers of parallel working with mpirun. I am able to successful runs with integer values that even divide 9375 (the value I have set to num_files_train), but does not work cleanly when the number divides into non-whole numbers.
Command:
Error Message
Here is my custom_workload.yaml:
custom_workload.yaml
I figured out the issue with the code that is currently listed under the main branch for dlio_benchmark. I went to the file in which the Python script was pointing at the error. The error came from the get_global_map_index function:
What the code overall is trying to do here, is to divide up the number of samples amongst the amount of cores you have set using mpirun in the initial command, making a map to the number of files to indexes in Python. Total_samples is defined by your workload file as the value set in num_files_train in the dataset section of your custom_workload.yaml file.
The problem is when you specify the file number as say, 9375, it will indeed create that many files, but the very last file name if using zero based indexing will be 9374. So let’s take a look at an example where we run mpirun -np 4:
In this case 4 cores will be utilized and will be defined as ranks 0-3. The samples_per_proc variable will get calculated as ceiling(9375/4) = 2344 samples per processor. The code will define the start_sample and end_sample range that each rank is responsible for. For 9,375 files, the forward loop up top breaks down like this:
rank 0: for loop 0, 2344 => samples 0-2343 rank 1: for loop 2344, 4688 => samples 2344-4687 rank 2: for loop 4688, 7033 => samples 4688-7032 rank 3: for loop 7033, 9376 => samples 7033-9375
But for the last rank, in this case rank 3, the variable for end_sample is actually set to 9376, not 9375. If you look at the indexing, the indexing is supposed to look for index 9374 to access file number 9375, but it is instead trying to use index 9375 for 9375 which does not exist, hence the Python list index out of range.
The reason why you didn’t run into this bug when running np = 1 is because the samples_per_proc calculation just ends up being 9375 since self.comm_size = 1. Then when you go into the forward loop, it actually goes from 0 to 9374 in terms of indexing which is actually correct.
I believe this behavior has to do with the introduction of the ceiling function to calculate samples_per_proc, where the very last rank when more than one rank is being used will be incorrectly calculated if the num_files_train parameter is not divisible into a whole number by the concurrency set in mpirun. Using the MLPerf Storage commit version bc693c6 of this function seems to fix the issue initially and allows the first epoch to complete:
commit bc693c6
But after, I get this error message instead:
Looks like in the main branch, the DataLoader then also has a key error, where it is trying to look at key 9375 as opposed to 9374.
Please confirm if this is truly the issue and fix the relevant files to have the global map create a proper index map. Thank you.