LLNL / ygm

Parallel Parquet File Reading (v2) #188

Closed KIwabuchi closed 11 months ago

KIwabuchi commented 12 months ago

This pull request lets every MPI rank read an equal number of lines, regardless of how the lines are distributed across the input files. For example, if 100 lines of data are spread unevenly over several files and there are four MPI ranks, each rank reads 25 lines: rank 0 reads the first 25 lines, rank 1 the next 25, and so on. As a result, a single file may be read in parallel by multiple ranks.
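
To make the partitioning concrete, here is a minimal sketch of how a rank's contiguous block of lines could be mapped onto per-file ranges. It is an illustration only, not the code in this PR: `FileSlice`, `rank_row_range`, and `slices_for_rank` are hypothetical names, the per-file line counts are assumed to be known up front, and giving any remainder lines to the lowest ranks is an assumption for the case where the total does not divide evenly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper type: a half-open range of lines [begin, end) within one file.
struct FileSlice {
  std::size_t file_index;
  std::int64_t begin;
  std::int64_t end;
};

// Global half-open line range [first, last) owned by `rank`.
// Assumption: when total_rows is not divisible by num_ranks,
// the lowest-numbered ranks each take one extra line.
inline std::pair<std::int64_t, std::int64_t> rank_row_range(
    std::int64_t total_rows, int rank, int num_ranks) {
  assert(num_ranks > 0 && rank >= 0 && rank < num_ranks);
  const std::int64_t base = total_rows / num_ranks;
  const std::int64_t extra = total_rows % num_ranks;
  const std::int64_t first = rank * base + std::min<std::int64_t>(rank, extra);
  const std::int64_t count = base + (rank < extra ? 1 : 0);
  return {first, first + count};
}

// Intersect the rank's global range with each file's global range and
// convert the overlap to file-local line offsets.
inline std::vector<FileSlice> slices_for_rank(
    const std::vector<std::int64_t>& rows_per_file, int rank, int num_ranks) {
  std::int64_t total = 0;
  for (std::int64_t n : rows_per_file) total += n;

  const auto [first, last] = rank_row_range(total, rank, num_ranks);

  std::vector<FileSlice> slices;
  std::int64_t file_start = 0;  // global index of the current file's first line
  for (std::size_t i = 0; i < rows_per_file.size(); ++i) {
    const std::int64_t file_end = file_start + rows_per_file[i];
    const std::int64_t lo = std::max(first, file_start);
    const std::int64_t hi = std::min(last, file_end);
    if (lo < hi) {
      slices.push_back({i, lo - file_start, hi - file_start});
    }
    file_start = file_end;
  }
  return slices;
}
```

With three files holding 10, 60, and 30 lines and four ranks, rank 0 would get all of file 0 plus lines [0, 15) of file 1, while rank 2 would get lines [40, 60) of file 1 plus lines [0, 5) of file 2, so file 1 ends up being read by three ranks.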

This new feature requires Arrow v14 or later. With an older version, the reader falls back to the previous mode, which assigns an equal number of files to each rank regardless of how many lines each file contains.
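
For comparison, a rough sketch of the file-based fallback and of a version check might look like the following. It is not the reader's actual implementation: `row_balanced_mode_available` and `files_for_rank` are hypothetical names, the compile-time check against Arrow's `ARROW_VERSION_MAJOR` macro is only one possible way to detect the version, and the round-robin file assignment is an assumption, since the description above only says that each rank gets an equal number of files.

```cpp
#include <cstddef>
#include <vector>

// Pull in Arrow's version macros only if Arrow headers are present,
// so this sketch also compiles without Arrow installed.
#if defined(__has_include)
#  if __has_include(<arrow/util/config.h>)
#    include <arrow/util/config.h>  // defines ARROW_VERSION_MAJOR
#  endif
#endif

// Hypothetical check: true only when building against Arrow v14 or later.
constexpr bool row_balanced_mode_available() {
#if defined(ARROW_VERSION_MAJOR)
  return ARROW_VERSION_MAJOR >= 14;
#else
  return false;
#endif
}

// Fallback sketch: hand out whole files round-robin (an assumed scheme),
// ignoring how many lines each file holds.
inline std::vector<std::size_t> files_for_rank(std::size_t num_files,
                                               int rank, int num_ranks) {
  std::vector<std::size_t> mine;
  for (std::size_t i = static_cast<std::size_t>(rank); i < num_files;
       i += static_cast<std::size_t>(num_ranks)) {
    mine.push_back(i);
  }
  return mine;
}
```

With 10 files and four ranks, rank 1 would read files 1, 5, and 9 in this scheme. If those files differ a lot in size, the per-rank line counts can differ just as much, which is the imbalance the line-based mode avoids.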