Test data shows that with 40K partitions (200 mappers and 200 reducers) and a small partition size (around 256K), the shuffle manager spends about half of the total read time retrieving the index files. In the run below, reading the index files accounts for 10187 ms of the 19049 ms total:
Reading 40000 partitions with 8 threads 99% (39991/40000)
Read shuffle data completed in 19049 milliseconds
Reading index file: 10187 ms
storage factory: com.memverge.splash.shared.SharedFSFactory
number of mappers: 200
number of reducers: 200
total shuffle size: 3GB
bytes written: 3GB
bytes read: 3GB
number of blocks: 64
blocks size: 256KB
partition size: 81KB
concurrent tasks: 8
bandwidth: 167MB/s
Because shuffle indices are small compared to the shuffle data, we can cache them in memory to eliminate this overhead.
For example, in the configuration described above, each node needs to cache only:
8 * 201 * 200 = 321,600 bytes (about 320K)
Here 8 is the size of a long in bytes, 201 is the number of partition offsets in each index file, and 200 is the number of index files.
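The arithmetic above can be checked with a small sketch. The function name `index_cache_bytes` is hypothetical, and the code assumes one index file per mapper and one more offset than the number of reducers, which matches the 200 and 201 in the example.

```python
LONG_BYTES = 8  # each partition offset is stored as a 64-bit long


def index_cache_bytes(num_mappers: int, num_reducers: int) -> int:
    """Estimate the memory needed to hold every shuffle index on one node."""
    # Assumption: each index file holds num_reducers + 1 offsets
    # (201 offsets for 200 reducers, as in the example above).
    offsets_per_index = num_reducers + 1
    # Assumption: one index file per mapper (200 index files for 200 mappers).
    return LONG_BYTES * offsets_per_index * num_mappers


size = index_cache_bytes(num_mappers=200, num_reducers=200)
print(f"{size} bytes")
```

Because the size grows with mappers times reducers, the same estimate can be used to decide whether a larger job still fits in the memory set aside for the cache.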
When there is not enough memory, we should fall back to retrieving the index data from the file system.
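A minimal sketch of this idea follows, not the shuffle manager's actual implementation. The class `IndexCache`, the `load_index` callback, and the byte budget are all hypothetical names introduced for illustration, and the sketch assumes each index is a sequence of big-endian 64-bit offsets. Once the budget is used up, lookups for uncached indices go to storage every time.

```python
import struct
from typing import Callable, Dict, Tuple


class IndexCache:
    """Keeps shuffle index offsets in memory up to a fixed byte budget."""

    def __init__(self, max_bytes: int, load_index: Callable[[int], bytes]):
        self._max_bytes = max_bytes
        # Hypothetical reader: takes a map id, returns the raw index bytes
        # from the file system (or any other storage).
        self._load_index = load_index
        self._entries: Dict[int, Tuple[int, ...]] = {}
        self._used_bytes = 0

    def offsets(self, map_id: int) -> Tuple[int, ...]:
        cached = self._entries.get(map_id)
        if cached is not None:
            return cached  # served from memory, no storage access
        raw = self._load_index(map_id)
        # Assumption: offsets are stored as big-endian 64-bit longs.
        offsets = struct.unpack(f">{len(raw) // 8}q", raw)
        if self._used_bytes + len(raw) <= self._max_bytes:
            self._entries[map_id] = offsets
            self._used_bytes += len(raw)
        # If the budget is exhausted the entry is not cached, so the next
        # lookup falls back to the file system again.
        return offsets

    def partition_range(self, map_id: int, reduce_id: int) -> Tuple[int, int]:
        """Return the (start, end) byte offsets of one partition."""
        offsets = self.offsets(map_id)
        return offsets[reduce_id], offsets[reduce_id + 1]


# Usage with a fake loader that records every storage read.
reads = []


def fake_load(map_id: int) -> bytes:
    reads.append(map_id)
    return struct.pack(">3q", 0, 100, 250)  # two partitions, three offsets


cache = IndexCache(max_bytes=24, load_index=fake_load)  # room for one index
first = cache.partition_range(0, 1)   # loaded from storage, then cached
second = cache.partition_range(0, 0)  # served from memory
third = cache.partition_range(1, 0)   # budget exhausted: read, not cached
fourth = cache.partition_range(1, 1)  # read from storage again
```

The sketch is single-threaded and never evicts entries. With 8 concurrent tasks as in the test above, a real cache would need synchronization, and an eviction policy may be preferable to a fixed first-come budget.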