MemVerge / splash

Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Apache License 2.0
127 stars 29 forks source link

Cache shuffle index in memory #51

Closed jealous closed 5 years ago

jealous commented 5 years ago

Test data shows that when we have 40K partitions (200 mappers & 200 reducers) and small partition size (around 256K), the shuffle manager took almost half the total time retrieving the index file.

Reading 40000 partitions with 8 threads   99% (39991/40000)
Read shuffle data completed in 19049 milliseconds
    Reading index file:  10187 ms
    storage factory:     com.memverge.splash.shared.SharedFSFactory
    number of mappers:   200
    number of reducers:  200
    total shuffle size:  3GB
    bytes written:       3GB
    bytes read:          3GB
    number of blocks:    64
    blocks size:         256KB
    partition size:      81KB
    concurrent tasks:    8
    bandwidth:           167MB/s

Due to the fact that comparing to the shuffle data, shuffle indices are relatively small, we could try the best to cache them in memory to eliminate this overhead.

For example, in the configuration described above, each node will only cache: 8 201 200 = 320K 8 is the length of long. 201 is the number of partition offsets. 200 is the number of the index files.

When memory is not enough, we should fallback to retrieve the data from the file system.