ArnaoutLab / diversity

Partitioned frequency- and similarity-sensitive diversity in Python

Ray memory spillage when dealing with large datasets #61

Closed: chhotii-alex closed this issue 9 months ago

chhotii-alex commented 11 months ago

When dealing with a large dataset (about 77,000 species) I get this:

(raylet) Spilled 9707 MiB, 2 objects, write throughput 1088 MiB/s.

And the job slowed down quite a bit, taking ~50% longer than I expected from extrapolating from a smaller dataset: it took ~3 hours for a dataset half the size, and since the pairwise similarity computation scales roughly quadratically with the number of species, I expected doubling the dataset size to increase the run time to ~12 hours, but it took 18 hours.

I believe this is because all the chunks are pinned in memory until get(futures) is called in SimilarityFromFunction.weighted_similarities: we ask for all the chunks to be created in one shot and then get them all in one shot.

See https://docs.ray.io/en/master/ray-core/patterns/limit-pending-tasks.html for a discussion of how we could handle this better.
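
A minimal sketch of that pattern, assuming a hypothetical `compute_chunk` remote function and a `MAX_IN_FLIGHT` cap (neither is part of the library's actual code): submission is throttled with `ray.wait` so only a bounded number of chunk tasks are pending at any time, rather than all chunks being created and pinned at once.

```python
# Sketch of the "limit pending tasks" pattern with Ray backpressure.
# compute_chunk, MAX_IN_FLIGHT, and chunk_bounds are illustrative names,
# not the library's API.
import ray

ray.init(ignore_reinit_error=True)

MAX_IN_FLIGHT = 4  # cap on concurrently pending chunk tasks


@ray.remote
def compute_chunk(chunk_index, start_row, stop_row):
    # Placeholder: compute the similarity rows for [start_row, stop_row).
    return chunk_index, [[0.0] * 3 for _ in range(stop_row - start_row)]


def submit_with_backpressure(chunk_bounds):
    in_flight = []
    results = []
    for i, (start, stop) in enumerate(chunk_bounds):
        if len(in_flight) >= MAX_IN_FLIGHT:
            # Block until at least one pending task finishes before
            # submitting more, so all chunks are not pinned at once.
            ready, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(ready))
        in_flight.append(compute_chunk.remote(i, start, stop))
    results.extend(ray.get(in_flight))
    return results
```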

chhotii-alex commented 11 months ago

Also, note that the Ray documentation recommends against calling .get() on the entire list of futures at once, as this can also be a source of memory issues: https://docs.ray.io/en/latest/ray-core/patterns/ray-get-too-many-objects.html

However, the example code they give may process results out of order. That is a problem for generating the similarity matrix: if the chunks are assembled in the wrong order, the similarity matrix will be wrong. The remote function should return its chunk index so we can sort the returned chunks back into chunk-index order, as in the sketch below.
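
For example, a minimal sketch under the same assumptions as above (the hypothetical `compute_chunk`, a `chunk_bounds` list of row ranges, and the matrix shape are all illustrative): results are fetched a few at a time with `ray.wait`, and the returned chunk index is used to write each block into the correct rows, so out-of-order completion doesn't scramble the matrix.

```python
# Sketch: reassemble chunks into the similarity matrix using the chunk
# index returned by each remote task. Names are illustrative only.
import numpy as np
import ray


def assemble_matrix(futures, chunk_bounds, n_species, n_cols):
    matrix = np.empty((n_species, n_cols))
    remaining = list(futures)
    while remaining:
        # Fetch results as they complete instead of ray.get on the full list.
        ready, remaining = ray.wait(remaining, num_returns=1)
        for chunk_index, rows in ray.get(ready):
            start, stop = chunk_bounds[chunk_index]
            matrix[start:stop, :] = rows  # place the chunk in its own rows
    return matrix
```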