ArnaoutLab / diversity

Partitioned frequency- and similarity-sensitive diversity in Python

Ray memory spillage when dealing with large datasets #61

Closed: chhotii-alex closed this issue 9 months ago

chhotii-alex commented 11 months ago

When dealing with a large dataset (about 77,000 species) I get this:

(raylet) Spilled 9707 MiB, 2 objects, write throughput 1088 MiB/s.

And the job slowed down quite a bit, taking ~50% longer than I expected from extrapolating from a smaller dataset: it took ~3 hours for a dataset half the size, and since the pairwise similarity computation scales roughly quadratically with the number of species, I expected doubling the dataset size to increase the run time to ~12 hours, but it took 18 hours.

I believe this is because all the chunks are pinned in memory until get(futures) is called in SimilarityFromFunction.weighted_similarities: we ask for all the chunks to be created in one shot and then get them all in one shot.

See https://docs.ray.io/en/master/ray-core/patterns/limit-pending-tasks.html for a discussion of how we could handle this better.
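
A minimal sketch of that pattern, assuming a hypothetical `compute_chunk` remote function and a `MAX_IN_FLIGHT` cap (neither is part of the library's actual code): submission is throttled with `ray.wait` so only a bounded number of chunk tasks are pending at any time, rather than all chunks being created and pinned at once.

```python
# Sketch of the "limit pending tasks" pattern with Ray backpressure.
# compute_chunk, MAX_IN_FLIGHT, and chunk_bounds are illustrative names,
# not the library's API.
import ray

ray.init(ignore_reinit_error=True)

MAX_IN_FLIGHT = 4  # cap on concurrently pending chunk tasks


@ray.remote
def compute_chunk(chunk_index, start_row, stop_row):
    # Placeholder: compute the similarity rows for [start_row, stop_row).
    return chunk_index, [[0.0] * 3 for _ in range(stop_row - start_row)]


def submit_with_backpressure(chunk_bounds):
    in_flight = []
    results = []
    for i, (start, stop) in enumerate(chunk_bounds):
        if len(in_flight) >= MAX_IN_FLIGHT:
            # Block until at least one pending task finishes before
            # submitting more, so all chunks are not pinned at once.
            ready, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(ready))
        in_flight.append(compute_chunk.remote(i, start, stop))
    results.extend(ray.get(in_flight))
    return results
```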

chhotii-alex commented 11 months ago

Also, note that the Ray documentation recommends against calling .get() on the entire list of futures at once, as this can also be a source of memory issues: https://docs.ray.io/en/latest/ray-core/patterns/ray-get-too-many-objects.html

However, the example code they give may process results out of order. That is a problem for generating the similarity matrix: if the chunks are assembled in the wrong order, the similarity matrix will be wrong. The remote function should return its chunk index so we can sort the returned chunks back into chunk-index order, as in the sketch below.
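
For example, a minimal sketch under the same assumptions as above (the hypothetical `compute_chunk`, a `chunk_bounds` list of row ranges, and the matrix shape are all illustrative): results are fetched a few at a time with `ray.wait`, and the returned chunk index is used to write each block into the correct rows, so out-of-order completion doesn't scramble the matrix.

```python
# Sketch: reassemble chunks into the similarity matrix using the chunk
# index returned by each remote task. Names are illustrative only.
import numpy as np
import ray


def assemble_matrix(futures, chunk_bounds, n_species, n_cols):
    matrix = np.empty((n_species, n_cols))
    remaining = list(futures)
    while remaining:
        # Fetch results as they complete instead of ray.get on the full list.
        ready, remaining = ray.wait(remaining, num_returns=1)
        for chunk_index, rows in ray.get(ready):
            start, stop = chunk_bounds[chunk_index]
            matrix[start:stop, :] = rows  # place the chunk in its own rows
    return matrix
```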