Closed chhotii-alex closed 9 months ago
Also, note that the Ray documentation recommends against calling `.get()` on the entire list of futures at once, since that is itself a potential source of memory issues:
https://docs.ray.io/en/latest/ray-core/patterns/ray-get-too-many-objects.html
However, notice that the example code they give may process results out of order. That is a problem for generating the similarity matrix: if the chunks are assembled in the wrong order, the similarity matrix will be wrong.
The remote function should return the chunk index so we can sort the returned chunks back into chunk-index order.
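A minimal sketch of that idea, independent of Ray: tag each result with its chunk index, then sort completed results by that index before assembling them. The function and data names here are hypothetical stand-ins, not the actual remote function in this codebase.

```python
def process_chunk(index, chunk):
    # Stand-in for the remote similarity computation; in Ray this would
    # be a @ray.remote function whose return value includes the index.
    return (index, [x * 2 for x in chunk])

# Simulate results completing out of order, as they may under Ray.
completed = [process_chunk(i, c) for i, c in [(2, [5, 6]), (0, [1, 2]), (1, [3, 4])]]

# Sort by chunk index so rows land in the right place in the matrix.
completed.sort(key=lambda pair: pair[0])
ordered = [result for _, result in completed]
```

With the index carried alongside each result, completion order no longer matters.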
When dealing with a large dataset (about 77,000 species) I get this:
And the job slowed down quite a bit, taking ~50% longer than extrapolation from a smaller dataset would predict: it took ~3 hours for a dataset half the size, so I expected doubling the dataset size to roughly quadruple the run time to ~12 hours, but it took 18 hours.
I believe this is because all the chunks are pinned in memory until `get(futures)` is called in `SimilarityFromFunction.weighted_similarities`. We ask for all the chunks to be created in one shot, and then `get` them all in one shot. See https://docs.ray.io/en/master/ray-core/patterns/limit-pending-tasks.html for discussion of how we could perhaps deal with this better.