ArnaoutLab / diversity

Partitioned frequency- and similarity-sensitive diversity in Python
MIT License

improve guidance on what technique to use to make similarity matrix #80

Closed chhotii-alex closed 10 months ago

chhotii-alex commented 10 months ago

The current "Advanced usage" section in the README makes it sound like there's almost no use case for passing in a similarity function. Only when the matrix doesn't fit on the hard drive? But people have multi-terabyte drives now.

I would favor using a similarity function over a .csv file for a large dataset. It would be much faster (and we could be talking about weeks or months of compute time) because:

  1. similarity from function uses ray, which parallelizes the job of calculating all N x N entries (a huge win if there are 10 cores)
  2. using a .csv file requires both more I/O time (disk being much slower than RAM) and conversion from string format
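
To make the two routes concrete, here is a rough sketch, not the package's actual code. The `features` data, the `similarity` function, and the `similarity_row` helper are all made up for illustration, and a standard-library thread pool stands in for ray so the snippet runs without extra dependencies. It only shows that the rows of the matrix are independent (so they can be farmed out to workers), and that the .csv route adds a write, a read, and a string-to-float conversion for every entry.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each item is a short feature vector.
features = [(0.0, 1.0), (1.0, 1.0), (3.0, 0.5), (0.5, 2.0)]


def similarity(a, b):
    """Hypothetical similarity: 1 / (1 + Manhattan distance)."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


def similarity_row(i):
    """Compute row i of the N x N matrix; no row depends on any other."""
    return [similarity(features[i], other) for other in features]


# Route 1: compute the matrix from the function, one row per task.
# The thread pool is only a stand-in (an assumption of this sketch):
# ray would spread these tasks across processes and cores, whereas
# threads give no real speed-up for pure-Python arithmetic.
with ThreadPoolExecutor(max_workers=4) as pool:
    from_function = list(pool.map(similarity_row, range(len(features))))

# Route 2: the .csv route. The matrix is serialized to text, then
# every cell has to be read back and parsed from a string to a float.
buffer = io.StringIO()  # stands in for a file on disk
csv.writer(buffer).writerows(from_function)
buffer.seek(0)
from_csv = [[float(cell) for cell in row] for row in csv.reader(buffer)]
```

Both routes end up with the same numbers; the difference is only in how much work it takes to get there, which is what matters when N is large.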