KoslickiLab / DiversityOptimization

Minimizing biological diversity to improve microbial taxonomic reconstruction
MIT License

Large matrices and scipy.optimize.nnls not working with sparse matrices #14

Open · dkoslicki opened this issue 4 years ago

dkoslicki commented 4 years ago

From Slack:

Chase Colbert (1 day ago): Hi @David Koslicki, over the weekend I tried running tests on the first mock community with larger k sizes. However, even natively, there were memory problems with small_k greater than or equal to 8. I realized this is because I convert A_k_small to a dense matrix when I pass it to MinDivLP. It seems that scipy.optimize.nnls only accepts dense matrices, so I wonder if we may have to write our own implementation of the Lawson-Hanson algorithm in order to handle larger small_k values. Did you run into this in the Python implementation of Quikr?
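For scale, a rough sketch of the memory arithmetic behind the failure, assuming the usual 4^k rows (one per possible k-mer) in the sensing matrix; the column count here is a made-up placeholder:

```python
k = 8
n_rows = 4 ** k     # one row per possible k-mer: 65,536 when k = 8
n_cols = 50_000     # hypothetical number of reference organisms
dense_gb = n_rows * n_cols * 8 / 1e9      # float64 entries
print(f"dense copy: ~{dense_gb:.0f} GB")  # ~26 GB, vs. roughly 12-16 bytes
                                          # per *nonzero* in CSR sparse storage
```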

David Koslicki (23 hours ago): Hmmm..., I wondered if that might become a problem. When I implemented Quikr, I did it in two languages, Matlab and Julia, so I hadn't run into this before. I see a few options here:

  1. Use the low-level BLAS functions to mimic the Julia implementation of NNLS, but in Python
  2. See if a different BLAS backend underlying numpy (check np.__config__) leads to better performance
  3. Try to find a sparse implementation of nnls (like this one from this paper)
  4. See if lsq_linear (which does accept sparse matrices) returns similar results to nnls. See this thread for a relevant discussion.

I would start with #4, as that's probably the easiest, then #3 (the tsnnls implementation; this would take a bit of work since it's written in C, and it would need to be wrapped with something like ctypes to be callable from Python, though a workaround would be to save the matrices to disk, subprocess.run the C code, and read the results back in; a sketch of that workaround follows).
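A minimal sketch of the file-based workaround for option #3. The tsnnls_solve binary name and its command-line interface are assumptions for illustration, not tsnnls's actual API:

```python
import subprocess
import numpy as np
from scipy import sparse
from scipy.io import mmwrite

def tsnnls_via_files(A, b, binary="./tsnnls_solve"):
    """Hypothetical wrapper: write the problem to disk, invoke a compiled
    tsnnls driver, and read the solution back."""
    mmwrite("A.mtx", sparse.csc_matrix(A))  # Matrix Market keeps A sparse on disk
    np.savetxt("b.txt", np.asarray(b))
    # assumed CLI: tsnnls_solve <A.mtx> <b.txt> <x.txt>
    subprocess.run([binary, "A.mtx", "b.txt", "x.txt"], check=True)
    return np.loadtxt("x.txt")
```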

Chase Colbert (20 hours ago): Taking a look at the thread you linked for #4, I wonder if the implementation the original poster, alexfields, provided would be better, since it follows the Lawson-Hanson algorithm as scipy.optimize.nnls does. Or might lsq_linear still give similarly sparse solutions?

David Koslicki (< 1 minute ago): The sparsity of the solutions is definitely something that needs to be investigated/tested. It's probably easiest to do this with lsq_linear, since it's a simple drop-in to test. The issue with the alexfields solution is that it requires A_dot_A, which could be massive in our case (though if it works, it might be OK thanks to the sparse storage of such A's).
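A minimal sketch of that drop-in test on a synthetic sparse problem (the sizes, density, and near-zero threshold are arbitrary choices): run nnls on a densified copy and lsq_linear on the sparse matrix, then compare how many entries each solver leaves at (near) zero:

```python
import numpy as np
from scipy import sparse
from scipy.optimize import nnls, lsq_linear

rng = np.random.default_rng(0)
A = sparse.random(2000, 300, density=0.02, format="csr", random_state=0)
x_true = np.zeros(300)
x_true[rng.choice(300, size=10, replace=False)] = rng.random(10)
b = A @ x_true                                   # synthetic measurements

x_dense, _ = nnls(A.toarray(), b)                # current dense-only path
res = lsq_linear(A, b, bounds=(0, np.inf))       # sparse-capable candidate

for name, x in [("nnls", x_dense), ("lsq_linear", res.x)]:
    print(name, "near-zero entries:", int(np.sum(x < 1e-8)))
```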

cmcolbert commented 4 years ago

lsq_linear appears to be unable to handle small_k > 6, since it tries to allocate a large numpy array of zeros, and the function provided by alexfields was taking longer than 30 minutes to complete the reconstruction for mock community 1 with large_k = 12, small_k = 4, q = 0.1, const = 1000. In light of this, I am working on option 3, using tsnnls, to see if it can improve speed.

cmcolbert commented 4 years ago

I have been looking at ways to possibly supplant the Lawson-Hanson algorithm, since some benchmarks report better runtimes for other nonnegative least squares algorithms. fnnls appears to be a good candidate: although it may require some adjustment for sparse input, it is at least implemented in Python. I'll see if I can set up a comparison.
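For reference, a from-scratch sketch of the Bro & de Jong fast-NNLS idea behind fnnls (not the fnnls package's actual code; the tolerance handling and iteration cap are simplifications). The appeal is that it works entirely from the normal equations AtA = A.T @ A and Atb = A.T @ b, so the tall sparse A never needs densifying, though (as noted above) the n x n AtA itself may be large; the sketch assumes it fits in memory as a dense array:

```python
import numpy as np

def fnnls_sketch(AtA, Atb, tol=None):
    """Solve min ||Ax - b||, x >= 0, given AtA = A.T @ A and Atb = A.T @ b."""
    n = Atb.shape[0]
    if tol is None:
        tol = 10 * np.finfo(float).eps * np.abs(AtA).max() * n
    P = np.zeros(n, dtype=bool)          # passive (unconstrained) variable set
    x = np.zeros(n)
    w = Atb - AtA @ x                    # negative gradient of the objective
    for _ in range(3 * n):               # simple safeguard against cycling
        if P.all() or w[~P].max() <= tol:
            break
        # move the most promising constrained variable into the passive set
        P[np.argmax(np.where(~P, w, -np.inf))] = True
        s = np.zeros(n)
        s[P] = np.linalg.solve(AtA[np.ix_(P, P)], Atb[P])
        # step back toward the previous feasible x until passive entries >= 0
        while P.any() and s[P].min() <= 0:
            bad = P & (s <= 0)
            alpha = np.min(x[bad] / (x[bad] - s[bad]))
            x = x + alpha * (s - x)
            P &= x > tol                 # variables pinned at zero leave P
            s = np.zeros(n)
            if P.any():
                s[P] = np.linalg.solve(AtA[np.ix_(P, P)], Atb[P])
        x, w = s, Atb - AtA @ s
    return x
```

If A is stored sparse, the inputs could be formed once, e.g. AtA = (A.T @ A).toarray() and Atb = A.T @ b; whether that dense n x n Gram matrix is affordable for our A's is exactly the concern raised above and would need checking.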