biocore / mds-approximations

Multidimensional scaling algorithms for microbiology-ecology datasets.

[WIP]: SVD #14

Closed HannesHolste closed 8 years ago

HannesHolste commented 8 years ago

Implements SVD, delegating to scipy's svd (LAPACK).

Open questions that need to be addressed:

@antgonza / @josenavas: 1) Is it correct to return only the real part of the eigenvectors/eigenvalues? I'm not sure how the output will be used, i.e., whether you want to keep the imaginary part.

eigvecs = np.array(U.real)
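For context, here is a minimal, self-contained sketch of the step in question (classical MDS via eigendecomposition of a double-centered distance matrix); all names are illustrative and this is not the PR's actual code:

```python
import numpy as np

def center_distance_matrix(d):
    """Double-center the squared distance matrix (Gower centering)."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return -0.5 * j @ (d ** 2) @ j

# Tiny example distance matrix (illustrative only).
d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
b = center_distance_matrix(d)

# np.linalg.eig can return complex values due to floating-point noise
# even for symmetric input; in exact arithmetic a symmetric matrix has
# real eigenvalues, so dropping the imaginary part is safe here.
eigvals, eigvecs = np.linalg.eig(b)
eigvals = np.real(eigvals)
eigvecs = np.real(eigvecs)
```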

2) For the unit test, I want to ensure that the returned coords, eigvals, percentages are correct. Jose suggested looking into unit tests in scipy/sklearn. I did that but I'm cautious about simply using sklearn.pca on the same distance matrix and then comparing my output with theirs using assertAlmostEqual, because they do some funky stuff to the eigenvectors: https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/decomposition/pca.py#L249

Any advice on how to properly unit test this, and the other methods I'm implementing? After a lot of reading and browsing I'm stuck on figuring out a good way to sanity-check my work.

3) For pcnts (percentage variation array), should they range from 0.0 to 1.0 or from 0 to 100? Is there a standard for this? The other methods like nystrom didn't calculate percentages at all, so there was no existing convention to follow.

4) In general, do you want methods to return standard python arrays, or to return and pass around numpy arrays? (currently returning numpy arrays)

antgonza commented 8 years ago

Just real is fine. I'm not sure we should test the output, as that's the responsibility of the library; however, we should check that we format that output correctly for our tools ...

antgonza commented 8 years ago

Expanding on my last reply:

  1. Yes, just real. I remember having this discussion over the actual implementation we currently have, you can follow here: https://github.com/biocore/scikit-bio/pull/90/files#diff-d5b2dfbdce30869718db88105cd9fdd2R18
  2. The biggest issue with unit testing is that the values/signs of the results might change depending on the hardware and software stack you are using. For example, from the same distance matrix we could get two flipped versions of the same plot on two different machines; this is a common occurrence during workshops. With this in mind, I would not test actual values coming from other libraries, just that we are parsing them correctly, and I would use really simple distance matrices for the methods we are implementing de novo here.
  3. We like 0-100, and for those methods without values we simply add N/A. Some of these methods "improve" their performance by not calculating the pcnts, or it simply is not possible due to their algorithms.
  4. I think we have always returned numpy arrays. I think this is fine.
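Following the 0-100 convention above, computing percentage variation from eigenvalues can be sketched as follows (the eigenvalues here are made up for illustration):

```python
import numpy as np

# Hypothetical eigenvalues from a decomposition; values are illustrative.
eigvals = np.array([4.0, 1.0, 0.0])

# 0-100 convention: each eigenvalue's share of the total variation.
pcnts = eigvals / eigvals.sum() * 100
# → array([80., 20., 0.])
```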

ElDeveloper commented 8 years ago
> 2. The biggest issue with unit testing is that depending on the hardware and software stack you are using the values/signs of the results might change. For example, from the same distance matrix we could get 2 flipped versions of the same plot in 2 different machines; this is a common thing during workshops. With this in mind, I would not test actual values from those methods coming from other libraries, just that we are parsing them correctly, and I would put really simple distance matrices for those other methods we are denovo implementing here.

I know this is just an example, but this was very cleanly solved by @Jorge-C in skbio: https://github.com/biocore/scikit-bio/blob/master/skbio/util/_testing.py#L248
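For illustration, one way to make such comparisons sign-invariant is to allow each eigenvector column to match up to a sign flip. This is a rough sketch of the idea, not skbio's actual implementation, and all names are made up:

```python
import numpy as np

def assert_columns_equal_up_to_sign(a, b, atol=1e-8):
    """Compare two matrices column by column, allowing each column's
    sign to be flipped (eigenvector signs are arbitrary)."""
    for col_a, col_b in zip(a.T, b.T):
        ok = (np.allclose(col_a, col_b, atol=atol)
              or np.allclose(col_a, -col_b, atol=atol))
        assert ok, "columns differ by more than a sign flip"

u1 = np.array([[1.0, 0.0], [0.0, 1.0]])
u2 = np.array([[-1.0, 0.0], [0.0, 1.0]])  # first column flipped
assert_columns_equal_up_to_sign(u1, u2)   # passes
```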


antgonza commented 8 years ago

Good point @ElDeveloper; we should also add skbio as a dependency to avoid duplicating code.

antgonza commented 8 years ago

One minor flake8 error.