biocore / DEICODE

Robust Aitchison PCA from sparse count data
Other
33 stars 17 forks source link

"variation of interest" seems to be split across axis 2 instead of axis 1, as expected? #52

Closed fedarko closed 4 years ago

fedarko commented 4 years ago

Other people in the lab and I have observed this with multiple pretty distinct datasets on DEICODE.

We discussed this this morning -- this might be an artifact of the eigenvalues? produced by scipy's SVD method being flipped, somehow (either before, during, or after deicode/rpca.py calls that function).

One of the datasets this happens with, weirdly enough, is the sleep apnea dataset used in the DEICODE paper. We'd expect the sample metadata value exposure_type (two values: Air or IHH) to explain most of the sample variation in the biplot, but the corresponding "separation" seems to be split mostly along Axis 2 now (not Axis 1, as was shown in the paper). See biplot below:

biplot

For reference, this was generated using DEICODE 0.2.3 in a QIIME 2019.10 environment (using scipy 1.3.1). Can send over the QZV file (which contains provenance / version info), if you'd like.

As another way of looking at this: in Qurro, taking the log-ratio of the top 30 to bottom 30 features for PC 1 doesn't show much separation between Air and IHH exposure_types; however, for PC 2, the same log-ratio (but with different features, based instead on the PC 2 feature loadings) shows clear separation that seems to be analogous to what is shown in the DEICODE paper.

see below qurro plots for reference (sidenote, next version of Qurro will have controls to make doing these selections a lot easier... did this in Qurro 0.4.0 for reference.)

Axis 1:

axis1

Axis 2:

axis2

Lastly, looking in Qurro at the log-ratio of Coriobacteriaceae to Clostridium (shown in Figure 5 in the DEICODE paper) shows that features classified as belonging to these groups are extremely ranked (roughly similar to what's shown in Fig. 5, but flipped?) mostly for Axis 2, but not for Axis 1:

Axis 1: cc1

Axis 2: cc2

Axis 3: cc3

To me, it seems like somewhere along the line things started getting "flipped" or otherwise switched around somehow. The fact that the sleep apnea dataset only started experiencing this at a certain point should make it possible to test this -- the "brute force" testing method would just be going back to an old version of DEICODE and seeing if this problem is still present, and if not increasing the version until you find where the switch happened (could use something like binary search I guess :) We could theoretically automate this process, but taking the time to do that might be slower than just manual testing.

If that testing doesn't work (i.e. all versions of DEICODE show this problem???), we could try messing with the scipy version / other SVD parameters as we discussed this morning and seeing if that impacts things? Or this might be another bug from somewhere else.

...Also, this all presupposes that the paper figures' separation among PC 1 is what is the "correct" output. It could also be the case that those are wrong and the latest behavior is correct instead (but the whole consistently-separating-among-axis-2 thing is fishy, and makes me think that the current behavior is a reflection of a problem that was introduced somewhere).

mortonjt commented 4 years ago

Is this related to this issue? https://github.com/biocore/DEICODE/issues/32

fedarko commented 4 years ago

It might be. As I understand it the problem in #32 was that the eigenvalues were in increasing instead of decreasing order (?), and in the sleep apnea biplot produced by DEICODE things seem to be properly decreasing (although speaking off-hand these seem like pretty huge eigenvalues, I guess? at least compared to what I'm used to seeing in these sorts of files.)

deicodething

Perhaps the change to #32 may have caused this problem, or failed to account for it -- @mortonjt's comment here stands out to me as something that might be worth trying.

cameronmartino commented 4 years ago

I am a bit confused because of the following re-centering that relies on scipy's svd function

https://github.com/biocore/DEICODE/blob/b1f37059b79f6ca2e2db11ba2fb7500c1a92f87e/deicode/rpca.py#L9

which outputs already sorted in "non-increasing order". Therefore the argsort function should not change anything. @mortonjt do you think this could be caused by gesdd not converging?

https://github.com/biocore/DEICODE/blob/b1f37059b79f6ca2e2db11ba2fb7500c1a92f87e/deicode/rpca.py#L45-L58

cameronmartino commented 4 years ago

closing this as it should be addressed in PR #54. Reopen if you experience more issues.