SystemsGenetics / KINC

Knowledge Independent Network Construction

Yeast dataset not producing a scale-free network with GMMs #93

Closed · bentsherman closed this issue 5 years ago

bentsherman commented 5 years ago

Apparently the Yeast network should be thresholded at ~0.85 to be scale-free, but right now KINC is thresholding Yeast at ~0.95, which produces a global network. Running KINC without GMMs still produces a threshold of 0.85, but enabling GMMs in some cases yields a 0.95 threshold. It is not yet clear whether the result differs between CPU / CUDA / OpenCL, or between 3.2.2 and 3.3.0, so we need to run all of these cases and identify which ones are correct and which are not.
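
For reference, a quick way to sanity-check the scale-free claim at a given threshold is to fit the degree distribution of the extracted network on log-log axes. Here is a minimal sketch, assuming a tab-delimited edge list with `Source`, `Target`, and `Similarity` columns (hypothetical names, not necessarily KINC's extract format):

```python
# Sketch: estimate the power-law exponent of a thresholded network's
# degree distribution. File name and column names are assumptions.
import numpy as np
import pandas as pd

def powerlaw_slope(edge_file, threshold):
    edges = pd.read_csv(edge_file, sep='\t')
    edges = edges[edges['Similarity'].abs() >= threshold]

    # Node degree = number of edges touching each gene.
    degrees = pd.concat([edges['Source'], edges['Target']]).value_counts()
    k, counts = np.unique(degrees.values, return_counts=True)

    # Scale-free: P(k) ~ k^-gamma, i.e. a straight line on log-log axes.
    slope, _ = np.polyfit(np.log10(k), np.log10(counts), 1)
    return slope

# Compare the two candidate thresholds, e.g.:
# print(powerlaw_slope('Yeast.net.txt', 0.85))
# print(powerlaw_slope('Yeast.net.txt', 0.95))
```

A roughly linear log-log fit with slope around -2 to -3 is what a scale-free network should show; a densely connected global network won't.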

bentsherman commented 5 years ago

The data I have so far:

| version | clusmethod | impl | threshold |
| ------- | ---------- | ---- | --------- |
| 3.2.2   | none       | cpu  | 0.85      |
| 3.2.2   | none       | gpu  | 0.85      |
| 3.2.2   | gmm        | cpu  | -         |
| 3.2.2   | gmm        | gpu  | 0.95      |
| 3.3.0   | none       | cpu  | 0.85      |
| 3.3.0   | none       | gpu  | 0.85      |
| 3.3.0   | gmm        | cpu  | -         |
| 3.3.0   | gmm        | gpu  | 0.95      |

The two missing data points take longer to acquire, but it already looks like the higher threshold occurs whenever GMMs are enabled. The fact that it happens with 3.2.2 as well tells me that RMT may be to blame: we changed RMT to reduce the multi-modal similarity matrix to a single mode per gene pair, and we never verified that change on Yeast, so I'm going to look into RMT again.
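
To illustrate what that reduction does, here is a toy sketch (not KINC's actual code, and KINC supports more than one reduction method; keeping the strongest correlation is just one plausible choice):

```python
# Toy sketch of reducing a multi-modal similarity matrix to one mode
# per gene pair, here by keeping the strongest correlation.
def reduce_pair(cluster_correlations):
    """cluster_correlations: per-cluster correlation values for one pair."""
    return max(cluster_correlations, key=abs)

# A pair whose GMM fit found two clusters:
print(reduce_pair([0.12, -0.91]))   # -0.91
```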

bentsherman commented 5 years ago

Here's an RMT log for the Yeast similarity matrix without GMMs:

thresh  prune   unique  chi
0.990   1   1   -1
0.989   1   1   -1
0.988   1   1   -1
0.987   1   1   -1
0.986   1   1   -1
0.985   1   1   -1
0.984   1   1   -1
0.983   1   1   -1
0.982   1   1   -1
0.981   1   1   -1
0.980   1   1   -1
0.979   1   1   -1
0.978   1   1   -1
0.977   1   1   -1
0.976   1   1   -1
0.975   1   1   -1
0.974   1   1   -1
0.973   1   1   -1
0.972   2   1   -1
0.971   2   1   -1
0.970   2   1   -1
0.969   2   1   -1
0.968   2   1   -1
0.967   2   1   -1
0.966   2   1   -1
0.965   3   1   -1
0.964   4   1   -1
0.963   4   1   -1
0.962   5   3   -1
0.961   5   3   -1
0.960   5   3   -1
0.959   5   3   -1
0.958   5   3   -1
0.957   8   3   -1
0.956   9   3   -1
0.955   9   3   -1
0.954   9   3   -1
0.953   11  5   -1
0.952   12  5   -1
0.951   13  5   -1
0.950   13  5   -1
0.949   15  6   -1
0.948   17  6   -1
0.947   17  6   -1
0.946   21  10  -1
0.945   21  10  -1
0.944   22  12  -1
0.943   25  12  -1
0.942   26  12  -1
0.941   26  12  -1
0.940   26  12  -1
0.939   29  14  -1
0.938   30  16  -1
0.937   33  16  -1
0.936   33  16  -1
0.935   33  16  -1
0.934   36  18  -1
0.933   39  20  -1
0.932   42  20  -1
0.931   44  20  -1
0.930   48  21  -1
0.929   49  23  -1
0.928   52  24  -1
0.927   55  24  -1
0.926   59  29  -1
0.925   62  30  -1
0.924   65  33  -1
0.923   69  34  -1
0.922   77  36  -1
0.921   80  38  -1
0.920   87  40  -1
0.919   91  43  -1
0.918   97  47  -1
0.917   99  47  -1
0.916   103 51  49.1739
0.915   106 52  61.3273
0.914   108 54  58.4966
0.913   113 60  53.6267
0.912   123 71  59.6175
0.911   127 78  58.2033
0.910   133 80  60.0903
0.909   136 84  61.5853
0.908   141 85  62.9752
0.907   150 94  53.224
0.906   152 96  66.8565
0.905   156 101 60.2791
0.904   161 104 68.39
0.903   168 108 58.4994
0.902   176 113 66.2343
0.901   184 121 69.185
0.900   188 126 63.696
0.899   193 132 65.4627
0.898   203 140 61.4581
0.897   208 143 73.7135
0.896   215 149 62.6007
0.895   221 157 62.0958
0.894   230 160 55.0021
0.893   239 172 65.0139
0.892   249 180 66.1747
0.891   256 183 68.6507
0.890   266 195 68.7294
0.889   277 201 73.3123
0.888   285 206 55.524
0.887   292 210 65.075
0.886   294 210 73.7646
0.885   301 213 71.7432
0.884   311 223 74.2462
0.883   322 231 66.0499
0.882   329 235 65.3956
0.881   333 238 67.238
0.880   339 242 71.1167
0.879   345 247 73.6388
0.878   352 255 76.0843
0.877   357 258 71.3061
0.876   366 268 73.1551
0.875   375 270 69.2696
0.874   387 286 75.245
0.873   398 292 74.1203
0.872   408 302 67.8335
0.871   422 311 75.9076
0.870   433 322 79.4866
0.869   444 327 86.4472
0.868   458 340 81.5034
0.867   469 347 79.1588
0.866   480 355 75.7115
0.865   494 372 82.2447
0.864   503 380 80.6523
0.863   510 387 75.792
0.862   523 396 91.2657
0.861   535 407 77.365
0.860   548 417 81.8684
0.859   567 429 107.292
0.858   579 438 117.022
0.857   594 448 95.9011
0.856   606 464 86.8465
0.855   623 480 122.236
0.854   640 491 106.851
0.853   648 504 134.658
0.852   659 517 131.346
0.851   670 526 116.934
0.850   691 547 132.743
0.849   709 572 139.193
0.848   723 581 133.358
0.847   740 598 146.253
0.846   756 612 151.105
0.845   769 617 158.802
0.844   787 631 169.441
0.843   799 649 179.234
0.842   820 663 200.29
0.856002

And with GMMs:

thresh  prune   unique  chi
0.990   4613    61  390.136
0.989   4613    77  276.703
0.988   4615    99  263.587
0.987   4704    131 324.417
0.986   4725    148 368.266
0.985   4727    181 416.131
0.984   4728    211 384.704
0.983   4831    258 197.236
0.982   4836    288 108.219
0.981   4911    316 164.173
0.980   4917    355 145.899
0.979   4957    384 171.118
0.978   5021    440 218.089
0.977   5064    474 209.316
0.976   5110    530 223.776
0.975   5116    569 225.077
0.974   5158    603 221.971
0.973   5161    658 283.673
0.972   5218    703 248.193
0.971   5221    749 186.44
0.970   5223    798 165.969
0.969   5228    848 169.751
0.968   5232    889 145.797
0.967   5237    928 124.81
0.966   5244    969 131.912
0.965   5252    1038    135.666
0.964   5258    1080    95.1461
0.963   5262    1140    80.5096
0.962   5265    1188    74.3832
0.961   5267    1247    75.979
0.960   5271    1308    83.6358
0.959   5275    1366    68.6064
0.958   5284    1422    81.7791
0.957   5285    1443    79.7833
0.956   5293    1511    108.501
0.955   5305    1562    134.306
0.954   5315    1629    176.324
0.953   5324    1686    199.499
0.952   5332    1739    249.203
0.957

The biggest difference I see is that with GMMs the pruned matrix is already much larger at th=0.99. I have a few things to look into.
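
For anyone following along, here is how I read each row of these logs: prune the similarity matrix at the threshold, compute its eigenvalues, count the unique ones, and run the chi-square test on the nearest-neighbor spacing distribution once there are enough unique eigenvalues. A conceptual sketch (my reading, not the actual KINC implementation; the `min_unique` cutoff and the Poisson comparison are assumptions):

```python
# Conceptual sketch of one row of the RMT log: (thresh, prune, unique, chi).
import numpy as np

def rmt_step(S, threshold, min_unique=50):
    # Prune: keep only genes with at least one |correlation| >= threshold,
    # then zero out the sub-threshold entries of the remaining submatrix.
    mask = np.abs(S) >= threshold
    np.fill_diagonal(mask, False)
    keep = mask.any(axis=1)
    sub = S[np.ix_(keep, keep)].copy()
    sub[np.abs(sub) < threshold] = 0.0
    np.fill_diagonal(sub, 1.0)

    eigvals = np.linalg.eigvalsh(sub)
    unique = np.unique(np.round(eigvals, 6))

    # chi stays -1 until there are enough unique eigenvalues, which is
    # why both logs show -1 at the highest thresholds.
    if len(unique) < min_unique:
        return keep.sum(), len(unique), -1.0

    # Nearest-neighbor spacing distribution, normalized to mean 1,
    # tested against the Poisson expectation P(s) = exp(-s).
    spacings = np.diff(unique)
    spacings /= spacings.mean()
    hist, edges = np.histogram(spacings, bins=60, range=(0.0, 3.0))
    expected = len(spacings) * (np.exp(-edges[:-1]) - np.exp(-edges[1:]))
    chi = np.sum((hist - expected) ** 2 / expected)
    return keep.sum(), len(unique), chi
```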

bentsherman commented 5 years ago

The pairwise reduction methods all yielded more or less the same result. Going to look at pairwise scatter plots...
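
In case anyone wants to reproduce these, a minimal sketch of how one might plot a single suspect pair; the file name, row labels, and genes-by-samples layout are all assumptions:

```python
# Minimal sketch for eyeballing one suspect gene pair.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical expression matrix: genes as rows, samples as columns.
emx = pd.read_csv('Yeast.emx.txt', sep='\t', index_col=0)
x = emx.loc['15S_rRNA']
y = emx.loc['YDL133C-A']

plt.scatter(x, y, s=10)
plt.xlabel('15S_rRNA')
plt.ylabel('YDL133C-A')
plt.savefig('corrdist.png')
```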

bentsherman commented 5 years ago

Oh look at that...

[scatter plot "corrdist": the pair transcript:15S_rRNA / transcript:YDL133C-A_0]

KINC's Spearman code is identifying clusters that are perfectly flat (zero correlation) as perfectly correlated. Kind of reminds me of Jordan's simulated data. Looks like we should have dealt with that edge case after all!
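
To illustrate the failure mode, here is a toy sketch (not KINC's actual Spearman code): if ranks are assigned by sort order without tie handling, two perfectly flat vectors both get ranks 0..n-1 in sample order, every rank difference is zero, and the naive formula returns a spurious 1.0:

```python
# Toy sketch of the flat-cluster edge case; not KINC's actual code.
import numpy as np
from scipy import stats

x = np.full(30, 3.7)   # perfectly flat "cluster"
y = np.full(30, 1.2)

def naive_spearman(a, b):
    # argsort-of-argsort ranking with no tie correction: a constant
    # vector gets ranks 0..n-1 in sample order, so d_i = 0 everywhere.
    ra = np.argsort(np.argsort(a, kind='stable'), kind='stable')
    rb = np.argsort(np.argsort(b, kind='stable'), kind='stable')
    d = ra - rb
    n = len(a)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(naive_spearman(x, y))        # 1.0 -- spurious "perfect" correlation
print(stats.spearmanr(x, y)[0])    # nan -- SciPy flags the constant input
```

A proper fix would be to detect zero-variance clusters and emit NaN (or skip the pair) instead of a correlation value.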

bentsherman commented 5 years ago

As a quick fix, I re-ran similarity with --maxcorr 0.99, which essentially removed perfect correlations like the one shown above. Here is the resulting RMT log:

thresh  prune   unique  chi
0.990   0   0   -1
0.989   16  1   -1
0.988   36  1   -1
0.987   149 3   -1
0.986   189 5   -1
0.985   218 9   -1
0.984   251 15  -1
0.983   389 31  -1
0.982   426 39  -1
0.981   535 47  -1
0.980   579 61  329.175
0.979   648 71  251.155
0.978   763 90  319.615
0.977   843 108 288.065
0.976   943 140 367.993
0.975   989 168 364.373
0.974   1081    189 380.286
0.973   1128    217 384.03
0.972   1234    257 397.712
0.971   1286    283 427.023
0.970   1353    328 325.621
0.969   1396    366 310.92
0.968   1451    409 273.73
0.967   1507    449 236.928
0.966   1551    472 234.246
0.965   1614    532 189.673
0.964   1672    554 182.11
0.963   1736    611 154.486
0.962   1783    647 129.565
0.961   1831    685 113.996
0.960   1897    730 89.2403
0.959   1964    779 71.2079
0.958   2026    814 62.573
0.957   2078    865 70.4274
0.956   2147    950 63.3342
0.955   2220    1009    78.4094
0.954   2272    1075    142.431
0.953   2329    1125    170.49
0.952   2399    1181    186.599
0.951   2455    1246    216.113
0.955

So RMT still settles on 0.95, and the extracted network is still global. However, I should note that even if I extract the network at 0.85, I still get a global network. So I think that 0.85 is not necessarily the correct threshold for Yeast when using GMMs. If 0.85 worked with KINCv1, then my guess would be that KINCv3 is producing a different similarity matrix compared to KINCv1.

spficklin commented 5 years ago

@bentsherman what do you mean by "global" network? Both the GMM and the traditional networks should be global (at either the 0.85 or the 0.95 cutoff). The network is only non-global if you then filter the edges down to condition-specific edges, which RMT doesn't do. I just want to make sure I'm fully understanding the problem.

bentsherman commented 5 years ago

My apologies, I should have simply said that the network is not scale-free; that's what I meant.

spficklin commented 5 years ago

If the code that populates the matrices that feed into RMT does not have bugs, then I do not think the problem is with the implementation. But it could be a side-effect of using GMMs, which I can give some thoughts on, but that would require a meeting as it would be difficult to explain here.

spficklin commented 5 years ago

Given our conversation today, I think we can close this out, as long as you, @bentsherman, are confident that the data being provided to the RMT code from the cluster files is correct.