Loss of some sample information in CCM file during Similarity

bentsherman commented 5 years ago

@spficklin In reviewing the KINC code with @4ctrl-alt-del today, something came up that I meant to mention a while back. In its current design, the CCM file does not save sample strings for gene pairs with only one cluster. The idea is that in the one-cluster case the sample string can be inferred entirely from the expression matrix. However, this is not true for sample codes 6 and 7: samples that were removed by the expression threshold or by outlier removal cannot be recovered from the expression matrix alone. I suppose the thresholded samples could be recovered if you provide the original expression threshold, but for outliers you would have to re-run outlier removal during extract.

So I just wanted to run this by you to see if this is a significant issue or not. Here are the scenarios:

1. Leave the code as is. Information about samples that were removed by thresholding or outlier detection will be lost for gene pairs with one cluster. Essentially, samples that are '6' and '7' will be inferred as '1' during extract, which may slightly alter things such as correlation and enrichment.
2. Change Similarity to preserve sample strings which contain '6' or '7', which may significantly increase the size of the CCM file (I would like to run some experiments to see how much bigger).
3. Change Extract to recover 6s and 7s by re-applying the expression threshold and re-running outlier removal. The first part would be trivial but the second part would significantly increase extract runtime.
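
The information loss described in the first scenario can be illustrated with a minimal Python sketch. This is not KINC code: the function names, the use of NaN for missing values, the `min_expression` parameter, and the toy data are all assumptions made for illustration. Only the sample codes 1, 6, 7, and 9 come from the discussion.

```python
import math

def infer_sample_string(x, y):
    """Rebuild a one-cluster sample string from two expression rows alone.

    Assumption: a missing value is stored as NaN. Nothing in the rows says
    which samples were thresholded or flagged as outliers, so every
    non-missing sample comes back as '1'.
    """
    return "".join(
        "9" if math.isnan(a) or math.isnan(b) else "1"
        for a, b in zip(x, y)
    )

def true_sample_string(x, y, min_expression, outliers):
    """Hypothetical stand-in for what Similarity knows when it runs.

    `outliers` is a set of sample indices flagged by some outlier test;
    the test itself is out of scope for this sketch.
    """
    codes = []
    for i, (a, b) in enumerate(zip(x, y)):
        if math.isnan(a) or math.isnan(b):
            codes.append("9")  # missing value
        elif a < min_expression or b < min_expression:
            codes.append("6")  # removed by expression threshold
        elif i in outliers:
            codes.append("7")  # removed as an outlier
        else:
            codes.append("1")  # member of the single cluster
    return "".join(codes)

nan = float("nan")
gene_a = [5.0, 0.1, nan, 6.2, 40.0]
gene_b = [4.8, 3.0, 2.0, 6.0, 5.1]

stored = true_sample_string(gene_a, gene_b, min_expression=1.0, outliers={4})
inferred = infer_sample_string(gene_a, gene_b)
print(stored)    # what Similarity computed
print(inferred)  # what Extract can rebuild without the stored string
```

In this toy case the 9 survives the round trip, while the 6 and the 7 both come back as 1.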

Stephen, let me know if you think this is a serious issue, and if so we can talk about how to deal with it.

spficklin commented 5 years ago

Hi @bentsherman. Thanks for noticing. Just to clarify: for the sample string that gets output when creating network files, will the sample string be all 1's in the case of these pair-wise comparisons? What about missing values (those with a 9)? Will the 9's be added in, since that can be determined from the expression matrix?

bentsherman commented 5 years ago

Yes, it already infers the 9s, so in that case the sample string is all 1s and 9s.

spficklin commented 5 years ago

I think the effect is minimal. A few outliers won't dramatically affect the condition-specific p-values, which we use the sample strings to calculate. But we should fix it. I think option 2 is what we need.

As there are generally few outliers, what if we stored the sample strings with some sort of encoding, such as run-length encoding (RLE)? Or does the CCM file expect a fixed size for every sample string?
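
For reference, a minimal sketch of run-length encoding applied to sample strings, written in Python. The list-of-pairs representation and the function names are assumptions chosen for illustration; the thread does not specify how an encoded string would be laid out in the CCM file.

```python
from itertools import groupby

def rle_encode(sample_string):
    """Collapse each run of identical codes into a (code, run_length) pair."""
    return [(code, len(list(run))) for code, run in groupby(sample_string)]

def rle_decode(pairs):
    """Expand (code, run_length) pairs back into the original string."""
    return "".join(code * count for code, count in pairs)

clean = "1" * 20                          # no removed samples
one_outlier = "1" * 9 + "7" + "1" * 10    # a single outlier in the middle

print(rle_encode(clean))
print(rle_encode(one_outlier))
```

A string with few outliers compresses well, but the encoded length now depends on the content: one run for the clean string, three runs for the string with a single outlier.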

4ctrl-alt-del commented 5 years ago

The CCM requires a fixed size. Its code would have to be scrapped entirely and redesigned from scratch to accommodate a variable-length sample mask. This would also possibly sacrifice the ability to quickly look up specific gene pair sample masks within the data file, because off the top of my head I cannot think of a way to do that with variable-length data.
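
The lookup concern can be sketched in Python as follows. The layout here is an assumption made purely for illustration (one ASCII byte per sample, records stored back to back with no header); the real CCM format is not described in this thread and may differ.

```python
SAMPLE_COUNT = 8            # hypothetical number of samples
RECORD_SIZE = SAMPLE_COUNT  # assumption: one byte per sample code

def record_offset(record_index):
    """With fixed-size records, the byte offset is plain arithmetic."""
    return record_index * RECORD_SIZE

def read_mask(blob, record_index):
    """Jump straight to one record without scanning the ones before it."""
    start = record_offset(record_index)
    return blob[start:start + RECORD_SIZE].decode("ascii")

masks = ["11111111", "11611111", "11911171"]
blob = "".join(masks).encode("ascii")

print(read_mask(blob, 2))
```

If each record had a different length, `record_offset` could no longer be computed from the index alone, which is the difficulty raised above.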

spficklin commented 5 years ago

We should fix the problem, and it sounds like some brainstorming is in order.

bentsherman commented 5 years ago

I just ran some experiments to measure the effect of fixing this issue on CCM data size. I modified Similarity to save the sample string for every correlation that is saved.

Before:

-rw-rw-r--  1 eceftl3 eceftl3   12204055 Mar  5 09:49 Yeast-1000.ccm
-rw-rw-r--  1 eceftl3 eceftl3  624813394 Mar  5 21:16 Yeast.ccm

After:

-rw-rw-r--  1 eceftl3 eceftl3  13252389 Mar  5 10:50 Yeast-1000.ccm
-rw-rw-r--  1 eceftl3 eceftl3 676023346 Mar  5 15:26 Yeast.ccm

These numbers suggest an increase of roughly 8-9% in CCM size (8.6% for Yeast-1000 and 8.2% for Yeast). So I think it's probably worth including this fix.
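
The relative increase can be recomputed from the byte counts in the two listings with a short Python check. Only the file sizes are taken from the listings; the helper name is arbitrary.

```python
# (before, after) file sizes in bytes, copied from the ls output
sizes = {
    "Yeast-1000.ccm": (12204055, 13252389),
    "Yeast.ccm": (624813394, 676023346),
}

def percent_increase(before, after):
    """Relative growth of `after` over `before`, in percent."""
    return (after - before) / before * 100

for name, (before, after) in sizes.items():
    print(f"{name}: {percent_increase(before, after):.1f}%")
```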

spficklin commented 5 years ago

Did you have a threshold cutoff (e.g. +-0.5) or does that represent everything?

bentsherman commented 5 years ago

Yes, that was with the default threshold of 0.5.

bentsherman commented 5 years ago

Merged into master.

SystemsGenetics / KINC

Loss of some sample information in CCM file during Similarity #78