Closed by bentsherman 5 years ago
Hi @bentsherman. Thanks for noticing. Just to clarify: for the sample string that is output when creating network files, will it be all 1s in the case of these pairwise comparisons? What about missing values (those with a 9)? Will the 9s be added in, since that can be determined from the expression matrix?
Yes, it already infers the 9s, so in that case the sample string is all 1s and 9s.
I think the effect is minimal. A few outliers won't dramatically affect the condition-specific p-values, which we calculate from the sample strings. But we should fix it. I think option 2 is what we need.
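To make the inference concrete, here is a minimal sketch of how a one-cluster sample string could be rebuilt from two rows of the expression matrix. It is not KINC's actual code: the function name `infer_sample_string` and the use of NaN to represent a missing value are assumptions made for illustration.

```python
import math

def infer_sample_string(x_row, y_row):
    """Rebuild a one-cluster sample string from two expression rows.

    Hypothetical helper: '9' marks a sample where either gene has a
    missing value (NaN here, by assumption), '1' marks every other sample.
    """
    codes = []
    for x, y in zip(x_row, y_row):
        missing = math.isnan(x) or math.isnan(y)
        codes.append("9" if missing else "1")
    return "".join(codes)

gene_a = [1.2, float("nan"), 3.4, 2.2]
gene_b = [0.9, 1.1, float("nan"), 2.0]
print(infer_sample_string(gene_a, gene_b))  # 1991
```

Nothing in this reconstruction can produce any code other than 1 or 9, which is why codes assigned during similarity (such as thresholded or outlier samples) are lost.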
As there are generally few outliers, what if we stored the sample strings with some sort of encoding, such as run-length encoding (RLE)? Or does the CCM file expect a fixed size for every sample string?
The CCM requires a fixed size. Its code would have to be scrapped 100% and redesigned from scratch to accommodate a variable-length sample mask. This would also possibly sacrifice the ability to quickly look up specific gene pair sample masks within the data file, because off the top of my head I cannot think of a way to do that with variable-length data.
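For reference, the RLE idea might look like the sketch below. The helper names `rle_encode` and `rle_decode` and the example mask are hypothetical, and this is only one of several possible run-length schemes.

```python
from itertools import groupby

def rle_encode(sample_string):
    """Collapse runs of identical sample codes into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(sample_string)]

def rle_decode(pairs):
    """Expand (code, run_length) pairs back into the full sample string."""
    return "".join(code * count for code, count in pairs)

mask = "1111111711119111"  # mostly 1s with one outlier (7) and one missing value (9)
encoded = rle_encode(mask)
print(encoded)  # [('1', 7), ('7', 1), ('1', 4), ('9', 1), ('1', 3)]
print(rle_decode(encoded) == mask)  # True
```

The number of pairs depends on how many runs the mask contains, so the encoded records differ in length from one gene pair to the next.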
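The lookup concern can be illustrated with a toy fixed-size layout. This is a sketch under assumed values, not the real CCM format: `HEADER_SIZE`, `NUM_SAMPLES`, the one-byte-per-sample encoding, and both function names are made up for the example.

```python
import io

NUM_SAMPLES = 8            # hypothetical number of samples per mask
HEADER_SIZE = 4            # hypothetical header holding the record count
RECORD_SIZE = NUM_SAMPLES  # one byte per sample code in this toy layout

def write_masks(masks):
    """Write fixed-size sample masks back to back after a small header."""
    buf = io.BytesIO()
    buf.write(len(masks).to_bytes(HEADER_SIZE, "little"))
    for mask in masks:
        if len(mask) != NUM_SAMPLES:
            raise ValueError("every mask must have exactly NUM_SAMPLES codes")
        buf.write(mask.encode("ascii"))
    return buf

def read_mask(buf, index):
    """Jump straight to record `index`; no scan over earlier records is needed."""
    buf.seek(HEADER_SIZE + index * RECORD_SIZE)
    return buf.read(RECORD_SIZE).decode("ascii")

masks = ["11111111", "11171191", "19111111"]
buf = write_masks(masks)
print(read_mask(buf, 1))  # 11171191
```

With variable-length records the offset of record `index` can no longer be computed by multiplication, so a reader would need either a sequential scan or a separate offset index.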
We should fix the problem, and it sounds like some brainstorming is in order.
I just ran some experiments to measure the effect of fixing this issue on CCM data size. I modified Similarity to save the sample string for every correlation that is saved.
Before:
-rw-rw-r-- 1 eceftl3 eceftl3 12204055 Mar 5 09:49 Yeast-1000.ccm
-rw-rw-r-- 1 eceftl3 eceftl3 624813394 Mar 5 21:16 Yeast.ccm
After:
-rw-rw-r-- 1 eceftl3 eceftl3 13252389 Mar 5 10:50 Yeast-1000.ccm
-rw-rw-r-- 1 eceftl3 eceftl3 676023346 Mar 5 15:26 Yeast.ccm
These numbers suggest an ~8% increase in CCM size (8.6% for Yeast-1000 and 8.2% for Yeast). So I think it's probably worth including this fix.
Did you have a threshold cutoff (e.g. ±0.5), or does that represent everything?
Yes, that was with the default threshold of 0.5.
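The percentages can be checked directly from the byte counts in the listings above. The helper name `percent_increase` is just for this check.

```python
# File sizes in bytes, copied from the before/after listings: (before, after).
sizes = {
    "Yeast-1000": (12204055, 13252389),
    "Yeast": (624813394, 676023346),
}

def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100

for name, (before, after) in sizes.items():
    print(f"{name}: {percent_increase(before, after):.2f}%")
# Yeast-1000: 8.59%
# Yeast: 8.20%
```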
Merged into master.
@spficklin While reviewing the KINC code with @4ctrl-alt-del today, something came up that I meant to mention a while back. The CCM file in its current design does not save sample strings for gene pairs with only one cluster. The idea is that in the one-cluster case, the sample string can be inferred entirely from the expression matrix. However, this is not true for sample codes 6 and 7: samples that were removed by the expression threshold or by outlier removal cannot be recovered from the expression matrix. I suppose the thresholded samples could be recovered if you provide the original expression threshold, but for outliers you would have to re-run outlier removal during extract.
So I just wanted to run this by you to see if this is a significant issue or not. Here are the scenarios:
Stephen, let me know if you think this is a serious issue, and if so we can talk about how to deal with it.
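As a rough illustration of the recovery point made above, the sketch below re-derives thresholded samples when the original expression threshold is supplied. Several things here are assumptions: the function name, NaN as the missing-value marker, the strict `<` comparison, and the reading that code 6 is the threshold code and 7 the outlier code.

```python
import math

def infer_with_threshold(x_row, y_row, min_expression):
    """Rebuild a one-cluster sample string when the original threshold is known.

    Assumed codes: '9' missing, '6' below the expression threshold, '1' kept.
    Outliers (assumed code '7') are not detected here; that would require
    re-running outlier removal.
    """
    codes = []
    for x, y in zip(x_row, y_row):
        if math.isnan(x) or math.isnan(y):
            codes.append("9")
        elif x < min_expression or y < min_expression:
            codes.append("6")
        else:
            codes.append("1")
    return "".join(codes)

gene_a = [5.0, 0.1, float("nan"), 4.0, 40.0]
gene_b = [6.0, 3.0, 2.0, 5.0, 4.5]
print(infer_with_threshold(gene_a, gene_b, min_expression=1.0))  # 16911
```

The last sample is labeled 1 even if outlier removal had originally discarded it, which is the part of the sample string that cannot be reconstructed this way.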