SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

Use sample string which embeds all clusters to reduce size #40

Closed bentsherman closed 6 years ago

bentsherman commented 6 years ago

It seems that discussions keep spawning more discussions... but I've had this idea for a while and the previous issue about the similarity analytic prompted me to bring it up.

We currently represent the sample string as a list of lists denoting binary membership in a cluster, for example:

00119 10009 01009

This format is highly redundant but allows us to easily include "error" codes like 6, 7, 8, and 9. However, the clustering analytics actually use this format:

1200(-9)

So each number actually denotes the cluster index, with negative numbers denoting error codes. Not as readable, but much more compact. Up to this point I would just use this format and convert appropriately when saving to the CCM, but we could also just use this format in the CCM. I know you guys said you were saving samples as 4-bit values, but as far as I can tell from ccmatrix.cpp they are still saved as 8-bit values. If that's the case, then this compressed format should reduce the file size by 2-4x depending on the number of clusters per gene pair, which might be enough savings to allow us to keep our analytics separated.

We can also convert easily between the two formats for things like displaying the CCM and converting to and from the KINC.R format.
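To make the conversion concrete, here is a rough sketch of how the two formats could map onto each other. None of this is taken from the KINC source: the type names and the `embed` / `expand` functions are hypothetical. It assumes that 0 means the sample is in no cluster, a positive value k means the sample is in cluster k (1-based), and that an error code applies to a sample across every cluster, so it can be stored once as a negative number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only (not KINC classes).
// ClusterRows: one row per cluster; each value is 0, 1, or an error code (6-9).
// Embedded: one value per sample (see the assumptions in the functions below).
using ClusterRows = std::vector<std::vector<std::int8_t>>;
using Embedded = std::vector<std::int8_t>;

// Collapse the per-cluster membership rows into one embedded sample string.
// Assumption: 0 = no cluster, k > 0 = member of cluster k, negative = error code.
Embedded embed(const ClusterRows& rows)
{
    if (rows.empty()) return {};

    Embedded out(rows[0].size(), 0);
    for (std::size_t k = 0; k < rows.size(); ++k) {
        for (std::size_t i = 0; i < out.size(); ++i) {
            const std::int8_t v = rows[k][i];
            if (v > 1) {
                // Error code: store it negated so it cannot collide with a cluster index.
                out[i] = static_cast<std::int8_t>(-v);
            } else if (v == 1 && out[i] == 0) {
                // Sample belongs to cluster k + 1 (1-based).
                out[i] = static_cast<std::int8_t>(k + 1);
            }
        }
    }
    return out;
}

// Expand an embedded sample string back into per-cluster membership rows.
ClusterRows expand(const Embedded& s, int numClusters)
{
    ClusterRows rows(numClusters, std::vector<std::int8_t>(s.size(), 0));
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] < 0) {
            // Error code: copy it into every cluster's row.
            for (auto& row : rows) row[i] = static_cast<std::int8_t>(-s[i]);
        } else if (s[i] > 0) {
            assert(s[i] <= numClusters);
            rows[s[i] - 1][i] = 1;
        }
    }
    return rows;
}
```

Under these assumptions, passing the three rows 00119, 10009, and 01009 to `embed` gives the values 2, 3, 1, 1, -9, so five stored values instead of fifteen, and `expand` with three clusters recovers the original rows. If an error code can differ between clusters for the same sample, this mapping would lose information and the embedded format would need an extra convention.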

bentsherman commented 6 years ago

We need to get a usable build of KINC out as soon as we can so that our genetics friends at least have something to play with, so I'm going to move forward with the Similarity analytic. Once we fix the CCM bug, make RMT cluster-aware, and apply the new analytic manager system to Similarity, then we'll have a usable build which we can release, and then I think we can focus on features like this one.

Although, as a side note, I wouldn't be surprised if using this format would just remove the source of the CCM bug altogether.

Correction: I should really stop calling it the "CCM bug" because it also occurs in the CMX, so I guess it's really the "GenePair::Base bug". In other words, using this CCM format will likely not remove the source of the bug.

bentsherman commented 6 years ago

Now that I have a thorough understanding of the GenePair::Base class from dealing with #24, I can see that implementing this format might be trickier than I thought. Since each cluster is stored separately, using a sample string format which embeds all of the clusters could require some refactoring of GenePair::Base. I think it's still doable, without changing the CCM / CMX class hierarchy, but it may not be trivial.

bentsherman commented 6 years ago

Now that (1) the CCM format saves samples in 4-bit values and (2) the CCM data is thresholded before it is saved, this potential feature is no longer necessary or viable. Closing.