GATB / simka

Simka and SimkaMin are comparative metagenomics methods dedicated to NGS datasets.
https://gatb.inria.fr/software/simka/
GNU Affero General Public License v3.0

SimkaMin output file symmetry #16

Open hjruscheweyh opened 4 years ago

hjruscheweyh commented 4 years ago

Dear SimkaMin Dev,

I recently stumbled upon your SimkaMin tool and tried to use it to compare my 4000 datasets against each other to get information on the similarity of these samples.

I found something odd in the output matrices: they don't seem to be symmetric. While the upper triangular part contains mostly values between 0.0 and 1, the lower triangular part contains mostly, but not exclusively, zeros. I would understand if the lower triangular part were empty, but a non-symmetric output is strange.

In fact, it seems that there is always a subpart that is symmetric, but most of the matrix is not.

I attached a screenshot of parts of the matrix.

Do you know what to do with this information? Should I only use the column-based distances?

Screenshot 2020-07-30 at 20 07 05

Best and thanks, Hans

hjruscheweyh commented 4 years ago

Adding @qclayssen as he will evaluate the matrix

clemaitre commented 4 years ago

Dear Hans,

Thank you for pointing out this behavior.

Since the last release, SimkaMin is supposed to output fully symmetric matrices, with the same values in the upper and lower triangular parts. So this is clearly a bug. After investigation, it turns out to happen when more than 100 datasets are compared. The distances are computed in "blocks" of 100 datasets, and the problem lies in the merging of the different parts of the full matrix: during the merge, 100x100 blocks of zeros are written to the lower triangular part instead of the values being copied from the corresponding upper triangular part.
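To check whether a given matrix is affected, the pattern described above can be looked for directly. The following is only a rough sketch, not SimkaMin code: it assumes the matrix has already been loaded as a square list of lists of floats (without the dataset ID labels), and the function name `zero_lower_blocks` and its `block_size` parameter are hypothetical names chosen for illustration.

```python
def zero_lower_blocks(matrix, block_size=100):
    """Return (row_block, col_block) indices of blocks strictly below the
    block diagonal whose entries are all zero.

    `matrix` is assumed to be a square list of lists of floats.
    `block_size` defaults to 100, the block size mentioned in this thread.
    """
    n = len(matrix)
    n_blocks = (n + block_size - 1) // block_size  # ceiling division
    found = []
    for bi in range(n_blocks):
        rows = range(bi * block_size, min((bi + 1) * block_size, n))
        for bj in range(bi):  # only blocks strictly below the block diagonal
            cols = range(bj * block_size, min((bj + 1) * block_size, n))
            if all(matrix[r][c] == 0 for r in rows for c in cols):
                found.append((bi, bj))
    return found
```

A non-empty result on a matrix with more than 100 datasets would be consistent with the merging problem. Note that a block of genuinely zero distances would also be reported, so the result is a hint rather than proof.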

We will try to fix this as soon as possible. In the meantime, you can safely use the values in the upper triangular part of the matrix, which are correct.
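If a full symmetric matrix is needed by downstream tools, the upper triangular values can be copied into the lower triangular part. Here is a minimal sketch of that workaround, assuming the matrix has already been loaded as a square list of lists of floats without the dataset ID labels; the function name `mirror_upper_triangle` is hypothetical and not part of SimkaMin.

```python
def mirror_upper_triangle(matrix):
    """Return a symmetric copy of `matrix` in which every entry below the
    diagonal is replaced by the matching entry above the diagonal.

    The input is left unchanged. It is assumed to be a square list of lists.
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("expected a square matrix")
    result = [list(row) for row in matrix]  # copy so the input is not modified
    for i in range(n):
        for j in range(i + 1, n):
            result[j][i] = result[i][j]  # copy upper (i, j) into lower (j, i)
    return result


# Small example: the lower triangle holds incorrect zeros.
broken = [
    [0.0, 0.2, 0.7],
    [0.0, 0.0, 0.4],
    [0.0, 0.0, 0.0],
]
fixed = mirror_upper_triangle(broken)
```

For large matrices held in a NumPy array `m`, the same result can be obtained with `np.triu(m) + np.triu(m, 1).T`.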

Please let me know if this is not clear enough.

Best, Claire