Interpretation of scores

estebanpw / chromeister

A dotplot generator for large chromosomes

GNU General Public License v3.0

Hi there,

Thank you for developing chromeister. I have used the software for a series of genome comparisons. First, I compared two versions of genome assemblies from the same species. One of them is highly fragmented (~7000 scaffolds) and the other is a chromosome-level assembly with 16 large scaffolds (chr). I expected to see a small score for this comparison, but the score was 0.73. As you mention in your manuscript, a score close to 0 indicates identical sequences and 1 indicates no similarity at all (if I am right). I wonder if this could be the result of fragmentation in the first genome, and whether you have any suggestions to improve it.

I have also compared the genome assembly of this species (the second one) with two other related species, and the scores were 0.1 and 0.26, which were reasonable. Both of these species have chromosome-level assemblies.

I would appreciate it if you could guide me in this regard.

Best, Karim

Hello @Karimi-81

Sorry for the late reply.

Regarding the scoring metric, there are some things to note:

It was designed as a way of automatically scoring many comparisons (so that further processing scripts can be built on top of it when running multiple comparisons)
It is based on scoring continuous (adjacent) signals positively and discontinuous signals (gaps) negatively
It does not take into account the actual signal (the alignments)

Now, regarding your experiments: without seeing the plots I am mostly guessing, but if there is a lot of fragmentation, and especially if the contigs/scaffolds are not in order, then the resulting signal in the plot will be extremely scattered. Given how the scoring metric is designed, each gap between fragments will count negatively, thus raising the score to 0.73.
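To make the effect of fragmentation concrete, here is a minimal Python sketch of a gap-penalising score. It is not CHROMEISTER's actual formula: the functions `toy_gap_score` and `fragmented`, the grid size, and the piece order are all hypothetical and chosen only for illustration. The same diagonal signal is scored three times: intact, cut into 20 pieces placed out of order, and cut into 100 single-cell pieces placed out of order.

```python
def toy_gap_score(hits):
    """Toy score (NOT CHROMEISTER's metric): fraction of consecutive hits,
    ordered by x, that do not continue a diagonal. 0 = continuous, 1 = scattered."""
    ordered = sorted(hits)
    if len(ordered) < 2:
        return 0.0
    breaks = 0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        # A hit continues the signal if it is the next cell on a forward
        # or reverse diagonal; anything else counts as a gap.
        if not (x1 - x0 == 1 and abs(y1 - y0) == 1):
            breaks += 1
    return breaks / (len(ordered) - 1)


def fragmented(piece_len, order):
    """Place the pieces of one diagonal along x in the given (hypothetical) order."""
    hits = []
    x = 0
    for piece in order:
        for k in range(piece_len):
            hits.append((x, piece * piece_len + k))
            x += 1
    return hits


# One unbroken diagonal of 100 cells.
collinear = [(i, i) for i in range(100)]

# The same 100 cells split into 20 pieces of 5 cells, in scrambled order.
# (7 * i) % 20 is a permutation of 0..19 in which no piece follows its neighbour.
pieces_of_5 = fragmented(5, [(7 * i) % 20 for i in range(20)])

# The same 100 cells split into 100 single-cell pieces, in scrambled order.
pieces_of_1 = fragmented(1, [(7 * i) % 100 for i in range(100)])

print(round(toy_gap_score(collinear), 2))    # 0.0
print(round(toy_gap_score(pieces_of_5), 2))  # 0.19
print(round(toy_gap_score(pieces_of_1), 2))  # 1.0
```

The underlying sequence content is identical in all three cases; only the number and order of the pieces change, and the toy score climbs from 0 towards 1 as the pieces get smaller.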

In the other cases, a score of 0.1 usually indicates that most of the signal is present, even with large evolutionary events; see for instance: [image]

or this one as well: [image]

In short, if the assembly is below chromosome level, the score can be worse because of all the gaps and breaks between alignments (and it gets worse with smaller and smaller contigs). I would recommend that you use the scoring metric only for its original purpose (such as automating scripts), or be careful about its interpretation!

Best regards, Esteban

Interpretation of scores #19