estebanpw / chromeister

A dotplot generator for large chromosomes
GNU General Public License v3.0

Interpretation of scores #19

Closed Karimi-81 closed 2 years ago

Karimi-81 commented 3 years ago

Hi there,

Thank you for developing chromeister. I have used the software for a series of genome comparisons. First, I compared two genome assemblies of the same species. One of them is highly fragmented (~7000 scaffolds) and the other is a chromosome-level assembly with 16 large scaffolds (chromosomes). I expected a small score for this comparison, but the value was 0.73. As you mention in your manuscript, scores close to 0 indicate identical sequences and scores close to 1 indicate no similarity at all (if I understand correctly). I wonder whether this could be the result of fragmentation in the first genome, and whether you have any suggestions to improve it.

I have also compared the second (chromosome-level) assembly with two related species, and the scores were 0.1 and 0.26, which seem reasonable. Both of these species have chromosome-level assemblies.

I would appreciate any guidance on this.

Best, Karim

estebanpw commented 3 years ago

Hello @Karimi-81

Sorry for the late reply.

Regarding the scoring metric, there are some things to note:

Now, regarding your experiments: without seeing the plots I am mostly guessing, but if there is a lot of fragmentation, and especially if the contigs/scaffolds are not in order, then the resulting signal in the plot will be extremely scattered. Given how the scoring metric is designed, each gap between fragments will count negatively, thus raising the score to 0.73.
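To make the fragmentation effect concrete, here is a toy sketch. It is not CHROMEISTER's actual scoring function: `toy_break_score` and `fragment_and_reorder` are hypothetical helpers, and the "score" is simply an assumed stand-in that counts how often an otherwise diagonal signal is interrupted. It only shows why many small, unordered pieces push a break-sensitive metric upwards even when the underlying sequence is the same.

```python
# Toy illustration only: this is NOT CHROMEISTER's scoring function.
# The signal is modelled as one hit per row; hit_columns[i] is the column
# of the hit in row i. A perfect main diagonal is [0, 1, 2, ...].


def toy_break_score(hit_columns):
    """Fraction of consecutive rows whose hits are not diagonal neighbours.

    Steps of +1 (forward diagonal) or -1 (reverse diagonal) count as
    continuous signal; anything else counts as a break.
    """
    n = len(hit_columns)
    if n < 2:
        return 0.0
    breaks = sum(
        1 for a, b in zip(hit_columns, hit_columns[1:]) if abs(b - a) != 1
    )
    return breaks / (n - 1)


def fragment_and_reorder(hit_columns, fragment_size):
    """Cut the signal into fixed-size fragments and reverse their order.

    This mimics scaffolds that are individually correct but not placed
    in chromosome order.
    """
    fragments = [
        hit_columns[i:i + fragment_size]
        for i in range(0, len(hit_columns), fragment_size)
    ]
    reordered = []
    for fragment in reversed(fragments):
        reordered.extend(fragment)
    return reordered


size = 1000  # hypothetical dotplot dimension
collinear = list(range(size))  # identical, fully ordered sequences

few_large_pieces = fragment_and_reorder(collinear, 250)  # 4 fragments
many_small_pieces = fragment_and_reorder(collinear, 2)   # 500 fragments

print("collinear:        ", toy_break_score(collinear))
print("few large pieces: ", toy_break_score(few_large_pieces))
print("many small pieces:", toy_break_score(many_small_pieces))
```

In this toy model the content of the two sequences never changes, only the number and order of the pieces, yet the value grows with the number of fragments.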

In the other cases, a score of 0.1 usually indicates that most of the signal is present, even with large evolutionary events. See for instance: [image]

or this one as well: [image]

In short, if the assembly is below chromosome level, the score can get worse because of all the gaps and breaks between alignments (and this gets worse with smaller and smaller contigs). I would recommend that you only use the scoring metric for its original purpose (such as automating scripts), or be careful about its interpretation!

Best regards, Esteban