Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool
Other
183 stars 23 forks source link

Difference about N50 between RepeatMedler and Quast #121

Closed kuangtianhui closed 3 years ago

kuangtianhui commented 3 years ago

I ran RepeatModeler, and I find Repeatmodeler also show the N50 about the assembly. It is bewildering that this N50 are smaller about 2 MB than the Quast' results of N50. How this difference occur???

rmhubley commented 3 years ago

Given a sorted list of contig lengths (longest to shortest), N50 is defined as the length of the shortest contig at the point in the list where 50% of the genome (assembly) is reached. I am not sure how Quast calculates it but RepeatModeler uses the following perl code:

@contigSizes = sort { $a <=> $b } @contigSizes; my $tabSize = 0; my $halfAssembly = $dbSize / 2; my $N50 = 0; for ( my $i = 0 ; $i < $#contigSizes ; $i++ ) { last if ( $tabSize + $contigSizes[ $i ] > $halfAssembly ); $tabSize += $contigSizes[ $i ]; $N50 = $contigSizes[ $i ]; } print " - N50 = $N50\n";

kuangtianhui commented 3 years ago

Thank your very much for your kind reply. It's very useful for me, and I will check how Quast calculate the N50. Thanks again!!!

rmhubley commented 3 years ago

I should have looked at this more closely. I updated the definition in my initial answer to be clear about the sorting order. It should be from longest to shortest. As you might have noticed the perl code I listed incorrectly sorts shortest to longest. This will often produce the same result if the distribution of contig sizes is rather smooth at the 50% point. However as you noticed it can be quite different otherwise. The sort line should look like this:

@contigSizes = sort { $b <=> $a } @contigSizes;

We will correct this in the next release of RepeatModeler but you can, of course, modify this one line in your installation to get the same result. Thank you for bringing this to our attention.

kuangtianhui commented 3 years ago

Thanks!!!!!!! Thank your for your kindness and patience!!!

jebrosen commented 3 years ago

This bug is one of several that have been fixed in RepeatModeler 2.0.2!

Thanks again for reporting this, and if you or anyone else reading this have other questions or problems please let us know.