ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Interpreting mafStats #1294

Closed IsaacDiaz026 closed 7 months ago

IsaacDiaz026 commented 7 months ago

Hello,

I just finished running Progressive Cactus and then ran cactus-hal2maf with --filterGapCausingDupes --dupeMode consensus.

When I check my MAF file with mafStats, it reports 845 unique sequences, ordered by # bases present:

PWN.Scaffold_3A: 50004273 ( 2.38%)
PWN.Scaffold_5A: 38824849 ( 1.85%)
PWN.Scaffold_2A: 33553799 ( 1.60%)

I notice that the scaffolds with the highest % bases present belong to my reference genome, but I don't fully understand the percentage. Does it mean that 2.38% of PWN.Scaffold_3A is represented in the alignment? Or does it mean that PWN.Scaffold_3A accounts for 2.38% of the total sequence in the alignment? I am trying to get a sense of how successful the alignment was. Here is the rest of my stats file.

File size: 3.13 GB
Lines: 12867087
Header lines: 1
s lines: 10511122
e lines: 0
i lines: 0
q lines: 0
Blank lines: 2355956
Comment lines: 8

Sequence chars: 2350499174 ( 84.65%)
Gap chars: 426254196 ( 15.35%)
Columns: 309915871

Blocks: 1177821
Ave block area: 2357.53
Max block area: 116655
Total block area: 2776753370
Ave block degree: 8.92
Max block degree: 11
Ave seq field length: 199.76
Max seq field length: 10774

glennhickey commented 7 months ago

Good question. I think it's the second (percentage of the alignment), but I am not super familiar with this tool. The Ave block degree: 8.92 is the average number of rows per block, so coverage is pretty high on average (assuming you have 11 genomes).
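
One way to sanity-check that reading is to back out the denominator each percentage implies, using only the numbers quoted above. The sketch below is a rough arithmetic check, not a description of how mafStats computes its figures (that would need confirming against the mafStats source). The variable names are made up for illustration, and the "approximate total bases" estimate (s lines times Ave seq field length) is an assumption about what the denominator might be.

```python
# Figures copied from the mafStats output quoted in this thread.
top_sequences = {
    "PWN.Scaffold_3A": (50004273, 2.38),
    "PWN.Scaffold_5A": (38824849, 1.85),
    "PWN.Scaffold_2A": (33553799, 1.60),
}
s_lines = 10511122
blocks = 1177821
ave_seq_field_length = 199.76
sequence_chars = 2350499174
gap_chars = 426254196
total_block_area = 2776753370


def implied_denominator(bases, percent):
    """Return the total that would make `bases` equal to `percent` of it."""
    return bases / (percent / 100.0)


# If the percentage were "fraction of this scaffold that is aligned", each
# implied denominator would be that scaffold's own length, so the three
# values would normally differ. A shared value points to a global total.
denominators = {
    name: implied_denominator(bases, percent)
    for name, (bases, percent) in top_sequences.items()
}
for name, total in denominators.items():
    print(f"{name}: implied denominator = {total:,.0f}")

# Assumption: total non-gap bases is roughly (number of s lines) x (average
# sequence field length). This is an estimate, not a value mafStats prints.
approx_total_bases = s_lines * ave_seq_field_length
print(f"s lines x ave seq field length = {approx_total_bases:,.0f}")

# Average block degree is the mean number of rows (s lines) per block.
ave_block_degree = s_lines / blocks
print(f"s lines / blocks = {ave_block_degree:.2f}")

# Total block area is sequence characters plus gap characters.
print(f"seq + gap chars = {sequence_chars + gap_chars:,}")
```

With these inputs the three implied denominators all land near the same value (about 2.1 billion) and close to the s lines x Ave seq field length estimate, with the small differences explained by the percentages being rounded to two decimals. That fits the "share of all aligned bases" reading better than "share of the scaffold", since the latter would require all three scaffolds to be the same, roughly 2.1 Gb, length. To measure how much of each reference scaffold is covered, you would instead compare its aligned bases against the scaffold's actual length.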