Open zhaotao1987 opened 8 months ago
Similar to this issue, I can't figure out how the coverage (rd:i:n) is calculated in the hap*.p.ctg.gfa file.(or any .gfa output file). Is it the average number of hifi reads (A line) per base across the contig sequence (L line)? For example below, how can the contig h2tg000334l have coverage of 1 (n=0 +1) with 5 somewhat overlapping reads while h2tg000335l has a coverage of 9 (n=8+1) but only 8 reads? Am I missing something? Thanks for your help and patience!
S h2tg000334l LN:i:11407 rd:i:0
A h2tg000334l 0 - m64175e_230414_121819/59835679/ccs 0 4723 id:i:85459 HG:A:m
A h2tg000334l 4063 - m64175e_230414_121819/100666093/ccs 0 3814 id:i:142013 HG:A:m
A h2tg000334l 5099 + m64175e_230414_121819/14877872/ccs 0 3590 id:i:21584 HG:A:m
A h2tg000334l 6681 + m64175e_230414_121819/75500103/ccs 0 3734 id:i:107579 HG:A:m
A h2tg000334l 7463 - m64175e_230414_121819/148113359/ccs 0 3944 id:i:205105 HG:A:m
S h2tg000335l LN:i:11276 rd:i:8
A h2tg000335l 0 + m64175e_230414_121819/163512842/ccs 0 5386 id:i:226118 HG:A:m
A h2tg000335l 146 + m64175e_230414_121819/117310622/ccs 0 5366 id:i:164256 HG:A:m
A h2tg000335l 1282 - m64175e_230414_121819/96929260/ccs 0 5127 id:i:136940 HG:A:m
A h2tg000335l 2640 - m64175e_230414_121819/116656107/ccs 0 4364 id:i:163405 HG:A:m
A h2tg000335l 3932 - m64175e_230414_121819/81986565/ccs 0 3119 id:i:116413 HG:A:m
A h2tg000335l 4128 + m64175e_230414_121819/14878721/ccs 0 3418 id:i:21607 HG:A:m
A h2tg000335l 5376 - m64175e_230414_121819/65339592/ccs 0 4843 id:i:93361 HG:A:m
A h2tg000335l 5876 - m64175e_230414_121819/118162350/ccs 0 5400 id:i:165389 HG:A:m
@jwinternitz I think the .gfa file is reporting only the reads used to built the assembly. duplicated or completely overlapping reads are not present. I tried to draw a plot with the read coverage and it was basically flat, even in the contigs that are marked with high coverage values
Ok, that makes sense. I was just hoping there may have been an output file with all the reads assigned to each contig, to check for potential chimeric contigs based on coverage regions across the sequence. I know that one can map the reads back to contigs using, for example, minimap2, but I'm not confident the same reads used to construct the contig would be present. Thanks for your reply!
Me too! I was looking for that, we probably need to align the reads again on the assembly to get the detailed coverage base by base
And then the question is, should we filter the mapped reads to keep only primary alignments, excluding secondary and supplementary alignments? I think so, but it would be nice to confirm with the alignments used by hifiasm to construct the contigs.
I agree with you, detailed coverage information could be interesting for the analysis of tandem repeat regions and rDNA
Hello, I'm look into the meaning of the columns of the lines in GFA files. Although I've checked the links you recommended, such as https://hifiasm.readthedocs.io/en/latest/interpreting-output.html and https://gfa-spec.github.io/GFA-spec/GFA1.html, But I didn't find what I want.. Ok, we start with the S-lines, for example:
Here, for Line 1, what's the meaning of rd:i:0 ?
For A lines, Line 2 means the read is 13921 bp long (16949-3029+1), so on and so forth? If this is true, the remaining lines indicate the read length is relatively short? For example, Line 5, the reads is only 2861 bp ? (15529-12669+1 ), well, N50 of my reads should be ~ 15Kb.
For L lines:
For my data the final column is always L2:i:0, what is that mean? Also Line 2 means h2tg000251l and h2tg000048l have 15760 bp overlap? is that 100% exact match? Then why not connect and extend the contig? and what does L1:i:20881mean, the start position of the alignment in h2tg000251l ?
Sorry for these naive questions and thanks very much and looking forward to your reply!
Best, Tao