Questions about L, A, S -Lines in GFA format.

zhaotao1987 commented 8 months ago

Hello, I'm look into the meaning of the columns of the lines in GFA files. Although I've checked the links you recommended, such as https://hifiasm.readthedocs.io/en/latest/interpreting-output.html and https://gfa-spec.github.io/GFA-spec/GFA1.html, But I didn't find what I want.. Ok, we start with the S-lines, for example:

S       utg007305l      *       LN:i:15672      rd:i:0
S       utg007306l      *       LN:i:15650      rd:i:0
S       utg007307l      *       LN:i:19230      rd:i:5
S       utg007308l      *       LN:i:17406      rd:i:0
S       utg007309l      *       LN:i:17315      rd:i:1

Here, for Line 1, what's the meaning of rd:i:0 ?

A       h2tg000001l     0       +       m84128_231027_101108_s1/174523732/ccs   0       16932   id:i:2582924    HG:A:m
A       h2tg000001l     3029    +       m84128_231027_101108_s1/245106603/ccs   0       16949   id:i:1319739    HG:A:m
A       h2tg000001l     5642    +       m84128_231027_101108_s1/86315159/ccs    0       16365   id:i:942834     HG:A:m
A       h2tg000001l     10180   -       m84128_231027_101108_s1/14877412/ccs    0       14738   id:i:1975099    HG:A:m
A       h2tg000001l     12669   +       m84128_231027_101108_s1/60233120/ccs    0       15529   id:i:1191506    HG:A:m
A       h2tg000001l     12846   -       m84128_231027_101108_s1/190974776/ccs   0       15898   id:i:2621153    HG:A:m
A       h2tg000001l     12882   +       m84128_231027_101108_s1/262801143/ccs   0       17446   id:i:1396225    HG:A:m

For A lines, Line 2 means the read is 13921 bp long (16949-3029+1), so on and so forth? If this is true, the remaining lines indicate the read length is relatively short? For example, Line 5, the reads is only 2861 bp ? (15529-12669+1 ), well, N50 of my reads should be ~ 15Kb.

For L lines:

L       h2tg000247c     -       h2tg000247c     -       16912M  L1:i:0  L2:i:0
L       h2tg000251l     +       h2tg000048l     -       15760M  L1:i:20881      L2:i:0
L       h2tg000264l     +       h2tg000084l     -       16163M  L1:i:15957      L2:i:0
L       h2tg000264l     +       h2tg000119l     -       15044M  L1:i:17076      L2:i:0
L       h2tg000270l     +       h2tg000083l     -       14219M  L1:i:20778      L2:i:0

For my data the final column is always L2:i:0, what is that mean? Also Line 2 means h2tg000251l and h2tg000048l have 15760 bp overlap? is that 100% exact match? Then why not connect and extend the contig? and what does L1:i:20881mean, the start position of the alignment in h2tg000251l ?

Sorry for these naive questions and thanks very much and looking forward to your reply!

Best, Tao

jwinternitz commented 8 months ago

Similar to this issue, I can't figure out how the coverage (rd:i:n) is calculated in the hap*.p.ctg.gfa file.(or any .gfa output file). Is it the average number of hifi reads (A line) per base across the contig sequence (L line)? For example below, how can the contig h2tg000334l have coverage of 1 (n=0 +1) with 5 somewhat overlapping reads while h2tg000335l has a coverage of 9 (n=8+1) but only 8 reads? Am I missing something? Thanks for your help and patience!

S h2tg000334l LN:i:11407 rd:i:0
A h2tg000334l 0 - m64175e_230414_121819/59835679/ccs 0 4723 id:i:85459 HG:A:m A h2tg000334l 4063 - m64175e_230414_121819/100666093/ccs 0 3814 id:i:142013 HG:A:m A h2tg000334l 5099 + m64175e_230414_121819/14877872/ccs 0 3590 id:i:21584 HG:A:m A h2tg000334l 6681 + m64175e_230414_121819/75500103/ccs 0 3734 id:i:107579 HG:A:m A h2tg000334l 7463 - m64175e_230414_121819/148113359/ccs 0 3944 id:i:205105 HG:A:m S h2tg000335l LN:i:11276 rd:i:8
A h2tg000335l 0 + m64175e_230414_121819/163512842/ccs 0 5386 id:i:226118 HG:A:m A h2tg000335l 146 + m64175e_230414_121819/117310622/ccs 0 5366 id:i:164256 HG:A:m A h2tg000335l 1282 - m64175e_230414_121819/96929260/ccs 0 5127 id:i:136940 HG:A:m A h2tg000335l 2640 - m64175e_230414_121819/116656107/ccs 0 4364 id:i:163405 HG:A:m A h2tg000335l 3932 - m64175e_230414_121819/81986565/ccs 0 3119 id:i:116413 HG:A:m A h2tg000335l 4128 + m64175e_230414_121819/14878721/ccs 0 3418 id:i:21607 HG:A:m A h2tg000335l 5376 - m64175e_230414_121819/65339592/ccs 0 4843 id:i:93361 HG:A:m A h2tg000335l 5876 - m64175e_230414_121819/118162350/ccs 0 5400 id:i:165389 HG:A:m

mirkocelii commented 8 months ago

@jwinternitz I think the .gfa file is reporting only the reads used to built the assembly. duplicated or completely overlapping reads are not present. I tried to draw a plot with the read coverage and it was basically flat, even in the contigs that are marked with high coverage values

jwinternitz commented 8 months ago

Ok, that makes sense. I was just hoping there may have been an output file with all the reads assigned to each contig, to check for potential chimeric contigs based on coverage regions across the sequence. I know that one can map the reads back to contigs using, for example, minimap2, but I'm not confident the same reads used to construct the contig would be present. Thanks for your reply!

mirkocelii commented 8 months ago

Me too! I was looking for that, we probably need to align the reads again on the assembly to get the detailed coverage base by base

jwinternitz commented 8 months ago

And then the question is, should we filter the mapped reads to keep only primary alignments, excluding secondary and supplementary alignments? I think so, but it would be nice to confirm with the alignments used by hifiasm to construct the contigs.

mirkocelii commented 8 months ago

I agree with you, detailed coverage information could be interesting for the analysis of tandem repeat regions and rDNA

chhylp123 / hifiasm

Questions about L, A, S -Lines in GFA format. #596