chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
534 stars 87 forks source link

GFA `A` lines #16

Closed apregier closed 4 years ago

apregier commented 4 years ago

I am wondering about the gfa output - is there any documentation about the lines beginning with A ? I assume that is giving the locations of the reads on the unitigs/contigs. I would really like to be able to extract out the supporting reads for any location on the contig (for example, to find how many reads support a variant between assemblies). It would also be nice to have the cigar to know exactly how the read lines up on the contig. It would save me a step and reduce the ambiguity that could result from having to align the reads back to the assembly after the fact. Would it be easy to add in the cigar?

chhylp123 commented 4 years ago

Here is an example of A line:

A utg000001l  2093    -   SRR11606870.1250244 0   16611   id:i:1250243    HG:A:a
A utg000001l  3572    -   SRR11606870.2648803 0   17169   id:i:2648802    HG:A:a

For SRR11606870.1250244, 'utg000001l' is the unitig name, 2093 and (3572 - 1) is the start position and end position in utg000001l, '-' is the direction in utg000001l, 'HG:A:a' is the haplotype label of SRR11606870.1250244 which is only useful for haplotype-resolved assembly like trio-binning.

As for cigar, I personally think most reads are exactly mapped to unitig since all reads have been corrected. I'm not sure if this assumption is enough for your project...

apregier commented 4 years ago

Thanks for the quick response - just clarifying, there are 2 A lines above. It looks like they refer to different reads. I assumed that 2093 and 3572 are the start coordinates of alignment on the unitig and 0 is the start coordinate on the read in each line, and 16611 and 17169 are either the end coordinates or the alignment lengths?

I agree the reads should match the contigs almost exactly - for now it is useful enough for me to just to pull the coordinates, although if it would be easy to put in a cigar, or even better paf-style cs tag that would be really helpful, since I assume there will be some bubbles that are popped, correct? Or would the popped reads just get discarded?

lh3 commented 4 years ago

It is technically difficult to output CIGAR because a small fraction of reads are not mapped exactly. We have no plan to deal with those. In addition, everything on the A-line is derived from corrected, not raw reads. CIGARs from corrected reads won't be useful to you anyway.

Or would the popped reads just get discarded?

Popped reads and contained reads are discarded. If you want to recruit reads in a region, you have to redo read alignment against the contigs.

apregier commented 4 years ago

Ok, thank you anyway!

chhylp123 commented 4 years ago

I agree. Another possible solution is to jointly check p_ctg, a_ctg and r_utg. It will be more accurate in extreme cases, but maybe not as easy as directly alignment.

It is technically difficult to output CIGAR because a small fraction of reads are not mapped exactly. We have no plan to deal with those. In addition, everything on the A-line is derived from corrected, not raw reads. CIGARs from corrected reads won't be useful to you anyway.

Or would the popped reads just get discarded?

Popped reads and contained reads are discarded. If you want to recruit reads in a region, you have to redo read alignment against the contigs.

chhylp123 commented 4 years ago

I assumed that 2093 and 3572 are the start coordinates of alignment on the unitig

Yes

and 0 is the start coordinate on the read in each line and 16611 and 17169 are either the end coordinates or the alignment lengths?

No, 16611 or 17169 is just read length, instead of alignment length.

zhenzhenyang-psu commented 3 years ago

Hi, I wonder what the number 1250243 after the "id:i:" in "id:i:1250243" means. thanks!

chhylp123 commented 3 years ago

It is the read ID, which is only useful for hifiasm itself.

zhenzhenyang-psu commented 3 years ago

i see. thanks!