Closed apregier closed 4 years ago
Here is an example of A line:
A utg000001l 2093 - SRR11606870.1250244 0 16611 id:i:1250243 HG:A:a A utg000001l 3572 - SRR11606870.2648803 0 17169 id:i:2648802 HG:A:a
For SRR11606870.1250244, 'utg000001l' is the unitig name, 2093 and (3572 - 1) is the start position and end position in utg000001l, '-' is the direction in utg000001l, 'HG:A:a' is the haplotype label of SRR11606870.1250244 which is only useful for haplotype-resolved assembly like trio-binning.
As for cigar, I personally think most reads are exactly mapped to unitig since all reads have been corrected. I'm not sure if this assumption is enough for your project...
Thanks for the quick response - just clarifying, there are 2 A
lines above. It looks like they refer to different reads. I assumed that 2093
and 3572
are the start coordinates of alignment on the unitig and 0
is the start coordinate on the read in each line, and 16611
and 17169
are either the end coordinates or the alignment lengths?
I agree the reads should match the contigs almost exactly - for now it is useful enough for me to just to pull the coordinates, although if it would be easy to put in a cigar, or even better paf-style cs tag that would be really helpful, since I assume there will be some bubbles that are popped, correct? Or would the popped reads just get discarded?
It is technically difficult to output CIGAR because a small fraction of reads are not mapped exactly. We have no plan to deal with those. In addition, everything on the A-line is derived from corrected, not raw reads. CIGARs from corrected reads won't be useful to you anyway.
Or would the popped reads just get discarded?
Popped reads and contained reads are discarded. If you want to recruit reads in a region, you have to redo read alignment against the contigs.
Ok, thank you anyway!
I agree. Another possible solution is to jointly check p_ctg, a_ctg and r_utg. It will be more accurate in extreme cases, but maybe not as easy as directly alignment.
It is technically difficult to output CIGAR because a small fraction of reads are not mapped exactly. We have no plan to deal with those. In addition, everything on the A-line is derived from corrected, not raw reads. CIGARs from corrected reads won't be useful to you anyway.
Or would the popped reads just get discarded?
Popped reads and contained reads are discarded. If you want to recruit reads in a region, you have to redo read alignment against the contigs.
I assumed that 2093 and 3572 are the start coordinates of alignment on the unitig
Yes
and 0 is the start coordinate on the read in each line and 16611 and 17169 are either the end coordinates or the alignment lengths?
No, 16611 or 17169 is just read length, instead of alignment length.
Hi, I wonder what the number 1250243 after the "id:i:" in "id:i:1250243" means. thanks!
It is the read ID, which is only useful for hifiasm itself.
i see. thanks!
I am wondering about the gfa output - is there any documentation about the lines beginning with
A
? I assume that is giving the locations of the reads on the unitigs/contigs. I would really like to be able to extract out the supporting reads for any location on the contig (for example, to find how many reads support a variant between assemblies). It would also be nice to have the cigar to know exactly how the read lines up on the contig. It would save me a step and reduce the ambiguity that could result from having to align the reads back to the assembly after the fact. Would it be easy to add in the cigar?