Closed y9c closed 2 years ago
Hello Chang,
Thank you very much for this issue!!! I introduced this bug in last commit (910c979fdd422d7ae0d78a554cb3e5187b06d267
), and I just fixed it. I am really sorry about this bug. HISAT-3N should be normal now.
There are 4 different cases a read can be mapped to reference(genome): Case 1. Forward sequence of the read mapped to forward strand of reference (Watson strand) Case 2. Forward sequence of the read mapped to Reverse strand of reference (Crick strand) Case 3. Reverse complement sequence of the read mapped to forward strand of reference (Watson strand) Case 4. Reverse complement sequence of the read mapped to Reverse strand of reference (Crick strand)
Yf tag will report the conversion number. For example, with --base-change C,T
option, Yf report the C->T count for case 1 and 4, G->A count for case 2 and 3.
XM tag will report mismatch and not include Yf tag.
For example with --base-change C,T
option:
Reference Sequence: NNNNNNNAAACCCCGGGGGTTTTTTTTTNNNNNNNNN
* --- *
Read Sequence: CAACTTTGGGGGTTTATTTTT
Yf is 3. XM is 2. The MD tag will show all the differences between reference and read sequence. So the MD tag should be: MD:Z:A3C0C0C8T5
Sorry again for the bug. Leo
By reading the code, I found that badConversion
is assign to XM
tag. So I think this is consistent with my understanding. But the badConversion
is not calculated properly?
Thank you very much @imzhangyun.
Can I have another tag, like Yn
to count number on unconverted based in the read. It will be 1 in your example.
Thank you for your suggestion. Since the Yn
tag could make the output SAM file bigger, we may not add Yn
tag to current HISAT-3N.
Also Yn
and Yf
may not represent all the C
in read/reference sequence.
For example:
Reference Sequence: NNNNNNNAAACCCCCCGGGGGTTTTTTTTTNNNNNNNNN
** --- ****
Read Sequence: AAAGGCTTTGGGGGTTTTTCCCC
In this case, Yn
is 1, XM
is 6. Yf
is 3. Yn
+ Yf
!= count of C
in the reference or read sequence. I think it is better to give this job to downstream analysis software/script.
Best, Leo
How about make all the tags optional, just in the same way as STAR aligner do? I wonder if there is any concern to make Yn + Yf= C counts?
Hello Chang,
Sorry for the last response. I just add a new tag Zf:i:<N>
. The Zf
tag counts the unconverted base in the read sequence.
Reference Sequence: NNNNNNNAAACCCCCCGGGGGTTTTTTTTTNNNNNNNNN
** --- ****
Read Sequence: AAAGGCTTTGGGGGTTTTTCCCC
In this case, Zf
= 1, Yf
= 3, XM
= 6. (--base-change C,T
)
Too get the HISAT-3N
with Zf tag, please check our "hisat-3n_Zf" branch. I tested it and it looks OK. Could you also test it and see if it works for you? Then I will merge it to the "hisat-3n' branch.
Best, Leo
Cool. Thank you very much for your help.
On Fri, Oct 8, 2021 at 2:58 PM Yun (Leo) Zhang @.***> wrote:
Hello Chang,
Sorry for the last response. I just add a new tag Zf:i:
. The Zf tag counts the unconverted base in the read sequence. Reference Sequence: NNNNNNNAAACCCCCCGGGGGTTTTTTTTTNNNNNNNNN --- ** Read Sequence: AAAGGCTTTGGGGGTTTTTCCCC
In this case, Zf = 1, Yf = 3, XM = 6. (--base-change C,T)
Too get the HISAT-3N with Zf tag, please check our "hisat-3n_Zf" branch. I tested it and it looks OK. Could you also test it and see if it works for you? Then I will merge it to the "hisat-3n' branch.
Best, Leo
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/DaehwanKimLab/hisat2/issues/328#issuecomment-939083355, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJKEVWXYCFXOIKM2SL4PCLUF5ENFANCNFSM5FSCE6ZQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
The
Yf
tag in hisat2-3n is a little confusing. When run hisat2-3n with--base-change C,T
argument, I guess the "number of conversion" should be the number of C bases that converted into T. But I found that this tag is reporting other kinds of mutations (mismatch), say G to T mutation, and the number of C to T mutation is not included. However, theXM
tag reports the exact number of C to T conversion. This looks very weird, because theYf
andXM
tag are documented asI would like to know if it is possible to report number of non C to T conversion in to XM tag, and use another two tags to report number of C to T conversion and number of non C to T conversion? C to T is an example, this can also be applied to other settings.