DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
464 stars 113 forks source link

[Feature request] Add tag for reporting number of converted/unconverted bases #328

Closed y9c closed 2 years ago

y9c commented 2 years ago

The Yf tag in hisat2-3n is a little confusing. When run hisat2-3n with --base-change C,T argument, I guess the "number of conversion" should be the number of C bases that converted into T. But I found that this tag is reporting other kinds of mutations (mismatch), say G to T mutation, and the number of C to T mutation is not included. However, the XM tag reports the exact number of C to T conversion. This looks very weird, because the Yf and XM tag are documented as

Yf:i:<N>: Number of conversions are detected in the read.

XM:i:<N> : The number of mismatches in the alignment. 

I would like to know if it is possible to report number of non C to T conversion in to XM tag, and use another two tags to report number of C to T conversion and number of non C to T conversion? C to T is an example, this can also be applied to other settings.

imzhangyun commented 2 years ago

Hello Chang,

Thank you very much for this issue!!! I introduced this bug in last commit (910c979fdd422d7ae0d78a554cb3e5187b06d267), and I just fixed it. I am really sorry about this bug. HISAT-3N should be normal now.

There are 4 different cases a read can be mapped to reference(genome): Case 1. Forward sequence of the read mapped to forward strand of reference (Watson strand) Case 2. Forward sequence of the read mapped to Reverse strand of reference (Crick strand) Case 3. Reverse complement sequence of the read mapped to forward strand of reference (Watson strand) Case 4. Reverse complement sequence of the read mapped to Reverse strand of reference (Crick strand)

Yf tag will report the conversion number. For example, with --base-change C,T option, Yf report the C->T count for case 1 and 4, G->A count for case 2 and 3.

XM tag will report mismatch and not include Yf tag.

For example with --base-change C,T option:

Reference Sequence: NNNNNNNAAACCCCGGGGGTTTTTTTTTNNNNNNNNN
                           *   ---        *
Read Sequence:             CAACTTTGGGGGTTTATTTTT

Yf is 3. XM is 2. The MD tag will show all the differences between reference and read sequence. So the MD tag should be: MD:Z:A3C0C0C8T5

Sorry again for the bug. Leo

y9c commented 2 years ago

By reading the code, I found that badConversion is assign to XM tag. So I think this is consistent with my understanding. But the badConversion is not calculated properly?

y9c commented 2 years ago

Thank you very much @imzhangyun.

Can I have another tag, like Yn to count number on unconverted based in the read. It will be 1 in your example.

imzhangyun commented 2 years ago

Thank you for your suggestion. Since the Yn tag could make the output SAM file bigger, we may not add Yn tag to current HISAT-3N.

Also Yn and Yf may not represent all the C in read/reference sequence. For example:

Reference Sequence: NNNNNNNAAACCCCCCGGGGGTTTTTTTTTNNNNNNNNN
                              ** ---          ****
Read Sequence:             AAAGGCTTTGGGGGTTTTTCCCC

In this case, Yn is 1, XM is 6. Yf is 3. Yn + Yf != count of C in the reference or read sequence. I think it is better to give this job to downstream analysis software/script.

Best, Leo

y9c commented 2 years ago

How about make all the tags optional, just in the same way as STAR aligner do? I wonder if there is any concern to make Yn + Yf= C counts?

imzhangyun commented 2 years ago

Hello Chang,

Sorry for the last response. I just add a new tag Zf:i:<N>. The Zf tag counts the unconverted base in the read sequence.

Reference Sequence: NNNNNNNAAACCCCCCGGGGGTTTTTTTTTNNNNNNNNN
                              ** ---          ****
Read Sequence:             AAAGGCTTTGGGGGTTTTTCCCC

In this case, Zf = 1, Yf = 3, XM = 6. (--base-change C,T)

Too get the HISAT-3N with Zf tag, please check our "hisat-3n_Zf" branch. I tested it and it looks OK. Could you also test it and see if it works for you? Then I will merge it to the "hisat-3n' branch.

Best, Leo

y9c commented 2 years ago

Cool. Thank you very much for your help.

On Fri, Oct 8, 2021 at 2:58 PM Yun (Leo) Zhang @.***> wrote:

Hello Chang,

Sorry for the last response. I just add a new tag Zf:i:. The Zf tag counts the unconverted base in the read sequence.

Reference Sequence: NNNNNNNAAACCCCCCGGGGGTTTTTTTTTNNNNNNNNN --- ** Read Sequence: AAAGGCTTTGGGGGTTTTTCCCC

In this case, Zf = 1, Yf = 3, XM = 6. (--base-change C,T)

Too get the HISAT-3N with Zf tag, please check our "hisat-3n_Zf" branch. I tested it and it looks OK. Could you also test it and see if it works for you? Then I will merge it to the "hisat-3n' branch.

Best, Leo

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/DaehwanKimLab/hisat2/issues/328#issuecomment-939083355, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJKEVWXYCFXOIKM2SL4PCLUF5ENFANCNFSM5FSCE6ZQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.