DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

reference file is not read correctly. #311

Closed: y9c closed this issue 2 years ago

y9c commented 2 years ago

The code in the hisat2-3n branch cannot parse the reference FASTA file correctly: the last record of the FASTA file is not read.

For example, if chr3 is the last record of the reference FASTA file and chr3 is reported in the SAM file, hisat2-3n stops with the error message "Cannot find the chromosome: chr3 in reference file."

>chr1
AAAA
>chr2
TTCC
>chr3
ATAG

https://github.com/DaehwanKimLab/hisat2/blob/f8b0dc34e304b1154622a9d9170cbcc8b6ea7db1/position_3n_table.h#L336

imzhangyun commented 2 years ago

Hello Chang Y,

Sorry about the bug. I just updated the hisat-3n code. Could you pull and build it again? I hope the updated hisat-3n-table works for you.

Best,

Leo

y9c commented 2 years ago

Thank you @imzhangyun.

y9c commented 2 years ago

Hi @imzhangyun,

This bug still exists for some chromosomes, such as:

>snoRNA-URS000003BA79_10090     Mus musculus (house mouse) Z51 small nucleolar RNA
TGTACATGATGAAAACAGTCTCCCTCTTCTGAATCTCGCTGAGGAAACTGCATGTCACCCTCCTGAAAAC
>snoRNA-URS0000042F48_10090     Mus musculus (house mouse) partial derived from hnRNA or mRNA fragment, or novel small non-messenger RNA without known sequence-or structural motifs
AGCTACTCCCCACCACCAGCACCCAAAGCTGGTATTCTAATTAAACTACTTCTTGAGTACATAAATTTACATAGTACAACAGTACATTTATGTAACA
>snoRNA-URS00000672D4_10090     Mus musculus (house mouse) partial C/D box snoRNA; small non-messenger RNA (snmRNA)
AAAAAAAGGAAGTGCCGNCCGATGCGACAACTGACGACATCCCTAGTTAGCTGACT

imzhangyun commented 2 years ago

Hello @y9c ,

I am sorry about this. Is it the same error as the original one, where the last chromosome cannot be found? Could you show me the exact error message generated by hisat-3n-table? Also, could you tell me the length of the last chromosome and how many reads mapped to it?

y9c commented 2 years ago

Hi @imzhangyun ,

The message is the same as the previous one: "Cannot find the chromosome: snoRNA-URS0000042F48_10090 in reference file."

This time, the sequence (snoRNA-URS0000042F48_10090) is not the last record of the file. It sits between two other records. The exact sequence is shown in my previous post.

imzhangyun commented 2 years ago

Did you sort the input SAM/BAM file?

imzhangyun commented 2 years ago

@y9c

I changed some code in hisat-3n-table. It should work now. Please check the code on the hisat-3n_TableChromNameFixing branch. I will merge it tomorrow.

y9c commented 2 years ago

Many thanks for the quick response. I will test the bug fix branch and let you know.

Chang

y9c commented 2 years ago

Hi Leo,

I figured out why the chromosome name is not parsed correctly: not all record names are separated from the description by a space. Some reference FASTA files use a tab. So I think it would be better to change this line to:

size_t endPosition = inputLine.find_first_of(" \t");

https://github.com/DaehwanKimLab/hisat2/blob/15e619c97b8c7e897c003179d99bbaa83210b58b/position_3n_table.h#L242-L242

Is it correct?
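To illustrate the proposed change, here is a minimal sketch of name extraction that stops at the first space or tab. `chromosomeName` is a hypothetical helper written for this example and is not a function from position_3n_table.h; only the `find_first_of(" \t")` call mirrors the line suggested above.

```cpp
#include <string>

// Hypothetical helper for this sketch; not a function from hisat2.
// Returns the record name from a FASTA header line such as ">chr1 description".
std::string chromosomeName(const std::string& headerLine) {
    // Skip the leading '>' if present.
    std::size_t start = (!headerLine.empty() && headerLine[0] == '>') ? 1 : 0;
    // Stop at the first space OR tab; searching for a space alone misses
    // tab-separated headers and leaves the description inside the name.
    std::size_t end = headerLine.find_first_of(" \t", start);
    if (end == std::string::npos) return headerLine.substr(start);
    return headerLine.substr(start, end - start);
}
```

For a header whose name and description are separated by a tab, this yields only the name, which is the form that appears in the SAM file.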

Chang

y9c commented 2 years ago

https://github.com/DaehwanKimLab/hisat2/pull/349/files

imzhangyun commented 2 years ago

Hello @y9c,

Sorry again for the bug. I believe I have solved the problem. Please pull the code from the hisat-3n_TableChromNameFixing branch.

Best,

Leo