Closed duzc-Repos closed 4 years ago
Can you check if the genome used to build the index and the genome used to check sequence are same?
Can you check if the genome used to build the index and the genome used to check sequence are same?
I have checked. Here are the details. Maybe you can try it?
691889_GTGCGTAATAATCTATCAA_1/1 GAAGAAATTGGCTTTAGCATTGGTTGGGTCCAATGCATTCTACTGAATATCAGTTGATATTGCCCCATTCTGAGCTCT
Location: chr1 2107389 2107312
I have downloaded the Arabidopsis thaliana genome from ensembl. (ftp://ftp.ensemblgenomes.org/pub/plants/release-45/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz)
build index
hisat2-build -p 5 Arabidopsis_thaliana.TAIR10.dna.toplevel.fa genome
align
hisat2 -x genome -c -U GAAGAAATTGGCTTTAGCATTGGTTGGGTCCAATGCATTCTACTGAATATCAGTTGATATTGCCCCATTCTGAGCTCT
...
0 16 1 2107312 60 78M * 0 0 AGAGCTCAGAATGGGGCAATATCAACTGATATTCAGTAGAATGCATTGGACCCAACCAATGCTAAAGCCAATTTCTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:78 YT:Z:UU NH:i:1
get sequence
samtools faidx Arabidopsis_thaliana.TAIR10.dna.toplevel.fa 1:2107312-2107389
>1:2107312-2107389 AGAGCTCAGAATGGGGCAATATCAACTGATATTCAGTAGAATGCATTGGACCCAACCAAT GCTAAAGCCAATTTCTTC
You can also extract the original genomic sequence from the index file.
hisat2-inspect genome > extract.fa
I have downloaded the Arabidopsis thaliana genome from ensembl. (ftp://ftp.ensemblgenomes.org/pub/plants/release-45/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz)
build index
hisat2-build -p 5 Arabidopsis_thaliana.TAIR10.dna.toplevel.fa genome
align
hisat2 -x genome -c -U GAAGAAATTGGCTTTAGCATTGGTTGGGTCCAATGCATTCTACTGAATATCAGTTGATATTGCCCCATTCTGAGCTCT ...
0 16 1 2107312 60 78M * 0 0 AGAGCTCAGAATGGGGCAATATCAACTGATATTCAGTAGAATGCATTGGACCCAACCAATGCTAAAGCCAATTTCTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:78 YT:Z:UU NH:i:1
get sequence
samtools faidx Arabidopsis_thaliana.TAIR10.dna.toplevel.fa 1:2107312-2107389
>1:2107312-2107389 AGAGCTCAGAATGGGGCAATATCAACTGATATTCAGTAGAATGCATTGGACCCAACCAAT GCTAAAGCCAATTTCTTC
You can also extract the original genomic sequence from the index file.
hisat2-inspect genome > extract.fa
Sorry for bothering you with this matter. I know where I went wrong. It's my fault. Thank you again for your help !
Hi, Recently, I mapped my paired-end RNA-seq data to Arabidopsis thaliana genome (ensembl) using HISAT2 with default parameters. However, the whole reads mapped to "N" region with tag NH:i:1. e.g. (not real sequence) ACTATCTATCTATCTATCATCTATCTAGCTAGCTAGTCGATCGTA NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN I wonder why this happened. Does this mean that mapping to N have the highest score? But there are many regions like this in the genome. Looking forward to your reply! Thank you! DU Zongchang