alexdobin / STAR

RNA-seq aligner
MIT License

CIGAR and query sequence are of different length (samtools after STAR) #2155

Open vitoriastavis opened 3 months ago

vitoriastavis commented 3 months ago

Hello, apologies in advance for the long issue. I'm using STAR and facing this error when using samtools:

$ samtools view -b Aligned.out.sam > out.bam

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1_sam] Parse error at line 711
samtools view: error reading file "Aligned.out.sam"

These were the commands for generating the genome index and for mapping, respectively:

nohup STAR --runMode genomeGenerate --genomeDir . --genomeFastaFiles ../GCF_000001405.40_GRCh38.p14_genomic.fna --sjdbGTFfile ../genomic.gtf --runThreadN 12 > Log.out 2>&1 &

nohup STAR --genomeDir ../star_genome/ --readFilesIn ../hsa_rna.fna --sjdbGTFfile ../genomic.gtf --runThreadN 12 > Log.out 2>&1 &

This is line 711 out of 744 lines of the .sam file:

NM_001286270.2 0 NC_000016.10 4495858 3 170M118N121M6477N216M2524N127M1284N15M * 0 0 ATTGTCCACTAAGGTCTGGCAGGTCTGATTGCCTCTTTTCAGGCACTGAGTGGTGGGGTATGCCATCCTCCCCTGCTGGAACCAGCCTTGGCCTGCCCTGTTAGTCATCAAAAATAGATCTCACCAGGGAACAATCTTCTCAGGTTGTTGTGTAATTTGAGTGAGCCAAGATGGAGTCTCGCTCTGTTGCCCAGGCTGGAGTGCAGTGGATCAGTCTAGCTCATTGCAGCCTCCACCTCCTGGGTTCAAGAGATTCTCCTGCCTCAGCCTCCTGTGTAGCTGGGATTACAGAGTCTTACTTTGTCGGCCAGGCTGGAGTGCAGTGGCATGATCTCGACTCACTGCAACCTCTGTCTCCCAGGCTCAAGAAATCCTCCTACATCAGCCTCCCAAGTAGCTGGGATTACAGGCTGGAGTGCAGTGGCTCCATCTCGGCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGGACCAGAGGAGCGAGAGCAGCAAGAACCACACCCAGCAGCAATGTCAGCGGAAGTGGAAACCTCAGAGGGGGTAGACGAGTCAGAAAAAAAGAACTCTGGGGCCCTAGAAAAGGAGAACCAAATGAGAATGGCTGACCTCTCG * NH:i:2 HI:i:1 AS:i:654 nM:i:0

If I delete this line, the error just points to the next line.
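For reference, the consistency check that samtools reports as failing can be reproduced outside samtools. The following is a minimal Python sketch, not samtools' own code: it assumes tab-separated SAM records and the SAM convention that only the M, I, S, = and X operations consume read bases. The function names `cigar_query_length` and `find_mismatches` are hypothetical. By this rule the CIGAR string on line 711 above accounts for 649 read bases.

```python
import re

# CIGAR operations that consume bases of the read (query) sequence.
QUERY_CONSUMING = set("MIS=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_query_length(cigar):
    """Sum the lengths of the CIGAR operations that consume read bases."""
    return sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in QUERY_CONSUMING
    )


def find_mismatches(sam_lines):
    """Yield (line_number, read_name, cigar_length, seq_length) for every
    alignment record whose CIGAR and SEQ lengths disagree."""
    for line_number, line in enumerate(sam_lines, start=1):
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, cigar, seq = fields[0], fields[5], fields[9]
        if cigar == "*" or seq == "*":
            continue  # nothing to compare
        cigar_length = cigar_query_length(cigar)
        if cigar_length != len(seq):
            yield line_number, name, cigar_length, len(seq)
```

To list every affected record in one pass instead of deleting lines one at a time, something like this should work:

```python
with open("Aligned.out.sam") as handle:
    for hit in find_mismatches(handle):
        print(*hit, sep="\t")
```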

Then I found this issue

I tried reducing the threads to 8, using the STAR binary from Linux_x86_64_static. The only difference is that the error now points to line 710.

In this case, this is line 710 out of 745 lines:

NR_026717.1 0 NC_000006.12 31971175 0 100M594N549M * 0 0 GTCTGACACAAGCATTAGTGAGATGCTCCCCTCGAAGAATAGTCTTGTTTCTTCTAAGGACTGATTCTCACCCCGGCTTTGGCTCTCCTAATTTTAGAGGGTCCTCCAAATGCAGTGAGGTTAGGAAGGACGTCTGCGCTCAGATCAAGAATCCAGTTACCTCAAAGCTCCCCAACTTCCACCTCCGCAGAGCTATGACGTCATGGCAGGCACGCCAGAGGCCGAAGGATGCAAAAGTGGTTTTCTGCTTTCGATGATGCAATCATTCAGCGACAGTGGCGGGCAAACCCCTCCCGGGGCGGGGGAGGTGTGAGCTTCACGAAGGAGGTTGACACCAACGTGGCCACCGGCGCCCCTCCACGCCGCCAACGAGTCCCCGGGCGTGCGTGCCCTTGGAGGGAGCCAATCCGCGGCCGGCGTGGGGCCCGGCCTGGCGGAGGTGATGCTGGTATGTGCGTCGCCACCGCCCCTCCCAGCACTGACGGGCCTGAGGGACGACAAGTTGACGCTCCTTTCGTCATCACCTGGTCTAGGAGGGACGCCCGGGGAGACCGTACGTCACTGCTCTGCGCCGGAAGACCCTATTTTCAGGTTCTCTTCCCTCCATTCCTACCCCTTCCCCGGTACCATAAAATCCCGGGATATGAGCT * NH:i:6 HI:i:1 AS:i:648 nM:i:0

When running STAR from Linux_x86_64, I got this error: version `GLIBC_2.29' not found

You mentioned 'Cut a few thousands reads around the problematic read and run mapping.' I'm not sure how to do that.
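One way this subsetting might be done is sketched below in Python. It is an assumption about what "cutting reads around the problematic read" means in practice, not a procedure taken from the STAR documentation: it keeps the FASTA record whose ID matches the failing read plus a fixed number of records on either side. The function names, the `flank` parameter and the output file name `subset.fna` are all hypothetical.

```python
def read_fasta(lines):
    """Yield (header_line, sequence_lines) for each FASTA record."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def window_around(lines, read_id, flank=2000):
    """Return the record named read_id plus up to `flank` records on each side.

    All records are held in memory, which is simple but not economical
    for very large files.
    """
    records = list(read_fasta(lines))
    # The record ID is the first whitespace-delimited token after ">".
    ids = [(header[1:].split() or [""])[0] for header, _ in records]
    if read_id not in ids:
        raise ValueError(f"{read_id} not found in FASTA input")
    center = ids.index(read_id)
    start = max(0, center - flank)
    return records[start:center + flank + 1]


def write_fasta(records, handle):
    """Write records back out in FASTA format."""
    for header, seq in records:
        handle.write(header + "\n")
        for chunk in seq:
            handle.write(chunk + "\n")
```

With the files from this issue, usage could look like the following, after which `subset.fna` would be passed to `--readFilesIn` in place of `../hsa_rna.fna`:

```python
with open("../hsa_rna.fna") as src, open("subset.fna", "w") as dst:
    write_fasta(window_around(src, "NM_001286270.2"), dst)
```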

I'm mapping the Homo sapiens transcripts to the RefSeq genome and annotations. All of them can be found here

I'm using:

Ubuntu 18.04.3 LTS
STAR 2.7.11b
samtools 1.20

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 12
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
CPU family: 6
Model name: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz

39G RAM, 8G swap

I appreciate any insight!