alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 501 forks source link

Indexing Causes Change in nucleotide length for a small subset of transcripts #1140

Open benyoung93 opened 3 years ago

benyoung93 commented 3 years ago

Good morning 😊 I am opening this issue on the recommendation of @rob-p and our discussion over on BioStar (https://www.biostars.org/p/486346/#486637).

I was using STAR to index and align my raw reads to the available genome I have. This was done successfully but then using salmon to quantify I received the following error (please note salmon transcriptome was generated using gffread on the same files used for input for STAR indexing).

"SAM file says target evm.model.Sc0a5M3_402_HRSCAF_756.4.1.5f5b2bc4 has length 536, but the FASTA file contains a sequence of length [538 or 537]"

I then just ran salmon in index and alignment mode and this error was not thrown, but quantification successfully processed.

On closer inspection it seems that the indexing stage with STAR is causing the differing lengths of 23 transcripts by 1 or 2 bases and I have absolutely no idea why this is (link to image of transcripts - https://ibb.co/r591n3n). This was done by the following comparisons of the transcript lengths.

Please find below the STAR index script, as well as the alignment script I used. Please let me know if you need anymore information, I have tried to follow best practices but this is my first issue submission 😊

STAR index /nethome/bdy8/programs/STAR \ --runThreadN 8 \ --runMode genomeGenerate \ --genomeDir /nethome/bdy8/apal_genome/v3.1_star_index \ --genomeFastaFiles /nethome/bdy8/apal_genome/version3.1/Apalm_assembly_v3.1_200911.masked.fasta \ --sjdbGTFfile /nethome/bdy8/apal_genome/version3.1/Apalm_assembly_notrna_v3.1_200911.gff3 \ --sjdbOverhang 100 \ --sjdbGTFtagExonParentTranscript Parent

STAR Alignment /nethome/bdy8/programs/STAR \ --runThreadN 8 \ --genomeDir /nethome/bdy8/apal_genome/v3.1_star_index \ --readFilesIn /scratch/projects/transcriptomics/ben_young/DHE/tagseq/host/trimmed/'"${PALPAL}"'_tr.fastq \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --outFilterMismatchNoverLmax 0.1 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outSAMtype BAM SortedByCoordinate \ --quantMode TranscriptomeSAM GeneCounts \ --outSAMstrandField intronMotif \ --twopassMode Basic \ --outFilterScoreMinOverLread 0.2 \ --outFilterMatchNminOverLread 0.2 \ --twopass1readsN -1 \ --outReadsUnmapped Fastx \ --outFileNamePrefix

GFFread Transcriptome Generation /nethome/bdy8/programs/gffread-0.9.12/gffread \ -w /nethome/bdy8/apal_genome/version3.1/Apalm_gffread_for_salmon.fasta \ -g /nethome/bdy8/apal_genome/version3.1/Apalm_assembly_v3.1_200911.masked.fasta \ /nethome/bdy8/apal_genome/version3.1/Apalm_assembly_v3.1_notrna_200911.gff3

alexdobin commented 3 years ago

Hi Benjamin,

could you try to convert GFF3 to the GTF with gffread, and rerun STAR genome generation, and also the Salmon fasta generation with the GTF file? GFF3 sometimes produces inconsistent results.

Cheers Alex

benyoung93 commented 3 years ago

Hi Alex Sorry for the delay they are performing maintenance on our supercomputer. I will get to this next week. Ben

mjsduncan commented 2 years ago

i had the same problem running the bcbio_nextgen "RNA-seq" pipeline using STAR version STAR_2.6.1d where i explicitly used a gtf file created by bcbio to go with the BDGP6 fruit fly genome:

from bcbio_nextgen pipeline command log file:

../anaconda/bin/STAR --genomeDir /home/mozi/biosrc/bcbio/genomes/Dmelanogaster/BDGP6/star/ --readFilesIn <(gunzip -c /mnt/dm/flies/work/rnaseq1/input/B1_1.fq.gz) <(gunzip -c /mnt/dm/flies/work/rnaseq1/input/B1_2.fq.gz) --runThreadN 10 --outFileNamePrefix /mnt/dm/tmp/tmpfchg43mo/B11pass/1_220801_rnaseq1 --outReadsUnmapped Fastx --outFilterMultimapNmax 10 --outStd BAM_Unsorted  --limitOutSJcollapsed 2000000 --outSAMtype BAM Unsorted --outSAMmapqUnique 60 --outSAMunmapped Within --outSAMattributes NH HI NM MD AS  --sjdbGTFfile /mnt/biodata/bcbio/genomes/Dmelanogaster/BDGP6/rnaseq/ref-transcripts.gtf  --sjdbOverhang 149  --outSAMattrRGline ID:B1 PL:illumina PU:1_220801_rnaseq1 SM:B1 --chimSegmentMin 10 --chimOutType WithinBAM --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 --peOverlapNbasesMin 10 --alignSplicedMateMapLminOverLmate 0.5  --quantMode TranscriptomeSAM  | /home/mozi/biosrc/bcbio/galaxy/../anaconda/bin/samtools sort -@ 10 -m 3G  -T /mnt/dm/tmp/tmpfchg43mo/B11pass/1_220801_rnaseq1_star/B1-sorttmp -o /mnt/dm/tmp/tmpfchg43mo/B11pass/1_220801_rnaseq1_star/B1.bam /dev/stdin > /mnt/dm/tmp/tmpfchg43mo/B11pass/1_220801_rnaseq1_star/B1.bam 

salmon v1.7.0 halts with a similar error:

$ salmon quant -l A -p 12 -t /mnt/biodata/bcbio/genomes/Dmelanogaster/BDGP6/rnaseq/ref-transcripts.fa -o /mnt/dm/tmp/tmpqbqa26av/B1 -a /mnt/dm/flies/work/rnaseq1/work/align/B1/1_220801_rnaseq1_star/B1.transcriptome.bam --numBootstraps 30 
...
Populating targets from aln = "/mnt/dm/flies/work/rnaseq1/work/align/B1/1_220801_rnaseq1_star/B1.transcriptome.bam", fasta = "/mnt/biodata/bcbio/genomes/Dmelanogaster/BDGP6/rnaseq/ref-transcripts.fa" . . .
SAM file says target FBgn0013687 has length 4608, but the FASTA file contains a sequence of length [4602 or 4601]
' returned non-zero exit status 1.

bcbio_star_index.log

Kekananen commented 2 years ago

I am having the same issue on grch38 human genome using gffread derived files for the transcriptome.fa.

STAR --runThreadN 8 --genomeDir ../resources/genomes/star/ --readFilesIn ../results/transPALLAS/trimming/trimmomaticPE/e-pal-1002-3-c-1-r-HGLWTDSX3-AATGACCTGA-GAGCTCCACA.R1.fastq.gz ../results/transPALLAS/trimming/trimmomaticPE/e-pal-1002-3-c-1-r-HGLWTDSX3-AATGACCTGA-GAGCTCCACA.R2.fastq.gz --readFilesCommand gunzip -c --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outFileNamePrefix e-pal-1002-3-c-1-r-HGLWTDSX3-AATGACCTGA-GAGCTCCACA
salmon quant -t ../resources/genomes/transcriptome/transcriptome.fa -l A -a ../results/transPALLAS/alignment/star/e-pal-1002-3-c-1-r-HGLWTDSX3-AATGACCTGA-GAGCTCCACAAligned.toTranscriptome.out.bam -o ../results/transPALLAS/quantification/salmon/e-pal-1002-3-c-1-r-HGLWTDSX3-AATGACCTGA-GAGCTCCACA --gcBias

throwing SAM file says target ENST00000431238.7 has length 518, but the FASTA file contains a sequence of length [260 or 259]

alexdobin commented 2 years ago

Since these are non-standard fasta/gtf, we will to isolate the problem. Please create fasta/gtf from just the transcript in question, and run STAR genome generation and mapping. Then we can check the header of the SAM file to see if the length of the transcript is correct.

naumenko-sa commented 1 year ago

Hello everyone!

In bcbio we solved it by generating ref-transcripts.full.fa out of the genome reference and ref-transcripts.gtf in genome/Celegans/WBCel235 in our case by using gffread (it was also a custom genome). It indeed gives longer sequences per transcript and is compatible with STAR transcriptome alignments. https://github.com/bcbio/bcbio-nextgen/issues/3671

SN

AlexGaithuma commented 11 months ago

This is the same with Tick genome from NCBI Genome ASM1692078v2

SAM file says target XM_002399893.5 has length 1584, but the FASTA file contains a sequence of length 1590
marciow0205 commented 7 months ago

Good afternoon everyone,

I am new to bioinformatics and lately I have been encountering the same issue, but I only find discussions about it here and in the publications cited by you.

I am using the complete reference genome GRCh38.p14 to map with STAR and then I intend to quantify with Salmon, as I need TPM values.

The problem arises when I use Salmon, it presents the same error mentioned by others, and I identified that for some reason during the creation of the STAR index, it makes this cut.

I tried using the argument "--clipAdapterType None" during mapping, thinking it might be the solution, but it indicates that it does not exist, even though the manual says otherwise.

Could someone explain to me how to adjust this error better? If you need more information, just let me know.

alexdobin commented 6 months ago

Hi @marciow0205

I could not reproduce this problem. If you can isolate the "chromosome" which causes it, I could look into it.

charlesdavid commented 3 months ago

Hi Alex,

We are experiencing the same problem when using the nf-core rnaseq pipeline (version 3.14.0). There seems to be an incompatibility between the indexing done with STAR and the transcript fasta files. One difference that I am seeing is that the length for us is not off by just one or two bases, but by a factor of two. Of note, the nf-core pipeline offers the choice of aligner to be either star_rsem or star_salmon. This issue occurs with BOTH choices, indicating that the issue is related to STAR and not either RSEM or SALMON. Could you have a look at this problem as STAR is an excellent aligner and the quantification from RSEM and SALMON are very useful. Below I provide some details of the pipeline for your consideration: 1)GTF file is entered or created from GFF file 2) Transcripts are extracted from the reference genome 3) GTF is filtered to ensure that only entries that match the transciptome are included 4) STAR index is done on these files 5) Salmon quant is run using the transciptome aligned bam from STAR in alignment mode

Summary of error:

ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (C222_1)'

Caused by: Process NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (C222_1) terminated with an error exit status (1)

Command executed:

salmon quant \ --geneMap assembly.filtered.gtf \ --threads 6 \ --libType=ISR \ -t genome.transcripts.fa \ -a C222_1.Aligned.toTranscriptome.out.bam \ --noLengthCorrection \ -o C222_1

Command exit status: 1

Command output: (empty)

Command error: Version Info: This is the most recent version of salmon. salmon (alignment-based) v1.10.1 Checking that provided alignment files have consistent headers . . . done Populating targets from aln = "C222_1.Aligned.toTranscriptome.out.bam", fasta = "genome.transcripts.fa" . . .

SAM file says target gene1 has length 2840, but the FASTA file contains a sequence of length [1420 or 1419]

UPDATE

Hi to all who have been experiencing this problem.

I have looked into it a bit and have discovered that in all cases the root cause appears to be a malformed GFF file. Since most of the work we do involves non model organisms, the quality of the annotations is always an issue. The work-around for this particular problem was to not only validate the GFF files, but actually repair them. The solution involves passing the GFF file to multiple programs:

agat_convert_sp_gxf2gxf.pl | gt gff3 | gffread

Doing this resolved the issue for us.

I will re-iterate: Having a "validated" GFF file is NOT sufficient for these indexing problems.

Hope this information helps :-)