deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0
403 stars 118 forks source link

RSEM reporting zero transcripts in RSEM references during `calculate-expression`, despite `prepare-references` showing correct number #167

Closed acesnik closed 3 years ago

acesnik commented 3 years ago

I'm using the current release of RSEM from bioconda (1.3.3), and I'm getting a strange error in calculate-expression with and without the --star alignment included. While the prepare-references step shows the correct number of transcripts detected, calculate-expression seems to think there are zero transcripts in the RSEM references.

I'm currently trying to write a test case using just spike-in sequences and chromosomes 20, 21, and 22. I've appended "exons" for the spike-in sequences to the GTF file, formatted like such:

ERCC-00002 ERCC exon 1 1061 . + . gene_id "gene:ERCC-00002"; transcript_id "transcript:ERCC-00002"; exon_number "1"; gene_name "ERCC-00002"; transcript_name "ERCC-00002";

I get the same error when I try to use STAR's .toTranscriptome.out.bam file, although I think this example using RSEM's integrated STAR pipeline is a bit clearer in describing what I'm seeing:

While the prepare-reference command shows that it's detecting 13,691 transcripts, ``` rsem-extract-reference-transcripts ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference 0 ../resources/ensembl/202122.gtf.fix.gtf None 0 ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa Parsing gtf File is done! ../resources/ERCC.filtered.fa is processed! ../resources/ensembl/202122.fa is processed! 13691 transcripts are extracted. Extracting sequences is done! Group File is generated! Transcript Information File is generated! Chromosome List File is generated! Extracted Sequences File is generated! rsem-preref ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.transcripts.fa 1 ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference Refs.makeRefs finished! Refs.saveRefs finished! ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.idx.fa is generated! ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.n2g.idx.fa is generated! STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --genomeFastaFiles ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa --sjdbGTFfile ../resources/ensembl/202122.gtf.fix.gtf --sjdbOverhang 100 --outFileNamePrefix ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --genomeFastaFiles ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa --sjdbGTFfile ../resources/ensembl/202122.gtf.fix.gtf --sjdbOverhang 100 --outFileNamePrefix ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source Jun 03 23:27:01 ..... started STAR run Jun 03 23:27:01 ... starting to generate Genome files Jun 03 23:27:03 ..... processing annotations GTF !!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=162055374, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 12 Jun 03 23:27:05 ... starting to sort Suffix Array. This may take a long time... Jun 03 23:27:06 ... sorting Suffix Array chunks and saving them to disk... Jun 03 23:29:19 ... loading chunks from disk, packing SA... Jun 03 23:29:36 ... finished generating suffix array Jun 03 23:29:36 ... generating Suffix Array index Jun 03 23:30:36 ... completed Suffix Array index Jun 03 23:30:36 ..... inserting junctions into the genome indices Jun 03 23:30:56 ... writing Genome to disk ... Jun 03 23:30:58 ... writing Suffix Array to disk ... Jun 03 23:31:08 ... writing SAindex to disk Jun 03 23:31:23 ..... finished successfully ```
calculate-expression doesn't seem to detect them. ``` root@3c02f770b007:/data/workflow# cat ../results/quant/SRR11286627calculate-expression.log STAR --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 8 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFileNamePrefix ../results/quant/SRR11286627.temp/SRR11286627 --readFilesIn /dev/fd/63 STAR --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 8 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFileNamePrefix ../results/quant/SRR11286627.temp/SRR11286627 --readFilesIn /dev/fd/63 STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source Jun 03 23:31:24 ..... started STAR run Jun 03 23:31:24 ..... loading genome Jun 03 23:31:43 ..... started mapping Jun 03 23:36:13 ..... finished mapping Jun 03 23:36:15 ..... finished successfully rsem-parse-alignments ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference ../results/quant/SRR11286627.temp/SRR11286627 ../results/quant/SRR11286627.stat/SRR11286627 ../results/quant/SRR11286627.temp/SRR11286627.bam 1 -tag XM The SAM/BAM file declares more reference sequences (13691) than RSEM knows (0)! "rsem-parse-alignments ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference ../results/quant/SRR11286627.temp/SRR11286627 ../results/quant/SRR11286627.stat/SRR11286627 ../results/quant/SRR11286627.temp/SRR11286627.bam 1 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline! ```
acesnik commented 3 years ago

This has now started working miraculously. Please disregard.