I'm using the current release of RSEM from bioconda (1.3.3), and I'm getting a strange error in calculate-expression with and without the --star alignment included. While the prepare-references step shows the correct number of transcripts detected, calculate-expression seems to think there are zero transcripts in the RSEM references.
I'm currently trying to write a test case using just spike-in sequences and chromosomes 20, 21, and 22. I've appended "exons" for the spike-in sequences to the GTF file, formatted like such:
I get the same error when I try to use STAR's .toTranscriptome.out.bam file, although I think this example using RSEM's integrated STAR pipeline is a bit clearer in describing what I'm seeing:
While the prepare-reference command shows that it's detecting 13,691 transcripts,
```
rsem-extract-reference-transcripts ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference 0 ../resources/ensembl/202122.gtf.fix.gtf None 0 ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa
Parsing gtf File is done!
../resources/ERCC.filtered.fa is processed!
../resources/ensembl/202122.fa is processed!
13691 transcripts are extracted.
Extracting sequences is done!
Group File is generated!
Transcript Information File is generated!
Chromosome List File is generated!
Extracted Sequences File is generated!
rsem-preref ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.transcripts.fa 1 ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference
Refs.makeRefs finished!
Refs.saveRefs finished!
../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.idx.fa is generated!
../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.n2g.idx.fa is generated!
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --genomeFastaFiles ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa --sjdbGTFfile ../resources/ensembl/202122.gtf.fix.gtf --sjdbOverhang 100 --outFileNamePrefix ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --genomeFastaFiles ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa --sjdbGTFfile ../resources/ensembl/202122.gtf.fix.gtf --sjdbOverhang 100 --outFileNamePrefix ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference
STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 03 23:27:01 ..... started STAR run
Jun 03 23:27:01 ... starting to generate Genome files
Jun 03 23:27:03 ..... processing annotations GTF
!!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=162055374, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 12
Jun 03 23:27:05 ... starting to sort Suffix Array. This may take a long time...
Jun 03 23:27:06 ... sorting Suffix Array chunks and saving them to disk...
Jun 03 23:29:19 ... loading chunks from disk, packing SA...
Jun 03 23:29:36 ... finished generating suffix array
Jun 03 23:29:36 ... generating Suffix Array index
Jun 03 23:30:36 ... completed Suffix Array index
Jun 03 23:30:36 ..... inserting junctions into the genome indices
Jun 03 23:30:56 ... writing Genome to disk ...
Jun 03 23:30:58 ... writing Suffix Array to disk ...
Jun 03 23:31:08 ... writing SAindex to disk
Jun 03 23:31:23 ..... finished successfully
```
calculate-expression doesn't seem to detect them.
```
root@3c02f770b007:/data/workflow# cat ../results/quant/SRR11286627calculate-expression.log
STAR --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 8 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFileNamePrefix ../results/quant/SRR11286627.temp/SRR11286627 --readFilesIn /dev/fd/63
STAR --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 8 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFileNamePrefix ../results/quant/SRR11286627.temp/SRR11286627 --readFilesIn /dev/fd/63
STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 03 23:31:24 ..... started STAR run
Jun 03 23:31:24 ..... loading genome
Jun 03 23:31:43 ..... started mapping
Jun 03 23:36:13 ..... finished mapping
Jun 03 23:36:15 ..... finished successfully
rsem-parse-alignments ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference ../results/quant/SRR11286627.temp/SRR11286627 ../results/quant/SRR11286627.stat/SRR11286627 ../results/quant/SRR11286627.temp/SRR11286627.bam 1 -tag XM
The SAM/BAM file declares more reference sequences (13691) than RSEM knows (0)!
"rsem-parse-alignments ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference ../results/quant/SRR11286627.temp/SRR11286627 ../results/quant/SRR11286627.stat/SRR11286627 ../results/quant/SRR11286627.temp/SRR11286627.bam 1 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!
```
I'm using the current release of RSEM from bioconda (1.3.3), and I'm getting a strange error in
calculate-expression
with and without the--star
alignment included. While theprepare-references
step shows the correct number of transcripts detected,calculate-expression
seems to think there are zero transcripts in the RSEM references.I'm currently trying to write a test case using just spike-in sequences and chromosomes 20, 21, and 22. I've appended "exons" for the spike-in sequences to the GTF file, formatted like such:
ERCC-00002 ERCC exon 1 1061 . + . gene_id "gene:ERCC-00002"; transcript_id "transcript:ERCC-00002"; exon_number "1"; gene_name "ERCC-00002"; transcript_name "ERCC-00002";
I get the same error when I try to use STAR's
.toTranscriptome.out.bam
file, although I think this example using RSEM's integrated STAR pipeline is a bit clearer in describing what I'm seeing:While the prepare-reference command shows that it's detecting 13,691 transcripts,
``` rsem-extract-reference-transcripts ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference 0 ../resources/ensembl/202122.gtf.fix.gtf None 0 ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa Parsing gtf File is done! ../resources/ERCC.filtered.fa is processed! ../resources/ensembl/202122.fa is processed! 13691 transcripts are extracted. Extracting sequences is done! Group File is generated! Transcript Information File is generated! Chromosome List File is generated! Extracted Sequences File is generated! rsem-preref ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.transcripts.fa 1 ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference Refs.makeRefs finished! Refs.saveRefs finished! ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.idx.fa is generated! ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference.n2g.idx.fa is generated! STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --genomeFastaFiles ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa --sjdbGTFfile ../resources/ensembl/202122.gtf.fix.gtf --sjdbOverhang 100 --outFileNamePrefix ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --genomeFastaFiles ../resources/ERCC.filtered.fa ../resources/ensembl/202122.fa --sjdbGTFfile ../resources/ensembl/202122.gtf.fix.gtf --sjdbOverhang 100 --outFileNamePrefix ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source Jun 03 23:27:01 ..... started STAR run Jun 03 23:27:01 ... starting to generate Genome files Jun 03 23:27:03 ..... processing annotations GTF !!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=162055374, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 12 Jun 03 23:27:05 ... starting to sort Suffix Array. This may take a long time... Jun 03 23:27:06 ... sorting Suffix Array chunks and saving them to disk... Jun 03 23:29:19 ... loading chunks from disk, packing SA... Jun 03 23:29:36 ... finished generating suffix array Jun 03 23:29:36 ... generating Suffix Array index Jun 03 23:30:36 ... completed Suffix Array index Jun 03 23:30:36 ..... inserting junctions into the genome indices Jun 03 23:30:56 ... writing Genome to disk ... Jun 03 23:30:58 ... writing Suffix Array to disk ... Jun 03 23:31:08 ... writing SAindex to disk Jun 03 23:31:23 ..... finished successfully ```calculate-expression doesn't seem to detect them.
``` root@3c02f770b007:/data/workflow# cat ../results/quant/SRR11286627calculate-expression.log STAR --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 8 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFileNamePrefix ../results/quant/SRR11286627.temp/SRR11286627 --readFilesIn /dev/fd/63 STAR --genomeDir ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 8 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFileNamePrefix ../results/quant/SRR11286627.temp/SRR11286627 --readFilesIn /dev/fd/63 STAR version: 2.7.9a compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source Jun 03 23:31:24 ..... started STAR run Jun 03 23:31:24 ..... loading genome Jun 03 23:31:43 ..... started mapping Jun 03 23:36:13 ..... finished mapping Jun 03 23:36:15 ..... finished successfully rsem-parse-alignments ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference ../results/quant/SRR11286627.temp/SRR11286627 ../results/quant/SRR11286627.stat/SRR11286627 ../results/quant/SRR11286627.temp/SRR11286627.bam 1 -tag XM The SAM/BAM file declares more reference sequences (13691) than RSEM knows (0)! "rsem-parse-alignments ../resources/ensembl/Homo_sapiens.GRCh38.103Rsem/RsemReference ../results/quant/SRR11286627.temp/SRR11286627 ../results/quant/SRR11286627.stat/SRR11286627 ../results/quant/SRR11286627.temp/SRR11286627.bam 1 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline! ```