deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0
404 stars 117 forks source link

miRNA and smallRNA pipeline #78

Open andremrsantos opened 6 years ago

andremrsantos commented 6 years ago

Hi RSEM Staff, I work with miRNA pipelines and would be interesting to run RSEM estimations, especially due to miRNAs small sequence and the high number of hits. I was trying to run the protocol using gencode (v27) miRNAs using the following command:

$ rsem-calculate-expression \
  --seed-length 15 \
  --fragment-length-mean 25 \
  --fragment-length-sd 5 \
  -p 8 \
  --star \
  {INPUT} {REF} B12

This command calls the following STAR command-line.

STAR --genomeDir {REF}  \
  --outSAMunmapped Within  \
  --outFilterType BySJout  \
  --outSAMattributes NH HI AS NM MD  \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999  \
  --outFilterMismatchNoverLmax 0.04  \
  --alignIntronMin 20  \
  --alignIntronMax 1000000  \
  --alignMatesGapMax 1000000  \
  --alignSJoverhangMin 8  \
  --alignSJDBoverhangMin 1  \
  --sjdbScore 1  \
  --runThreadN 8  \
  --genomeLoad NoSharedMemory  \
  --outSAMtype BAM Unsorted \
  --quantMode TranscriptomeSAM  \
  --outSAMheaderHD \@HD VN:1.4 SO:unsorted  \
  --outFileNamePrefix B12.temp/B12  \
  --readFilesIn {INPUT}

Reviewing the ENCODE (miRNA pipeline)[https://www.encodeproject.org/pipelines/ENCPL444CYA/], I found very small changes from the one implemented here. The major differences were the removal of Intron alignments with the flag --alignIntronMax 1 and the inclusing of --alignEndsType EndToEnd. This changes yield the following command:

STAR --genomeDir {REF} \
  --readFilesIn {INPUT} \
  --runThreadN 8 \
  --alignEndsType EndToEnd \
  --outFilterType BySJout  \
  --outSAMattributes NH HI AS NM MD  \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 200  \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.04  \
  --alignSJDBoverhangMin 1 \
  --alignIntronMax 1 \
  --alignSJoverhangMin 8  \
  --sjdbScore 1  \
  --genomeLoad NoSharedMemory  \
  --quantMode TranscriptomeSAM \
  --outSAMheaderHD \@HD VN:1.4 SO:unsorted  \
  --outFileNamePrefix B12.temp/B12

Do you think you could add a protocol option to enable this kind of transcript? Also, it would be nice to regulate the flag --outFilterMultimapNmax. You guys consider at most 20 mutliple-hits for STAR while allow up 200 for bowtie2 (by default).

Great work

map2085 commented 6 years ago

I noticed the same thing.

Note that you can go into the source code and easily change the number of multiple hits reported by STAR. This is a good workaround until the RSEM developers fix it.

Just go to line 389 of "rsem-calculate-expression" perl script and change the value of the flag "--outFilterMultimapNmax" to some big number, e.g. 2000.