Does STAR index need to be generated with the FASTA outputted by rsem-prepare-reference or can I use the same genome file provided to rsem-prepare-reference?

deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data

http://deweylab.biostat.wisc.edu/rsem/

GNU General Public License v3.0


Does STAR index need to be generated with the FASTA outputted by rsem-prepare-reference or can I use the same genome file provided to rsem-prepare-reference? #179

Open etrh opened 2 years ago

etrh commented 2 years ago

I am trying to run STAR manually and then provide the transcriptome BAM to RSEM Calculate Expression. I just find the documentation a bit confusing and I am not sure if I'm doing everything correctly.

Here is what the documentation says:

To use an alternative alignment program, align the input reads against the file reference_name.idx.fa generated by rsem-prepare-reference, and format the alignment output in SAM/BAM/CRAM format. Then, instead of providing reads to rsem-calculate-expression, specify the --alignments option and provide the SAM/BAM/CRAM file as an argument

Does this mean that I need to generate my STAR index using the reference_name.idx.fa that rsem-prepare-reference returns? (instead of using the same genome file that I downloaded from Ensembl or GENCODE and provided directly to rsem-prepare-reference?)

cc: @pliu55 @RamRS

pliu55 commented 2 years ago

Hi @etrh,

It is possible to run STAR manually and provide the transcriptome BAM to rsem-calculate-expression. You can try the --bam option for rsem-calculate-expression. The ENCODE RNA-seq pipeline has an example with more details on this:

rsem-calculate-expression --bam --estimate-rspd --calc-ci --seed ${rnd_seed} -p $ncpus \
    --no-bam-output --ci-memory 30000 ${extra_flags} $anno_bam ${index_prefix} ${bam_root}_rsem

The $anno_bam is the transcriptome BAM you got from STAR.

RamRS commented 2 years ago

Sorry @etrh, I can’t participate in this discussion. I’m a little short on time.

etrh commented 2 years ago

Thank you @pliu55 (and also @RamRS, I completely understand, no worries :-))

That makes sense. However, my question is specifically about the STAR index and the proper way of building it if I want to pass the resulting transcriptome BAM to RSEM.

Specifically, I would like to know whether I can build the STAR index from the genome file that I downloaded from Ensembl/GENCODE. Or should I first build the RSEM index, then take the reference_name.idx.fa that rsem-prepare-reference creates and build my STAR index from that file? This seems to be what the documentation suggests (https://github.com/deweylab/RSEM#using-an-alternative-aligner).

pliu55 commented 2 years ago

Hi @etrh,

I am not sure if I understand your question correctly. In principle, the STAR and RSEM references are prepared independently, as long as the same gene annotation and genome sequence files are used for both. I don't think reference_name.idx.fa from RSEM is required to build the STAR reference.
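For readers following along, here is a minimal sketch of what "independent, but from the same files" could look like. The file names, directories, and thread count are hypothetical placeholders, and the small `run` helper with its `DRY_RUN` switch is only an illustration device so that the commands can be printed without STAR or RSEM installed. It is not taken from the RSEM or STAR documentation.

```shell
#!/bin/sh
# Sketch: build the STAR and RSEM references independently from the SAME inputs.
# All paths below are hypothetical placeholders.
set -eu

GENOME_FA="${GENOME_FA:-genome.fa}"        # genome FASTA (e.g. from Ensembl/GENCODE)
ANNOT_GTF="${ANNOT_GTF:-annotation.gtf}"   # matching gene annotation
STAR_DIR="${STAR_DIR:-star_index}"         # output directory for the STAR index
RSEM_PREFIX="${RSEM_PREFIX:-rsem_ref/ref}" # output prefix for the RSEM reference
THREADS="${THREADS:-8}"
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print commands, 0 = run them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

build_references() {
    run mkdir -p "$STAR_DIR" "$(dirname "$RSEM_PREFIX")"
    # STAR index built directly from the original genome FASTA and GTF.
    run STAR --runMode genomeGenerate --runThreadN "$THREADS" \
        --genomeDir "$STAR_DIR" \
        --genomeFastaFiles "$GENOME_FA" \
        --sjdbGTFfile "$ANNOT_GTF"
    # RSEM reference built from the same two files.
    run rsem-prepare-reference --gtf "$ANNOT_GTF" "$GENOME_FA" "$RSEM_PREFIX"
}

build_references
```

With the default `DRY_RUN=1` the script only prints the three commands; setting `DRY_RUN=0` would execute them. The point of the sketch is that both tools receive the same `GENOME_FA` and `ANNOT_GTF`, so the transcript names in STAR's transcriptome BAM should match the RSEM reference.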

etrh commented 2 years ago

@pliu55 I'm specifically referring to this text from the manual:

To use an alternative alignment program, align the input reads against the file reference_name.idx.fa generated by rsem-prepare-reference

Am I misunderstanding something here? To me it sounds like the text above specifically expects the STAR index to be generated from the FASTA that rsem-prepare-reference generates.

etrh commented 2 years ago

@bli25 / @alexdobin Would you happen to know the correct approach here when using STAR + RSEM? Any help would be greatly appreciated.

I have gone through several pipelines and tutorials online, but I haven't been able to figure out whether the genome should first go through rsem-prepare-reference, with the resulting reference_name.idx.fa then used to generate the STAR index.

alexdobin commented 2 years ago

Hi @etrh

Not sure if I can help here. I use a STAR+RSEM pipeline without calling STAR from RSEM. Rather, I generate a genome index and map with STAR, and then use STAR's BAM as RSEM's input. This allows more flexibility.

Cheers, Alex
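A rough sketch of the workflow described above (map with STAR, then hand STAR's transcriptome BAM to RSEM) might look like the following. The sample name, FASTQ names, index locations, and the paired-end assumption are all hypothetical. The `run` helper and `DRY_RUN` switch are illustration devices, not part of either tool. The RSEM README documents `--alignments` for aligned input, whereas the ENCODE command quoted earlier in this thread uses `--bam`, so which flag applies may depend on the installed RSEM version.

```shell
#!/bin/sh
# Sketch: align with STAR, then quantify STAR's transcriptome BAM with RSEM.
# All names below are hypothetical placeholders; reads are assumed paired-end and gzipped.
set -eu

STAR_DIR="${STAR_DIR:-star_index}"
RSEM_PREFIX="${RSEM_PREFIX:-rsem_ref/ref}"
R1="${R1:-sample_R1.fastq.gz}"
R2="${R2:-sample_R2.fastq.gz}"
SAMPLE="${SAMPLE:-sample}"
THREADS="${THREADS:-8}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = run them

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

transcriptome_bam() {
    # STAR names the transcriptome BAM <prefix>Aligned.toTranscriptome.out.bam.
    printf '%sAligned.toTranscriptome.out.bam\n' "$1"
}

align_and_quantify() {
    # --quantMode TranscriptomeSAM makes STAR also write transcriptome-space alignments.
    run STAR --runThreadN "$THREADS" --genomeDir "$STAR_DIR" \
        --readFilesIn "$R1" "$R2" --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --quantMode TranscriptomeSAM \
        --outFileNamePrefix "${SAMPLE}."
    # RSEM takes the transcriptome BAM, the RSEM reference prefix, and an output name.
    run rsem-calculate-expression --alignments --paired-end -p "$THREADS" \
        "$(transcriptome_bam "${SAMPLE}.")" "$RSEM_PREFIX" "${SAMPLE}_rsem"
}

align_and_quantify
```

By default this only prints the two commands. The coordinate-sorted genome BAM requested via `--outSAMtype` is still useful for browsing or other tools, but it is not what RSEM reads in this sketch.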

etrh commented 2 years ago

Hi again @alexdobin

Thank you! This is extremely helpful information.

Incidentally, do you know whether RSEM needs the BAI file alongside the transcriptome.bam? Or does RSEM not use the BAM index (BAI) at all?

alexdobin commented 2 years ago

Hi @etrh

The transcriptome.bam file that RSEM uses is not sorted by coordinate, so a .bai file is not needed.

Best, Alex

edceeyuchen commented 1 year ago

Hi @alexdobin, I used the same pipeline as you and set the SortedByCoordinate parameter. I then got two types of BAM files, .toTranscriptome.out.bam and .sortedByCoord.out.bam. Which one do you use as the input file for RSEM? Eagerly looking forward to your help! Thank you!

alexdobin commented 1 year ago

Hi @edceeyuchen

RSEM needs the *.toTranscriptome.out.bam file.
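As a small illustration of that answer, the sketch below picks the transcriptome BAM out of a list of STAR outputs. The helper name `pick_rsem_input` and the file names are hypothetical. The optional check at the end uses `rsem-sam-validator`, which ships with RSEM, and only runs if the tool is installed and the file actually exists.

```shell
#!/bin/sh
# Sketch: choose the STAR output that RSEM should read.
# pick_rsem_input and the file names are hypothetical, for illustration only.
set -eu

pick_rsem_input() {
    # Print the first argument that looks like STAR's transcriptome BAM.
    for f in "$@"; do
        case "$f" in
            *toTranscriptome.out.bam)
                printf '%s\n' "$f"
                return 0
                ;;
        esac
    done
    echo "no *.toTranscriptome.out.bam among: $*" >&2
    return 1
}

bam="$(pick_rsem_input \
    sample.Aligned.sortedByCoord.out.bam \
    sample.Aligned.toTranscriptome.out.bam)"
echo "RSEM input: $bam"

# Optional sanity check with RSEM's own validator, if available.
if command -v rsem-sam-validator >/dev/null 2>&1 && [ -f "$bam" ]; then
    rsem-sam-validator "$bam"
fi
```

The coordinate-sorted file is skipped on purpose: it holds genome-space alignments, while RSEM expects alignments against transcripts.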

edceeyuchen commented 1 year ago

Sorry for my late reply!

And thank you for your timely help, @alexdobin!