churchill-lab / emase

Expectation-Maximization algorithm for Allele-Specific Expression
http://churchill-lab.github.io/emase/
GNU General Public License v3.0

Running EMASE #1

Open bdeonovic opened 9 years ago

bdeonovic commented 9 years ago

Hi, I am interested in running your software on some RNA-seq data. The documentation on how to run it from the command line is not very good. After I run:

prepare-emase -G ${REF_GENOME} -g ${REF_GTF} -o ${REF_DIR} -m --no-bowtie-index

The usage tells me to run:

prepare-emase -G ${GENOME1},${GENOME2} -g ${GTF1},${GTF2} \
              -s ${SUFFIX1},${SUFFIX2} -o ${EMASE_DIR}

but I am not sure what GENOME1, GENOME2, GTF1, GTF2, SUFFIX1, and SUFFIX2 are.

Thanks

narayananr commented 9 years ago

Hi

Thanks for trying out EMASE, and sorry that the documentation is not clear. prepare-emase can be run in two ways. Can you please explain how you are trying to use EMASE?

In the first example, prepare-emase takes a haploid genome sequence as a FASTA file and an annotation as a GTF file, and extracts the set of all transcript sequences (and also the length of each transcript):

trans1 ATGC
trans2 ATGCTAGC
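The relationship between transcript sequences and their lengths in the toy example above can be sketched in a few lines of shell. This is only an illustration: the two-column name/sequence layout is an assumption made for the sketch, not the file format that prepare-emase actually writes.

```shell
# Toy transcript table from the example above: name, then sequence.
# (Assumed layout for illustration only, not prepare-emase's real output.)
TRANSCRIPTS='trans1 ATGC
trans2 ATGCTAGC'

# Print each transcript name followed by the length of its sequence.
LENGTHS=$(printf '%s\n' "${TRANSCRIPTS}" | awk '{ print $1, length($2) }')
echo "${LENGTHS}"
```

For the two toy transcripts this prints `trans1 4` and `trans2 8`.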

In the second case, prepare-emase takes multiple genomes and GTF files, together with suffix names (attached with "_"), as input and creates a pooled transcriptome. For diploid genomes, the genome sequences and GTF files correspond to the maternal and paternal genomes, and the suffixes are the names of the haplotypes, which are used to differentiate the haplotypes in the genomes, GTF files, and diploid transcriptomes.
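As a sketch of the second case, the placeholders from the usage text might be filled in as follows for a diploid sample. The file names, the suffixes M and P, and the output directory are all hypothetical; only the flags come from the command quoted earlier in this thread. The command is printed rather than executed, so it can be inspected before running.

```shell
# Hypothetical inputs: one genome FASTA and one GTF per haplotype.
GENOME1=maternal.fa
GENOME2=paternal.fa
GTF1=maternal.gtf
GTF2=paternal.gtf

# Hypothetical haplotype names, used as suffixes.
SUFFIX1=M
SUFFIX2=P

# Hypothetical output directory for the pooled transcriptome.
EMASE_DIR=emase_ref

# Assemble the command exactly as given in the usage text; echo it as a dry run.
PREPARE_CMD="prepare-emase -G ${GENOME1},${GENOME2} -g ${GTF1},${GTF2} -s ${SUFFIX1},${SUFFIX2} -o ${EMASE_DIR}"
echo "${PREPARE_CMD}"
```

Once the printed command looks right, it can be run by removing the quoting and the `echo`.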

Hope it helps. Narayanan

bdeonovic commented 9 years ago

Ultimately I would like to get the output noted at the bottom of the usage documentation:

‘run-emase’ outputs the following files:

${OUTBASE}.isoforms.effective_read_counts
${OUTBASE}.isoforms.tpm
${OUTBASE}.genes.effective_read_counts
${OUTBASE}.genes.tpm

Which steps do I need to follow to get to these results?
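Once a run has finished, the presence of the four files listed above can be checked with a small helper. This is a minimal sketch: the function name `check_emase_outputs` and the prefix `sample` are hypothetical, and only the four file name endings are taken from the usage documentation quoted above.

```shell
# Report which of the four documented run-emase outputs are missing for a
# given output prefix. Returns the number of missing files (0 means all found).
check_emase_outputs() {
    outbase=$1
    missing=0
    for suffix in isoforms.effective_read_counts isoforms.tpm \
                  genes.effective_read_counts genes.tpm; do
        if [ ! -f "${outbase}.${suffix}" ]; then
            echo "missing: ${outbase}.${suffix}"
            missing=$((missing + 1))
        fi
    done
    return "${missing}"
}

# "sample" is a hypothetical value of ${OUTBASE}.
check_emase_outputs sample || echo "some run-emase outputs are missing"
```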

narayananr commented 9 years ago

You may find this useful to get the results. http://emase.readthedocs.org/en/latest/usage.html

Thanks Narayanan

bdeonovic commented 9 years ago

That is the documentation that I have been referencing. It is confusing. For example:

run-emase -i ${EMASE_FILE} -g ${GROUP_FILE} -L ${TINFO_FILE} -M ${MODEL} -o ${OUTBASE} \
          -r ${READLEN} -p ${PSEUDOCOUNT} -m ${MAX_ITERS} -t ${TOLERANCE}

What is ${MODEL} or ${PSEUDOCOUNT}?

narayananr commented 9 years ago

EMASE has four EM models for dealing with multi-mapped reads, and we are currently testing them. We recommend using model 4 by specifying "-M 4".

The pseudocount option enables Bayesian estimation of allele specificity, which we have not tested extensively. Please use a zero pseudocount by specifying "-p 0".
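Putting the two recommendations together, a run-emase call might be assembled as in the sketch below. Only `-M 4` and `-p 0` come from this thread; the file names, read length, iteration limit, and tolerance are hypothetical placeholders to be replaced with your own values. The command is printed rather than executed.

```shell
# Values recommended in this thread.
MODEL=4          # -M 4: the recommended EM model
PSEUDOCOUNT=0    # -p 0: zero pseudocount

# Hypothetical placeholders; substitute your own files and settings.
EMASE_FILE=sample.emase
GROUP_FILE=gene2transcripts.txt
TINFO_FILE=transcriptome.info
OUTBASE=sample
READLEN=100
MAX_ITERS=999
TOLERANCE=0.0001

# Assemble the command from the usage text quoted above; echo it as a dry run.
RUN_CMD="run-emase -i ${EMASE_FILE} -g ${GROUP_FILE} -L ${TINFO_FILE} -M ${MODEL} -o ${OUTBASE} -r ${READLEN} -p ${PSEUDOCOUNT} -m ${MAX_ITERS} -t ${TOLERANCE}"
echo "${RUN_CMD}"
```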

Narayanan

bdeonovic commented 9 years ago

Thank you for providing the details. I understand the software is still in development. I appreciate the support.