churchill-lab / emase

Expectation-Maximization algorithm for Allele-Specific Expression
http://churchill-lab.github.io/emase/
GNU General Public License v3.0
21 stars 13 forks source link

F1 hybrid docs ? #4

Closed colindaven closed 6 years ago

colindaven commented 8 years ago

Hi,

thanks for a nice looking program. I'd be keen to try out the approach on F1 hybrids, but the documentation is not complete for this. Do you have any hints ?

So far I have created diploid pseudogenomes with my own program. I would like to realign RNA-seq reads this with STAR (complex plant genome, bowtie insufficient).

Thanks, Colin

narayananr commented 8 years ago

Hi Thanks for the interest in using EMASE. Our current version of EMASE only supports aligning to transcriptome(s). Do you have annotations for the diploid genome? Then, the steps for F1 hybrids is exactly the same as exmple given for human data. http://emase.readthedocs.org/en/latest/examples.html#estimating-allele-specific-expression-from-human-rna-seq-data

colindaven commented 8 years ago

Yes, we have annotations.

I thought the method for F1 would be different, since this assumption is certainly not fulfilled, ie. we have no phased variant info, and certainly not for the whole population:

"

  1. Build an individualized genome

We assume there is a vcf file that contains phased variant information for every sample of your population. Unless we know which allele is M(aternal) or P(aternal), we are going to distinguish two alleles with suffices, L(eft) and R(ight). We also recommend to use different ${SAMPLE_DIR} for each sample: "

Thanks

narayananr commented 8 years ago

Sorry, I assumed that you have parental genomes (with their genetic variations). How did you create diploid pseudogenomes?

colindaven commented 8 years ago

My fault, I do.

I have trios, with rna-seq from the parents and the F1. I am pretty sure I don't have phased variant info from the whole population (wouldn't this be a similar dataresource to the human Hapmap for example) ?

Anyway, I tried the prepare step, and had a nasty gtf error. Seems my GTFs do not have exon_number as your ensembl mouse GTFs do. ` SBK4v2 SBKv2con_gmap exon 9583833 9584107 100 - . gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1"; SBK4v2 SBKv2con_gmap exon 9582471 9582811 100 - . gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1"; SBK4v2 SBKv2con_gmap exon 9581356 9581460 100 - . gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1"; SBK4v2 SBKv2con_gmap exon 9581219 9581266 100 - . gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1"; SBK4v2 SBKv2con_gmap exon 9579406 9579910 100 - . gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1"; SBK4v2 SBKv2con_gmap CDS 9582471 9582743 100 - 0 gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1"; SBK4v2 SBKv2con_gmap CDS 9581356 9581460 100 - 0 gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1"; SBK4v2 SBKv2con_gmap CDS 9581219 9581266 100 - 0 gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1"; SBK4v2 SBKv2con_gmap CDS 9579671 9579910 100 - 0 gene_id "g2634.t1.path1"; transcript_id "g2634.t1.mrna1"; gene_name "g2634.t1";

`

narayananr commented 8 years ago

So, there is no parental genome sequence data available? Is one of the parents' reference genome available? In the case, it is not ideal, but one call variants using parental RNA-seq data and use it build diploid genome/transcriptome. I assume that is what you did for getting the diploid F1 genomes. Can you share the GTF file, I will try to see what the problem is.

Thanks

colindaven commented 8 years ago

Thanks for your reply!

So, there is no parental genome sequence data available? Is one of the parents' reference genome available?

Parental genomic resequencing data is available, but not of sufficient quality to build decent contigs or indeed a reference genome.

In the case, it is not ideal, but one call variants using parental RNA-seq data and use it build diploid genome/transcriptome. I assume that is what you did for getting the diploid F1 genomes.

I exported parental SNPs in VCF from genomic requencing and used this to build the diploid genome only (dipl. transcriptome build failed).

Can you share the GTF file, I will try to see what the problem is.

Apologies, I can't share it for this example. I have a second example with public data I might be able to share in the future - I may get back to you on this one but don't want to take up too much of your time.

narayananr commented 8 years ago

sure. no problem