eturro / mmseq

Haplotype, isoform and gene level expression analysis using multi-mapping RNA-seq reads
GNU General Public License v2.0
67 stars 20 forks source link

No *.mmseq output #34

Open weishwu opened 5 years ago

weishwu commented 5 years ago

Hi,

I've run MMSeq on my samples. For only a few of them MMSeq output the .mmseq files. But for most of them these outputs (.mmseq, .genes.mmseq, and .identical.mmseq) are missing while the _gibbs.gz, .k and .M files are there. There is no error message and the end of the log is as below:

Found 116641 transcripts in 51528 transcript combinations. Transposing M (this may take a while)...done (83 seconds). Counting unique hits to sets of identical transcripts...done. Counting unique hits to genes...done. EM iteration 513, log likelihood ratio: 0.09979
Gibbs iteration 16383

What's the problem? How can I get the *.mmseq files?

Thanks,

eturro commented 5 years ago

Hi,

This is indicative of a crash. Perhaps you ran out of memory. Does re-running mmseq resolve this? If specific samples are causing this problem time over time, are you able to identify any distinguishing features of the samples that fail?

On 6 Sep 2019, at 13:57, weishwu notifications@github.com wrote:

Hi,

I've run MMSeq on my samples. For only a few of them MMSeq output the .mmseq files. But for most of them these outputs (.mmseq, .genes.mmseq, and .identical.mmseq) are missing while the _gibbs.gz, .k and .M files are there. There is no error message and the end of the log is as below:

Found 116641 transcripts in 51528 transcript combinations. Transposing M (this may take a while)...done (83 seconds). Counting unique hits to sets of identical transcripts...done. Counting unique hits to genes...done. EM iteration 513, log likelihood ratio: 0.09979 Gibbs iteration 16383

What's the problem? How can I get the *.mmseq files?

Thanks,

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

weishwu commented 5 years ago

Thanks for your prompt reply! Is there a way to give mmseq more memory or make it use less memory? I didn't find a parameter in mmseq for memory usage. Since it still output "-_mmseq.tracegibbs.gz" and "-_mmseq.gene.tracegibbs.gz" for every sample and there are 1024 posterior sample values, does this mean the Gibbs iteration was completed successfully and I could calculate log_mu from the posterior samples by myself?

Thanks,

eturro commented 5 years ago

Memory usage is not something you can control - mmseq uses no more than the memory it needs. If the trace_gibbs files are complete, that suggests the crash happened right at the very end, when the summaries are computed. You can in principle compute the log_mu from these files (but make sure you check your code works on other complete outputs), although it would be preferable to understand why the crash is happening.

On 6 Sep 2019, at 18:21, weishwu notifications@github.com wrote:

Thanks for your prompt reply! Is there a way to give mmseq more memory or make it use less memory? I didn't find a parameter in mmseq for memory usage. Since it still output "_mmseq.trace_gibbs.gz" and "_mmseq.gene.trace_gibbs.gz" for every sample and there are 1024 posterior sample values, does this mean the Gibbs iteration was completed successfully and I could calculate log_mu from the posterior samples by myself?

Thanks,

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

weishwu commented 5 years ago

I was able to compute log_mu from the gibbs posterior samples. However now I need to run mmcollapse for which I have to get the .mmseq outputs. I've rerun mmseq but for the same samples that failed before mmseq still crashed. I'm redoing something differently in the upstream and will see if it affects mmseq. In the meanwhile I just want to ask whether it is possible to create those .mmseq outputs (or just the parameters needed by mmcollapse) from the gibbs outputs. For each sample I have these outputs:

mmseq.identical.trace_gibbs.gz

mmseq.prop.trace_gibbs.gz

mmseq.gene.trace_gibbs.gz

mmseq.trace_gibbs.gz

I don't have but hope to get:

mmseq.mmseq

mmseq.gene.mmseq

eturro commented 5 years ago

Hi

This is getting complicated... you're trying to re-implement a substantial part of mmseq. Would you like me to try to generate the mmseq files myself from your fasta and bam file on a high-memory machine? You can email me directly

Ernest

On 3 Oct 2019, at 19:14, weishwu notifications@github.com wrote:

I was able to compute log_mu from the gibbs posterior samples. However now I need to run mmcollapse for which I have to get the .mmseq outputs. I've rerun mmseq but for the same samples that failed before mmseq still crashed. Is it possible to create those .mmseq outputs from the gibbs outputs? For each sample I have these outputs:

mmseq.identical.trace_gibbs.gz

mmseq.prop.trace_gibbs.gz

mmseq.gene.trace_gibbs.gz

mmseq.trace_gibbs.gz

I don't have:

mmseq.mmseq

mmseq.gene.mmseq

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

weishwu commented 5 years ago

Hi Ernest,

Thanks a lot for being willing to troubleshoot it for me. I've changed something upstream and am running indexing and alignment right now. Once I get the bam2hits outputs, I'll send you one of them to test. I've created diploid transcriptome fasta files based on Gencode v31 comprehensive annotations. Is this OK?

In the meanwhile, there is one fundamental question I want to ask:

My understanding of the modeling in MMSeq is that it first sums up the counts of reads for each set as described in Figure 2 of the paper, and then builds a Poisson model using the count data. Since we are using MMSeq to analyze ASE, I'm referring to panel (b) of Figure 2. Figure 2b shows an example that has only one SNP. In most cases we have multiple SNPs in a transcript. Since MMSeq takes the sum of reads, does this mean the SNPs with higher read counts will have a heavier effect on the final outputs? The reason why I'm asking this is that we are observing variation of SNP-level ASE amplitude. Our current approach is to calculate allelic-depth ratio at SNP level and then take a sort of average across all the SNPs in a transcript. In this way we are basically eliminating the effect of difference of total read counts at different SNPs. Will SNPs with higher read coverage tend to dominate the final ASE measurement in MMSeq?

Thanks,

eturro commented 5 years ago

You may include any transcript sequences in your FASTA file that you like, as long as the headers are properly formatted. I suggest not including multiple transcripts with exactly the same sequence, however.

In the context of ASE, the transcripts can have any number of SNPs - it really doesn't matter. MMSEQ will give you a coherent expression estimate for each transcript, irrespective of the level of sequence sharing they may have with other transcripts. When there is a lot of sequence similarity between a transcript and other transcripts, the posterior uncertainty will likely be greater than when a transcript has a very distinct sequence (so that most if not all reads that map to it do not map to any other transcript).

On 7 Oct 2019, at 22:09, weishwu notifications@github.com wrote:

Hi Ernest,

Thanks a lot for being willing to troubleshoot it for me. I've changed something upstream and am running indexing and alignment right now. Once I get the bam2hits outputs, I'll send you one of them to test. I've created diploid transcriptome fasta files based on Gencode v31 comprehensive annotations. Is this OK?

In the meanwhile, there is one fundamental question I want to ask:

My understanding of the modeling in MMSeq is that it first sums up the counts of reads for each set as described in Figure 2 of the paper, and then builds a Poisson model using the count data. Since we are using MMSeq to analyze ASE, I'm referring to panel (b) of Figure 2. Figure 2b shows an example that has only one SNP. In most cases we have multiple SNPs in a transcript. Since MMSeq takes the sum of reads, does this mean the SNPs with higher read counts will have a heavier effect on the final outputs? The reason why I'm asking this is that we are observing variation of SNP-level ASE amplitude. Our current approach is to calculate allelic-depth ratio at SNP level and then take a sort of average across all the SNPs in a transcript. In this way we are basically eliminating the effect of difference of total read counts at different SNPs. Will SNPs with higher read coverage tend to dominate the final ASE measurement in MMSeq?

Thanks,

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

weishwu commented 5 years ago

So if the paternal transcript and the maternal transcript have exactly the same sequence (because there is no hetSNP in this transcript to distinguish between them), I should not put it as transcript_A and transcript_B, but instead I should put a single transcript in the transcriptome fasta file, am I right?

As for my second question, I think I was asking about a different issue. I was wondering if the SNPs that have higher read counts will affect the final expression estimate more than the SNPs with lower read counts. Let me take an extreme example, suppose there are two hetSNPs (i and j) in a transcript, and SNP i has paternal allele read count = 80, and maternal allele read count =20, while SNP j has paternal allele read count = 2, and maternal allele read count =8. If I look at log2(pat/mat) at SNP level, it will be 2 at SNP i and -2 at SNP j, so they basically will offset each other, and thus there won't be allelic-biased expression for the entire transcript. However, in MMSeq approach, we will probably still have paternal allele-biased expression, because there are much more read counts at SNP i, so the ASE at SNP i will tend to dominate the ASE of the entire transcript. Is this true?

eturro commented 5 years ago

Yes, that's right - best exclude duplicate sequences.

Your example is a little contrived (because it assumes there is a huge variability in the Poisson rate across the transcript, which you'd hope wouldn't be the case if you've used random fragmentation and hexamer amplication) but, assuming reads do not align to any other transcripts, I think the answer is yes, as it should, because overall there is strong evidence that paternal expression > maternal expression.

On 8 Oct 2019, at 14:43, weishwu notifications@github.com wrote:

So if the paternal transcript and the maternal transcript have exactly the same sequence (because there is no hetSNP in this transcript to distinguish between them), I should not put it as transcript_A and transcript_B, but instead I should put a single transcript in the transcriptome fasta file, am I right?

As for my second question, I think I was asking about a different issue. I was wondering if the SNPs that have higher read counts will affect the final expression estimate more than the SNPs with lower read counts. Let me take an extreme example, suppose there are two hetSNPs (i and j) in a transcript, and SNP i has paternal allele read count = 80, and maternal allele read count =20, while SNP j has paternal allele read count = 2, and maternal allele read count =8. If I look at log2(pat/mat) at SNP level, it will be 2 at SNP i and -2 at SNP j, so they basically will offset each other, and thus there won't be allelic-biased expression for the entire transcript. However, in MMSeq approach, we will probably still have paternal allele-biased expression, because there are much more read counts at SNP i, so the ASE at SNP i will tend to dominate the ASE of the entire transcript. Is this true?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

weishwu commented 5 years ago

Thanks!