andrej-fischer / EMu

An Expectation-Maximization algorithm to infer mutational signatures
http://genomebiology.com/content/14/4/R39
GNU General Public License v3.0

Error running large no. of samples & spectra #2

Open yangchoo opened 10 years ago

yangchoo commented 10 years ago

Hi, first off, thanks for the awesome program!

Everything works fine for hundreds of samples, but I'm running into periodic determinant = 0 errors while running EMu with a large number of samples (~6000). [Err msg: In get_llhood for m=6***, det(Hf) = 0.... ]

The program works fine up to ~15 spectra, then such errors start occurring periodically. I am thus unable to run EMu to completion for anything beyond 15 spectra.

Let me know if you need my .mutations or .opp. file.

Thanks!

andrej-fischer commented 10 years ago

Hello,

This error can appear if you have very few mutations in a specific sample. If you try more signatures than there are channels occupied in a sample, the log-likelihood contribution for that sample cannot be computed (lines 785-803 in MutSpec.cpp).

I will look for a workaround. But do you really want that many signatures?
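
To illustrate why too few occupied channels lead to a zero determinant, here is a minimal sketch under stated assumptions. It uses a generic Poisson mixture model, not the actual code in MutSpec.cpp: the expected count in channel j is taken to be the sum over signatures of activity times signature probability, and opportunities and priors are ignored. Under that assumption the negative Hessian of the per-sample log-likelihood with respect to the activities is a sum of one rank-one term per occupied channel, so its rank cannot exceed the number of occupied channels. The names `neg_hessian` and `determinant` and all numbers are hypothetical.

```python
from fractions import Fraction as F

def neg_hessian(counts, signatures, activities):
    """Negative Hessian of sum_j [x_j*log(rate_j) - rate_j] w.r.t. activities,
    with rate_j = sum_a activities[a] * signatures[a][j] (assumed toy model)."""
    n_sig, n_ch = len(signatures), len(counts)
    rates = [sum(activities[a] * signatures[a][j] for a in range(n_sig))
             for j in range(n_ch)]
    H = [[F(0)] * n_sig for _ in range(n_sig)]
    for j in range(n_ch):
        if counts[j] == 0:
            continue  # empty channels contribute nothing to the Hessian
        scale = F(counts[j]) / rates[j] ** 2
        for a in range(n_sig):
            for b in range(n_sig):
                H[a][b] += scale * signatures[a][j] * signatures[b][j]
    return H

def determinant(matrix):
    """Exact determinant by Gaussian elimination on Fractions."""
    m = [row[:] for row in matrix]
    n, det = len(m), F(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return F(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

# Three hypothetical signatures over four channels (each row sums to 1).
signatures = [
    [F(1, 2), F(1, 4), F(1, 8), F(1, 8)],
    [F(1, 10), F(2, 5), F(3, 10), F(1, 5)],
    [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
]
activities = [F(3), F(2), F(1)]

sparse_sample = [4, 0, 1, 0]  # 2 occupied channels < 3 signatures
dense_sample = [4, 2, 1, 3]   # 4 occupied channels >= 3 signatures

det_sparse = determinant(neg_hessian(sparse_sample, signatures, activities))
det_dense = determinant(neg_hessian(dense_sample, signatures, activities))
print(det_sparse == 0)  # True: the matrix is singular
print(det_dense == 0)   # False
```

Exact fractions are used so that the singular case gives exactly zero rather than a small floating-point number. Whether EMu's `Hf` has exactly this form is an assumption; the sketch only shows the rank argument.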

yangchoo commented 10 years ago

Ah.. I see. Does that mean the contribution of a particular signature from a sample has to be non-zero? I am trying to compare EMu vs. NMF on a large dataset. NMF has been shown by Alexandrov to resolve ~27 signatures from his dataset, and I am trying to see if EMu can detect similar signatures from a similarly large dataset.

andrej-fischer commented 10 years ago

Hi yangchoo,

Thanks again for your question. It turns out it is aimed at the very heart of the EMu model. The technical reason for the error is described above, but underlying it is the assumption that all processes are, in principle, present in all samples. The case where some processes are strictly absent, i.e. their activity is zero, is not handled well by the current implementation. That is mainly due to a saddle point approximation which is used to calculate the log-likelihood but is not well defined for zero activity. An immediate fix for this bug will take a bit of time and testing, but will certainly be worth it. In the meantime, one option is to separate samples by cancer type, which was also done by Alexandrov et al.
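
As a rough sketch of the suggested workaround, the snippet below groups samples by cancer type and reports, for each group, the smallest number of occupied channels in any sample. Going by the explanation earlier in this thread, trying more signatures than that number in a group would be expected to trigger the error. The count matrix, the cancer type labels, and the function names are all hypothetical; reading real `.mutations` files is left out because their layout is not described here.

```python
from collections import defaultdict

def split_by_cancer_type(counts, cancer_types):
    """Group per-sample count rows by their cancer type label."""
    groups = defaultdict(list)
    for row, ctype in zip(counts, cancer_types):
        groups[ctype].append(row)
    return dict(groups)

def min_occupied_channels(rows):
    """Smallest number of non-zero channels over all samples in a group."""
    return min(sum(1 for c in row if c > 0) for row in rows)

# Toy data: 4 samples x 6 channels with made-up labels.
counts = [
    [5, 0, 3, 1, 0, 2],
    [0, 0, 1, 0, 0, 0],
    [2, 1, 0, 0, 4, 0],
    [7, 2, 2, 1, 3, 1],
]
cancer_types = ["breast", "lung", "breast", "lung"]

groups = split_by_cancer_type(counts, cancer_types)
limits = {ctype: min_occupied_channels(rows) for ctype, rows in groups.items()}
for ctype, limit in sorted(limits.items()):
    print(ctype, len(groups[ctype]), "samples; fewest occupied channels:", limit)
```

Each group can then be written to its own input file and run separately, with samples that have very few mutations either dropped or accepted as the limit on how many signatures can be tried.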