- The total number of mutations varies between the samples
The total number of mutations can indeed affect the results of NMF extraction. Samples with a higher number of mutations have more weight in the minimization. However, this most likely reflects reality, as we are more confident about the pattern of mutations in samples with higher numbers of mutations. For example, a sample with only a single C>T mutation at TpCpT may carry that mutation simply by chance. In contrast, a sample with 1000 C>T mutations at TpCpT may not have exactly 1000 such mutations, but (assuming a Poisson distribution of the errors) we have high confidence that it has between 950 and 1050. In principle, we do not recommend normalizing samples, as this would give samples with low numbers of mutations and samples with high numbers of mutations equal weight. Instead, SigProfiler leverages Poisson resampling in each of its iterations.
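For intuition, here is a minimal sketch of that kind of Poisson bootstrap (not SigProfiler's actual code; it assumes a channels-by-samples count matrix):

```python
import numpy as np

def poisson_resample(catalog, seed=None):
    """Draw one bootstrap replicate of a mutation catalog.

    Each count is replaced by a Poisson draw whose mean is the observed
    count, so high-burden samples have small relative uncertainty
    (sd/mean = 1/sqrt(count)) while low-burden samples fluctuate freely.
    """
    rng = np.random.default_rng(seed)
    return rng.poisson(catalog)

# Three samples with burdens of 1, 100, and 1000 on a single channel:
print(poisson_resample(np.array([1, 100, 1000]), seed=0))
```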
- The dataset is slightly unbalanced
Indeed, this is a common problem in NMF. SigProfiler has a hierarchical version that allows analysis of datasets with underlying structure. Please use the hierarchical option.
- Is there a parameter that allows adding sparsity during the extraction?
Currently, there is no sparsity penalty as part of the de novo extraction of mutational signatures. Our prior experience with such penalties was that the results can be highly dependent on the sparsity parameter, and we do not plan to implement one in the near future. However, we do have a sparsity parameter for the assignment of mutational signatures. It reduces/removes the signature bleeding of the decomposed solution (i.e., when the de novo extracted signatures are matched against the known set of consensus signatures).
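As a rough illustration of what sparsity at assignment time can look like (this is not SigProfiler's implementation; `sparse_refit` and `rel_tol` are hypothetical names), one can drop signatures whose removal barely changes the non-negative least-squares residual:

```python
import numpy as np
from scipy.optimize import nnls

def sparse_refit(profile, signatures, rel_tol=0.05):
    """Keep only signatures whose removal raises the NNLS residual by
    more than `rel_tol` (relative), then refit on the survivors.
    A crude stand-in for a sparsity penalty at assignment time."""
    n_sigs = signatures.shape[1]
    _, full_res = nnls(signatures, profile)
    keep = []
    for i in range(n_sigs):
        others = [j for j in range(n_sigs) if j != i]
        _, res = nnls(signatures[:, others], profile)
        if res > full_res * (1 + rel_tol):  # signature i carries real signal
            keep.append(i)
    exposures = np.zeros(n_sigs)
    if keep:
        exposures[keep], _ = nnls(signatures[:, keep], profile)
    return exposures
```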
I am currently working on signature extraction for a neuroblastoma cancer dataset. The distinctive feature of our dataset is its heterogeneity: our samples (WGS) come from both untreated and treated patients. The profiles I am working with are therefore very different, which leads to some theoretical issues.
1. The total number of mutations varies between the samples
Because I am working on untreated and treated samples, the mutational burden varies a lot between samples. The result is that SigProfiler primarily minimizes the error of the samples with a high number of mutations. When I plot error_by_sample = f(total_number_of_mutations_by_sample), I get a strong correlation, suggesting that the mutational burden biases the results (a minimal way to reproduce this diagnostic is sketched below). This is not surprising given both the cost function and the update rules.
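A self-contained sketch of that diagnostic on synthetic data (all numbers are illustrative, and the NMF here is scikit-learn's, not SigProfiler's):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic stand-in: 96 channels x 60 samples with very uneven burdens
burden = rng.integers(100, 20_000, size=60)
profiles = rng.dirichlet(np.ones(96), size=60).T   # one profile per column
catalog = rng.poisson(profiles * burden).astype(float)

model = NMF(n_components=4, init="nndsvda", max_iter=500)
W = model.fit_transform(catalog)
error_by_sample = np.linalg.norm(catalog - W @ model.components_, axis=0)

print("corr(burden, error):", np.corrcoef(burden, error_by_sample)[0, 1])
plt.scatter(burden, error_by_sample)
plt.xlabel("total mutations per sample")
plt.ylabel("reconstruction error per sample")
plt.show()
```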
Potential solution and question: since this is work on mutational profiles, I have normalized all the samples in order to remove the mutational-burden bias (giving the same total number of mutations to each sample), as in the sketch below. Do you think this is a good idea? Am I missing a biological aspect by doing this?
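The normalization in question amounts to rescaling each sample (column) to a common total; a minimal sketch, assuming a channels-by-samples count matrix and a hypothetical `target` total:

```python
import numpy as np

def normalize_burden(catalog, target=10_000):
    """Rescale each sample (column) to the same total mutation count.

    This removes the burden-driven weighting, but it also discards the
    extra statistical confidence that high-burden samples carry and
    breaks the Poisson count model that resampling relies on.
    """
    totals = catalog.sum(axis=0, keepdims=True)
    return catalog / totals * target
```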
2. The dataset is slightly unbalanced
The dataset I am working on is slightly unbalanced. The result is that the algorithm seems to focus hierarchically on the parts of the dataset that are more represented. In particular, when looking at the signatures obtained for different values of the total number of mutations, it seems that the distribution of the dataset biases the extraction. Therefore, it is possible that, for a given value of K, I overfit one part of the dataset while underfitting the other.
Question: Do you think that, in such a case, I can trust the optimal value of K?
3. Is there a parameter that allows adding sparsity during the extraction?
Because the dataset is heterogeneous, some signatures are present in only part of the dataset (the signatures associated with a type of chemotherapy are present only in the treated samples). The problem is that these signatures distort the signatures that are present across the entire dataset. I observe their clear influence when I compare an extraction on the entire dataset with an extraction on the untreated samples only.
Potential solution and question: one solution could be to add sparsity during the extraction (on the exposures). I have seen that you have created a parameter "penalty" to set the degree of sparsity in the suggested solution. If I have understood your code correctly, this penalty is only applied during the fitting part of the algorithm; during the extraction, the cost function minimized is an L2 norm or a KL divergence without any sparsity penalty. Question: Am I right that there is no parameter to add sparsity during the extraction? Is there a way, with this implementation, to fight the bleeding of signatures?
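For reference, an extraction-time L1 penalty on the exposures looks like this with scikit-learn's NMF (an illustration of the idea being asked about, not a SigProfiler feature; the penalty value is arbitrary and, as noted in the answer above, results can be very sensitive to it; `alpha_W`/`alpha_H` require scikit-learn >= 1.0):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
catalog = rng.poisson(5.0, size=(96, 60)).astype(float)  # toy stand-in

# alpha_H > 0 with l1_ratio=1.0 applies a pure L1 penalty to H (the
# exposures) and none to W (the signatures), encouraging zero exposures.
model = NMF(n_components=4, init="nndsvda", max_iter=500,
            alpha_W=0.0, alpha_H=0.1, l1_ratio=1.0)
W = model.fit_transform(catalog)
H = model.components_
print("fraction of near-zero exposures:", np.mean(H < 1e-6))
```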