[question] How to interpret the y-axis of exposure plots?

tobsecret commented 5 years ago

I'm looking at some exposures I have calculated, but I'm a bit confused as to how I should interpret the y-axis. I got 5 signatures, and for none of them do I have any exposure values larger than 10%. So they're definitely not summing up to 1, which means they're not indicative of the percentage of SNPs caused by a signature in an individual sample. But then again, the individual exposure values are all below 1, so they are not indicative of the number of mutations caused by an individual signature in an individual sample, either. The vignette from the Bioconductor page also shows really low values (8e-06) for the exposures. Are these p-values? Not being a mathemagician, I am looking through the signeR paper now, but honestly I am at a bit of a loss as to how to interpret the y-axis on the exposure plots.

rvalieris commented 5 years ago

I forwarded this question to Rodrigo, here is his answer:

Thanks for your interest in signeR.

Exposures are really hard to interpret. First, let's consider an analysis without opportunity. In this case, if a sample has exposures e_i, i = 1,2,3..., we could say that the number of mutations generated by each signature i would be estimated by e_i * sum(P_i), where sum(P_i) means the sum of the entries forming signature i. However, signatures are normalized so that sum(P_i) = 1, so we can consider each e_i as the expected number of mutations caused by signature i in this sample.

However, calculations are more cumbersome when mutational opportunities are considered. In this case, if a sample has exposures e_i, i = 1,2,3..., and opportunity O, the number of mutations generated by each signature i would be estimated by e_i * sum_j(P_ij * O_j), 1<=j<=96. That is why we find really low values for exposures: the opportunity values used are very large.

It may be easier to interpret the expected number of mutations generated by each signature in each sample: take the signature matrix P, transpose it and multiply by the vector of opportunities for the sample (considered as a column matrix with 96 entries). This will result in a column vector of n entries, where n is the number of signatures. Then multiply each of its entries by the respective exposure, and you will have the vector of expected mutation counts.
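
The steps above can be sketched in a few lines. This is only an illustration in plain Python, not signeR code: the function name `expected_counts` and all the numbers are made up, and 3 mutation types stand in for the real 96.

```python
def expected_counts(P, opportunities, exposures):
    """Expected mutation counts per signature for one sample.

    P: rows are mutation types, columns are signatures (each column sums to 1).
    opportunities: one opportunity value per mutation type for this sample.
    exposures: one exposure value per signature for this sample.
    """
    n_types = len(P)
    counts = []
    for i, e_i in enumerate(exposures):
        # Entry i of (P transposed) times O, i.e. sum_j(P_ij * O_j)
        weighted = sum(P[j][i] * opportunities[j] for j in range(n_types))
        counts.append(e_i * weighted)
    return counts


# Hypothetical toy inputs (assumed values, not taken from any real dataset)
P = [[0.5, 0.2],
     [0.3, 0.3],
     [0.2, 0.5]]
O = [1_000_000, 2_000_000, 4_000_000]
e = [8e-06, 2e-06]

counts = expected_counts(P, O, e)
print(counts)  # approximately [15.2, 5.6]
```

With these toy numbers, exposures on the order of 1e-06 turn into expected counts of roughly 15 and 6 mutations, because the opportunities are in the millions.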

We are considering implementing a function to plot these expected counts instead of the raw exposures. They could be more easily understood, don't you agree?

I hope I clarified things, please let me know if you still have any issues.

Kind regards,

Rodrigo

tobsecret commented 5 years ago

Thanks Rodrigo, I had forgotten that I had done the correction using the opportunity matrix. I agree, it would make sense to have a convenience method for plotting the opportunity-corrected exposures, i.e. how many mutations one would expect for the given signature in each genome if the trinucleotide content was perfectly balanced.

tobsecret commented 5 years ago

I was also wondering why this produces error bars? Where do the error bars on each of the points come from? Is this the reconstruction error?

rvalieris commented 5 years ago

this is a direct result of the nature of the Bayesian method used. the matrices P and E (signatures and exposures) are each sampled many times from their posterior distribution given the data. at the end we have an array of many P and E matrices, and for each entry the sampled values are combined into a single boxplot on the plot: the median is the consensus and the error bars show the variability.
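
As a rough illustration of that idea (not signeR's actual plotting code), here is a small Python sketch that summarizes each entry of E across posterior draws. The draws are invented, `summarize_entries` is a hypothetical helper, and min/median/max is used as a simple stand-in for whatever quartile and whisker rule the real boxplots apply.

```python
import statistics

# Hypothetical posterior draws of E: each draw is a signatures x samples matrix
E_draws = [
    [[4.1, 0.9], [1.8, 3.2]],
    [[5.0, 1.1], [2.2, 2.9]],
    [[4.7, 1.0], [2.0, 3.5]],
]


def summarize_entries(draws):
    """Return a matrix of (min, median, max) tuples, one per entry of E."""
    n_rows, n_cols = len(draws[0]), len(draws[0][0])
    summary = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Collect the same entry from every sampled matrix
            values = [d[r][c] for d in draws]
            row.append((min(values), statistics.median(values), max(values)))
        summary.append(row)
    return summary


summary = summarize_entries(E_draws)
print(summary[0][0])  # (4.1, 4.7, 5.0)
```

Each tuple corresponds to one point on the plot: the middle value is the consensus estimate, and the spread between the outer values reflects how much that entry varies across draws.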

tobsecret commented 5 years ago

Ooooh, that makes a ton of sense! thanks so much for the explanation :pray:

TojalLab / signeR

[question] How to interpret the y-axis of exposure plots? #10