Nik-Zainal-Group / signature.tools.lib

R package containing useful functions for mutational signature analysis

Choosing the correct number of rearrangement signatures utilizing SilWid and norm.Error #37

Closed riazgillani closed 2 years ago

riazgillani commented 2 years ago (11 Apr 2022)

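[image: Globus_NBL_signature_extraction_4_5_22_with_200_repeatsSigs_OverallMetrics_TARGET_NBL_nboots20_MC] https://user-images.githubusercontent.com/54368465/162801653-8b420e8e-ae99-4cd1-a8bf-4d9d710591e6.jpg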

Hello - thank you for this very helpful tool. I am extracting de novo rearrangement signatures from a cohort of ~130 tumors with somatic structural variant calls (using 200 repeats and 20 bootstraps). My interpretation of this output is that there is a stable solution at 4 signatures, as this is the largest number of signatures at which SilWid remains relatively stable.

Is this a reasonable interpretation? How would you recommend thinking about incorporating the norm.Error and norm.Error (orig. cat.) variables into determining the correct number of signatures, especially as both of these values seem high at 4 signatures and then diverge?

andreadega commented 2 years ago

Thanks for reaching out,

The norm.Error and norm.Error (orig. cat.) values are comparable: they are scaled together so that the maximum of their combined set of values equals 1. We normalised them this way so that their lines fit on this plot, where all other metrics have a maximum of 1.
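To make the joint scaling concrete, here is a minimal Python sketch. It is not the package's code: the function name `scale_jointly` and the raw error values are made up for illustration. The point is only that both curves are divided by the same number, so they stay directly comparable after scaling.

```python
def scale_jointly(err_boot, err_orig):
    """Divide both error curves by the maximum taken over both of them."""
    top = max(max(err_boot), max(err_orig))
    if top == 0:
        # Nothing to scale if every error is zero.
        return list(err_boot), list(err_orig)
    return [e / top for e in err_boot], [e / top for e in err_orig]


# Hypothetical raw errors for 3, 4, 5 and 6 signatures (made-up numbers).
raw_boot = [40.0, 30.0, 20.0, 10.0]   # error vs bootstrapped catalogues
raw_orig = [42.0, 44.0, 48.0, 50.0]   # error vs original catalogue

norm_boot, norm_orig = scale_jointly(raw_boot, raw_orig)
print(norm_boot)
print(norm_orig)
```

Because a single shared factor is used, the gap between the two lines on the plot reflects a real difference in error rather than an effect of the scaling.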

That said, these two errors compare the NMF solutions to either the bootstrapped catalogues or the original catalogue from which the bootstraps are sampled. Increasing the number of signatures will always reduce the error with respect to the bootstrapped catalogue, but this may also be due to overfitting. The fact that the error with respect to the original catalogue increases suggests that overfitting has already been reached at 3 or 4 signatures. I would suggest running the extraction again starting from 2 or even 1 signature to see exactly where the inflection point of norm.Error (orig. cat.) is.
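Once you have the error against the original catalogue for each number of signatures, locating the turning point can be sketched as follows. This is an illustration in Python with made-up values, not output from the package; `first_rise` and `tol` are hypothetical names.

```python
def first_rise(n_sigs, err_orig, tol=0.0):
    """Return the number of signatures just before err_orig first increases
    by more than tol, or None if it never increases."""
    for i in range(1, len(err_orig)):
        if err_orig[i] - err_orig[i - 1] > tol:
            return n_sigs[i - 1]
    return None


# Hypothetical values for a run starting at 1 signature (made-up numbers).
n_sigs = [1, 2, 3, 4, 5, 6]
err_orig = [0.90, 0.70, 0.72, 0.80, 0.88, 0.95]

turning_point = first_rise(n_sigs, err_orig)
print(turning_point)
```

With these made-up values the error starts rising after 2 signatures. A small positive `tol` can be used to ignore rises that are within the noise of the bootstraps; how large it should be is a judgment call.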

While the ASW (average silhouette width) seems high, the possible overfitting seems to indicate that there are perhaps only 2-3 signatures, which seems a little low for 130 samples. I suggest looking at the sample catalogues to get a sense of whether this is correct. It could be, for example, that a signature is present in only a few samples, which would make it acceptable to keep that additional signature at 4 even though most other samples become somewhat overfitted. But this is just a guess.
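One rough way to check the "present in only a few samples" scenario is to count, for each signature, how many samples it contributes to noticeably. The sketch below assumes you have a samples-by-signatures exposure table from the 4-signature solution. The table, the function name `samples_per_signature` and the 10% cut-off are all hypothetical choices for illustration.

```python
def samples_per_signature(exposures, min_fraction=0.1):
    """For each signature, count the samples in which it accounts for at
    least min_fraction of that sample's total exposure."""
    counts = [0] * len(exposures[0])
    for row in exposures:
        total = sum(row)
        if total == 0:
            continue  # skip samples with no assigned events
        for k, value in enumerate(row):
            if value / total >= min_fraction:
                counts[k] += 1
    return counts


# Hypothetical exposures: 5 samples (rows) x 4 signatures (columns).
exposures = [
    [50, 30, 20, 0],
    [60, 40, 0, 0],
    [15, 65, 20, 0],
    [30, 30, 40, 0],
    [5, 5, 5, 85],
]

counts = samples_per_signature(exposures)
print(counts)
```

In this made-up table the fourth signature is carried by a single sample, which is the kind of pattern that could justify keeping an extra signature despite the overfitting signal.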
