AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Take "sufficiently non-zero" approach to exposures for CNS signature fitting #1192

Closed jaclyn-taroni closed 3 years ago

jaclyn-taroni commented 3 years ago

Purpose/implementation Section

What scientific question is your analysis addressing?

Still working on #1173. I've been thinking about #1100. Specifically, this quote:

Since we are just fitting known signatures to samples, all samples have a non-0 burden, and therefore bubble plots used in other circumstances aren't as interesting here (bubbles are usually sizes by % of samples containing signature, but everything here is "1").

I was vaguely aware that near-zero values are sometimes set to zero when fitting mutational signatures: deconstructSigs::whichSignatures() uses a cutoff of 0.06 by default.

sigfit, the package we use to fit the adult CNS signatures, will return the "highest posterior density (HPD) interval for each signature exposure in each sample (HPD intervals are the Bayesian alternative to classical confidence intervals)" (ref).

Further quoting from the sigfit vignette here:

It is difficult for the model to make hard assignments of which signatures are present or absent due to the non-negative constraint on the estimate, which means that the range of values in the sample will not normally include zero. In practice, ‘sufficiently non-zero’ means that the lower end of the Bayesian HPD interval (see the previous section) is above a threshold value close to zero (by default 0.01, and adjustable via the thresh argument).

And lo and behold, we've seen exposures whose lower end of the interval is < 0.01 set to zero before in this repo! https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/678/files#diff-83a49ed371610e147e8d2a5560157f05a013aaa4f2d67c81b5164da7e2687140R436

All of this is to say: I've added a step to the CNS signature fitting where near-zero exposures (using the definition above) are set to zero.
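To make the rule concrete, here is a minimal sketch in plain Python. It is not the code added in this PR and does not call sigfit. The function name `zero_near_zero_exposures`, the nested-list layout (one row per sample, one column per signature), and the example numbers are all hypothetical. It assumes we already have the mean exposures and the lower ends of the HPD intervals, and it treats a lower end strictly below the threshold as "not sufficiently non-zero".

```python
def zero_near_zero_exposures(mean, lower, thresh=0.01):
    """Zero out exposures whose lower HPD bound falls below `thresh`.

    `mean` and `lower` are lists of rows (samples) of equal shape,
    with one column per signature. Hypothetical helper, not sigfit API.
    """
    return [
        [m if lo >= thresh else 0.0 for m, lo in zip(mean_row, lower_row)]
        for mean_row, lower_row in zip(mean, lower)
    ]


# Made-up values for two samples and three signatures.
mean_exposures = [
    [0.70, 0.25, 0.05],
    [0.40, 0.02, 0.58],
]
lower_bounds = [
    [0.60, 0.15, 0.004],
    [0.30, 0.001, 0.45],
]

thresholded = zero_near_zero_exposures(mean_exposures, lower_bounds)
```

In this made-up example, the third signature in the first sample and the second signature in the second sample are set to zero, because their lower bounds (0.004 and 0.001) are below 0.01. The other exposures are left unchanged.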

I am also skipping a step where samples with fewer than 50 mutations get filtered out. There are non-zero exposures for those samples, so it's not clear to me that sigfit is removing them. I also find the bubble plot to be a useful visualization, and I think those samples should be included when we calculate the proportion of tumors in which a signature is present.
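As a rough illustration of the proportion the bubble plot would use, here is a small sketch in plain Python. The function name `proportion_present`, the data layout (one row per sample, one column per signature), and the values are hypothetical, and it assumes near-zero exposures have already been set to exactly zero, so "present" simply means an exposure greater than zero.

```python
def proportion_present(exposures):
    """Return, per signature, the fraction of samples with a non-zero exposure.

    `exposures` is a list of rows (samples), one column per signature.
    Hypothetical helper for illustration only.
    """
    n_samples = len(exposures)
    n_signatures = len(exposures[0])
    return [
        sum(1 for row in exposures if row[j] > 0) / n_samples
        for j in range(n_signatures)
    ]


# Made-up exposures for three samples and three signatures,
# with near-zero values already set to 0.0.
example_exposures = [
    [0.70, 0.25, 0.0],
    [0.40, 0.0, 0.58],
    [0.90, 0.0, 0.10],
]

proportions = proportion_present(example_exposures)
```

With these made-up values, the first signature is present in all three samples, the second in one of three, and the third in two of three. Without the zeroing step every proportion would be 1, which is the problem the quote from #1100 describes.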
