[question] Can one normalize the mutation matrix by the opportunity matrix upfront and expect the same results as if one divided mut by?

TojalLab / signeR

https://bioconductor.org/packages/signeR

13 stars 3 forks source link

[question] Can one normalize the mutation matrix by the opportunity matrix upfront and expect the same results as if one divided mut by? #13

Closed tobsecret closed 5 years ago

tobsecret commented 5 years ago

I am asking this because I want to combine data from two closely related species with wildly different AT contents, in order to figure out if there are any shared mutational signatures between them. Is it functionally the same to input mut and opp into signeR vs inputting only mut, but dividing the rows by opp?

The opportunities are either known or set to W = 1, and hence regarded as fixed parameters. To simplify notation we omit any further reference to W. An expression for ℒ(θ;m) is presented as supplementary material by the equation (s1).

This makes me believe it is but I am not sure I am understanding this correctly.

rvalieris commented 5 years ago

No sorry this won't work, the algorithm assumes the mutation counts inputs are whole numbers, the opportunity matrix is used as weights in the decomposition of signatures x expositions.

However If I understood this correctly, you can use the opportunity matrix for this, each row of the opportunity matrix will be used for the respective sample on the same row of the count matrix, so you can have samples with different opportunities as long as each row is correctly set for each sample.

on the vignette, the genOpportunityFromGenome generates a full matrix for a single genome, but you generate a opp matrix for each species you have (set a specific number of samples on nsamples), and concatenate the rows of all opp matrices.

tobsecret commented 5 years ago

Aaaaah, yes that should work! Awesome, I'll try that, thanks a ton!

tobsecret commented 5 years ago

So let's say I have two different vcf files, with species 1 and 2, in the same naming scheme as the vignette, the following should work?

# first genome and vcf
target_regions1 <- import(con="/path/to/a/target1.bed", format="bed")
mygenome1 <- FaFile("/path/to/genome1.fasta")
vcfobj1 <- readVcf("/path/to/a/species1.vcf", mygenome1)
mut1 <- genCountMatrixFromVcf(mygenome1, vcfobj1)
opp1 <- genOpportunityFromGenome(mygenome1, target_regions1, nsamples=nrow(mut1))
# second genome and vcf
target_regions2 <- import(con="/path/to/a/target2.bed", format="bed")
mygenome2 <- FaFile("/path/to/genome2.fasta")
vcfobj2 <- readVcf("/path/to/a/species2.vcf", mygenome2)
mut2 <- genCountMatrixFromVcf(mygenome2, vcfobj2)
opp2 <- genOpportunityFromGenome(mygenome2, target_regions2, nsamples=nrow(mut2))
# combine muts and opps
mut <- rbind(mut1, mut2)
opp <- rbind(opp1, opp2)

rvalieris commented 5 years ago

yes that looks right !

tobsecret commented 5 years ago

This works, at least in my test on a subset of my data! Thanks again for the support!