how to pick the number of signature based on the index.

ShixiangWang / sigminer

🌲 An easy-to-use and scalable toolkit for genomic alteration signature (a.k.a. mutational signature) analysis and visualization in R https://shixiangwang.github.io/sigminer/reference/index.html

https://shixiangwang.github.io/sigminer/


how to pick the number of signature based on the index. #428

Closed songeric1107 closed 1 year ago

songeric1107 commented 1 year ago

I used this function to extract the mutation signatures based on MAF files.

e2 <- bp_extract_signatures(signature.syn, n_nmf_run = 50)

syn_estimate.55 copy.pdf

I feel like there is no good consensus signature number for this dataset. What's your opinion?

ShixiangWang commented 1 year ago

A signature number around 25 is a good choice, as the result converges there. You can also try the other approaches in sigminer to explore a proper signature number.

songeric1107 commented 1 year ago

Thank you for your quick response. Why is 25 a good point? Is that based on the silhouette or consensus score, or on another index? How should these indices be used to determine the most representative number? Why not 11? I see a drop after 11 in the cophenetic score. Will more signatures overfit the results? Thank you.

ShixiangWang commented 1 year ago

Yeah, 11 is OK. Based on the NMF survey plot, you can choose a proper signature number from your own reading of specific measures such as the cophenetic or silhouette score. However, the signature number cannot be determined automatically.

Try using sig_auto_extract with Bayesian NMF, or use bootstrapped NMF, to get a more robust estimate of the matrix decomposition (https://shixiangwang.github.io/sigminer/articles/sigminer.html).

In general, we want the extracted signatures to be distinct from each other while keeping the reconstruction error of the matrix decomposition low.
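As a rough illustration of those two criteria, the sketch below computes a relative reconstruction error and the pairwise cosine similarity between signatures for a toy decomposition. The matrices `W` (components x signatures) and `H` (signatures x samples), their values, and the helper names `recon_error` and `pairwise_cosine` are all made up for illustration; this is plain base R, not how sigminer computes its survey measures.

```r
# Toy decomposition V ~ W %*% H (all values are illustrative, not real data).
W <- matrix(c(0.7, 0.2, 0.1, 0.0,
              0.0, 0.1, 0.3, 0.6), ncol = 2)      # 4 components x 2 signatures
H <- matrix(c(10, 0, 5,
               0, 8, 5), nrow = 2, byrow = TRUE)  # 2 signatures x 3 samples
V <- W %*% H                                      # catalog that W and H reproduce exactly

# Relative reconstruction error (Frobenius norm); lower is better.
recon_error <- function(V, W, H) {
  norm(V - W %*% H, type = "F") / norm(V, type = "F")
}

# Pairwise cosine similarity between signature columns; off-diagonal values
# near 1 suggest redundant (poorly separated) signatures.
pairwise_cosine <- function(W) {
  Wn <- sweep(W, 2, sqrt(colSums(W^2)), "/")  # scale each column to unit length
  crossprod(Wn)                               # t(Wn) %*% Wn
}

recon_error(V, W, H)
pairwise_cosine(W)
```

Adding signatures tends to lower the reconstruction error but can push the off-diagonal similarities up, which is the trade-off being weighed when reading the survey plot.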

songeric1107 commented 1 year ago

I did try the Bayesian NMF method; the suggested signature number is 4 (proj2$suggested), but I am not sure how that number is selected. If you check the consensus plot, 4 signatures do not make sense to me. signature.syn.consense.pdf

songeric1107 commented 1 year ago

Sorry, I also have another question. If I have datasets from two groups, should I combine them for mutational signature analysis? I could then check how the contribution of each identified signature differs between groups. Or should I extract signatures separately for each group? Thanks.

ShixiangWang commented 1 year ago

signature.syn.consense.pdf

Have you tried a larger initial signature number (K0) in sig_auto_extract()? The default value is 25, which may not fit your data; you can try the sample number minus 1.

ShixiangWang commented 1 year ago

"Sorry, I also have another question. If I have datasets from two groups, should I combine them for mutational signature analysis? I could then check how the contribution of each identified signature differs between groups. Or should I extract signatures separately for each group? Thanks."

For comparison purposes, combining the data is recommended.
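A minimal base R sketch of the combining step, assuming each group is a sample-by-component count matrix like the `signature.syn` input used in this thread. The matrices `group_a` and `group_b` and their values are hypothetical; the point is only to align columns by name and to keep a group label for each sample.

```r
# Two hypothetical sample-by-component count matrices (toy values).
components <- c("A[C>A]A", "A[C>A]C", "A[C>A]G")
group_a <- matrix(c(3, 0, 1,
                    2, 1, 0), nrow = 2, byrow = TRUE,
                  dimnames = list(c("a1", "a2"), components))
group_b <- matrix(c(0, 4, 1), nrow = 1,
                  dimnames = list("b1", rev(components)))  # columns in a different order

# Align columns by name before stacking, and keep the group label per sample.
combined <- rbind(group_a, group_b[, colnames(group_a), drop = FALSE])
group <- factor(c(rep("A", nrow(group_a)), rep("B", nrow(group_b))))

combined
group
```

The combined matrix would then go through a single extraction, so that both groups are described by the same set of signatures, and `group` can be used afterwards to compare exposures.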

songeric1107 commented 1 year ago

Thank you. I tried a large initial signature number (sample number - 1), but only 1 signature is returned.

mt_sig2 <- sig_auto_extract(signature.syn, K0 = 56, nrun = 30, strategy = "stable", cores = 2)

Progress: ──────────────────────────────────── 100%
Select Run 5, which K = 1 as the best solution.

ShixiangWang commented 1 year ago

That's truly strange. Could you show a subset of your data, like signature.syn[1:5, 1:5]?

Also could you try

e1 <- bp_extract_signatures(signature.syn, range = 5:30)

bp_show_survey2(e1)

songeric1107 commented 1 year ago

signature.syn[10:25, 1:5]

    A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A
s1        0       0       0       0       0
s2        0       0       0       0       0
s3        0       0       0       0       0
s4        0       0       0       0       0
s5        0       0       0       0       1
s6        0       0       0       0       0
s7        0       0       0       0       0
s8        0       0       0       0       0
s9        0       0       0       0       0
s10       0       0       0       0       0
s11       0       0       0       0       0
s12       0       0       0       0       0
s13       0       0       0       0       0
s14       0       0       0       0       0
s15       0       0       0       0       0
s16       0       0       0       0       0
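
The rows shown above are almost entirely zero. A quick way to see whether that holds for the whole matrix is to summarise per-sample mutation counts. This is a base R sketch that assumes `signature.syn` is a samples-by-components count matrix; it uses a small stand-in matrix `toy` (made-up values) so that it runs on its own.

```r
# Stand-in for signature.syn: samples in rows, mutation components in columns.
toy <- matrix(0, nrow = 4, ncol = 5,
              dimnames = list(paste0("s", 1:4),
                              c("A[C>A]A", "A[C>A]C", "A[C>A]G", "A[C>A]T", "C[C>A]A")))
toy["s2", "C[C>A]A"] <- 1
toy["s4", c("A[C>A]A", "A[C>A]T")] <- c(2, 3)

per_sample    <- rowSums(toy)                        # total mutations per sample
zero_fraction <- mean(toy == 0)                      # share of empty cells in the matrix
empty_samples <- names(per_sample)[per_sample == 0]  # samples with no mutations at all

summary(per_sample)
zero_fraction
empty_samples
```

Replacing `toy` with the real matrix gives a first impression of how much signal each sample contributes; samples with very few mutations may be one reason an extraction collapses to very few signatures.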

songeric1107 commented 1 year ago

Sorry for another question. If I get two drop points, which one should I pick? sig.all (dragged).pdf

songeric1107 commented 1 year ago

Meanwhile, if I would like to compare signatures between two groups, is Fisher's exact test appropriate for comparing the exposure counts between groups? Thanks.

ShixiangWang commented 1 year ago

"Sorry for another question. If I get two drop points, which one should I pick? sig.all (dragged).pdf"

Based on the plot, you can try 7. And analyze if the obtained 7 mutational signatures could be well mapped to COSMIC reference signatures.

ShixiangWang commented 1 year ago

"Meanwhile, if I would like to compare signatures between two groups, is Fisher's exact test appropriate for comparing the exposure counts between groups? Thanks."

If you categorize the signature exposure into a binary variable, Fisher's exact test is a good choice.

If you compare the exposure values directly, just use wilcox.test.
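
Both options can be sketched in base R as follows. The exposure values are simulated, and the `cutoff` used to binarize them is a hypothetical threshold chosen only for illustration; with real data the exposures would come from the extraction result.

```r
set.seed(1)
# Simulated exposures of one signature in two groups (illustrative only).
exposure <- c(rpois(20, lambda = 5), rpois(20, lambda = 12))
group <- factor(rep(c("A", "B"), each = 20))

# Direct comparison of the exposure values: Wilcoxon rank-sum test.
# exact = FALSE avoids the warning about ties in count data.
w <- wilcox.test(exposure ~ group, exact = FALSE)

# Binary version: call the signature "present" if exposure exceeds a cutoff.
cutoff <- 8                                   # hypothetical threshold
present <- factor(exposure > cutoff, levels = c(FALSE, TRUE))
f <- fisher.test(table(group, present))

w$p.value
f$p.value
```

When several signatures are tested this way, the p-values can be adjusted with `p.adjust()`.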

songeric1107 commented 1 year ago

"Based on the plot, you can try 7. And analyze if the obtained 7 mutational signatures could be well mapped to COSMIC reference signatures."

-- May I ask why you chose 7 and not 2?

-- What exactly is meant by "well mapped"? A similarity score compared with the SBS reference signatures?

ShixiangWang commented 1 year ago

For point 1, 7 keeps a high silhouette (stability) together with a low reconstruction error.

For point 2, yes, in general, cosine similarity > 0.8 could be considered as well mapped.
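
To make the measure concrete, here is a base R sketch of cosine similarity between an extracted profile and a reference profile. The helper `cosine_sim` and the 6-channel toy profiles are made up for illustration (real SBS profiles have 96 channels); sigminer has its own `get_sig_similarity()` for comparing extracted signatures against COSMIC references, which this sketch does not reproduce.

```r
# Cosine similarity between two non-negative profiles of equal length.
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Toy 6-channel profiles (illustrative values only).
extracted <- c(0.40, 0.30, 0.10, 0.10, 0.05, 0.05)
reference <- c(0.35, 0.35, 0.10, 0.10, 0.05, 0.05)

sim <- cosine_sim(extracted, reference)
sim
sim > 0.8   # rule of thumb from this thread for "well mapped"
```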