Information on Synthetic datasets used in the Paper

Rashesh7 commented 7 years ago

Hello,

I was looking for the truth set for the 21 Breast cancer synthetic datasets that were used in the paper.

Also, can you please elaborate on how you generate the dataset with opportunities?

rvalieris commented 7 years ago

Hello Rashesh,

Thank you for your interest in signeR.

As briefly described in section 4.2 of the paper (https://doi.org/10.1093/bioinformatics/btw572), we built the signature matrix P from four signatures described in COSMIC that are commonly found in breast cancer genomes: signatures 1, 2, 3 and 13. We then performed two different simulations based on the original mutation count matrix for the 21 breast cancers (M) and their original opportunity matrix (W):

1) Simulating data with opportunities: The exposure matrix E was obtained by maximizing the likelihood of observing M, taking the expected counts to be PE°W (where ° denotes the element-wise matrix product). We then set M aside and treated P and E as the truth set, generating a new mutation count matrix (M2) by drawing each entry from a Poisson distribution with mean given by the corresponding entry of PE°W. This matrix M2 and the original W were used as input for the signeR and EMu algorithms, and the results were compared with P and E to evaluate the performance of each method.

2) Simulating data without opportunities: We did the same as above, but disregarding the opportunity matrix W. The matrix E was obtained by maximizing the likelihood of M, taking the expected counts to be PE. We then generated a new mutation count matrix using Poisson distributions with means given by PE and used it as input for signeR and the method proposed by Alexandrov et al. The results were compared with P and E to evaluate the performance of each method.
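For readers who want to reproduce this kind of simulation, the two procedures above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the code used for the paper: the reply does not say which optimizer was used to maximize the likelihood, so the sketch uses a standard multiplicative (EM-style) update for a Poisson model with P held fixed, and the matrices P, W and M below are random stand-ins rather than the COSMIC signatures or the 21 breast cancer data. The function names `fit_exposures` and `simulate_counts` are hypothetical.

```python
import numpy as np

def fit_exposures(M, P, W=None, n_iter=500, eps=1e-12):
    """Estimate E by maximizing the Poisson likelihood of M with mean (P @ E) * W.

    P is held fixed. W=None means "no opportunities" (mean is P @ E).
    Uses multiplicative EM-style updates; this solver is an assumption.
    """
    M = np.asarray(M, dtype=float)
    if W is None:
        W = np.ones_like(M)
    E = np.ones((P.shape[1], M.shape[1]))
    denom = P.T @ W                      # shape: signatures x samples
    for _ in range(n_iter):
        rate = (P @ E) * W + eps         # current expected counts
        E *= (P.T @ (W * M / rate)) / (denom + eps)
    return E

def simulate_counts(P, E, W=None, rng=None):
    """Draw a new count matrix, entry by entry, from Poisson(PE) or Poisson(PE ° W)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = P @ E
    if W is not None:
        mean = mean * W                  # element-wise product with opportunities
    return rng.poisson(mean)

# Stand-in data: 96 mutation types, 4 signatures, 21 samples (all hypothetical).
rng = np.random.default_rng(0)
n_types, n_sig, n_samples = 96, 4, 21
P = rng.dirichlet(np.ones(n_types), size=n_sig).T          # columns sum to 1
W = rng.uniform(0.5, 1.5, size=(n_types, n_samples))
M = rng.poisson((P @ rng.gamma(2.0, 500.0, size=(n_sig, n_samples))) * W)

# 1) With opportunities: fit E, then simulate M2 from PE ° W.
E = fit_exposures(M, P, W)
M2 = simulate_counts(P, E, W, rng)

# 2) Without opportunities: fit E from PE alone, then simulate from PE.
E_no_w = fit_exposures(M, P)
M_no_w = simulate_counts(P, E_no_w, rng=rng)
```

In this sketch, (P, E) is the truth set for M2 analysed together with W, and (P, E_no_w) is the truth set for M_no_w analysed without opportunities.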

I hope this makes things clear. If you have any doubts or other questions, please let me know.

Best regards, signeR team.

Rashesh7 commented 7 years ago

Hi Renan,

Thank you for the quick reply. Yes, that did help a lot. I am testing a few tools to implement it in our pipeline. SigneR is very useful in providing various outlooks into the data. I will contact you if I do have further questions.

Thank you.

Rashesh7 commented 7 years ago

Actually, I do have a few questions:

1) In your experience, how important is it to provide opportunity data for predicting the signatures?

2) For the Gastric dataset, did you generate the opportunities for each sample?

3) Did you run EMu without the opportunities?

4) As I mentioned, I am testing a few tools. I have a simulated dataset of 120 samples, with a linear combination of 3-4 signatures per sample. When running EMu, should I use the default human-genome opportunity provided by EMu?

Thank you.

rvalieris commented 7 years ago

Hi Rashesh,

In answer to your questions:

1) In our experience, considering the opportunity in signature analysis can lead to slightly different results. But the concept of opportunity, first adopted by EMu, is a pretty reasonable normalization for the mutation matrix, so we believe that results obtained taking opportunity into account are more accurate. We encourage users to provide opportunities when using signeR.

2) Yes, but those were generated from the refgene of hg38.

3) No. We believe the full potential of both signeR and EMu is achieved by taking the opportunity into account, so we compared those methods only in runs with opportunity. The simulations without opportunity were performed only to make a fair comparison between signeR and Alexandrov's method, since the latter does not take opportunity into account.

4) It is reasonable to use the default human genome opportunity while running EMu (or signeR) only if you used it when simulating your dataset (i.e., if you simulated the dataset as PE°O, where O is the default opportunity).
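To illustrate point 4, here is a rough Python sketch of simulating a dataset like the one described in the question (120 samples, 3-4 active signatures per sample), once with an opportunity matrix and once without. Everything in it is an assumption made for illustration: the signatures are random stand-ins, and the vector `o` is a hypothetical placeholder for a default genome-wide opportunity, not the actual file shipped with EMu.

```python
import numpy as np

rng = np.random.default_rng(1)
n_types, n_sig, n_samples = 96, 4, 120

# Stand-in signature matrix P (columns sum to 1); not real COSMIC signatures.
P = rng.dirichlet(np.ones(n_types), size=n_sig).T

# Hypothetical default opportunity: one value per mutation type,
# repeated for every sample.
o = rng.uniform(0.5, 1.5, size=n_types)
O = np.tile(o[:, None], (1, n_samples))

# Exposures: start with all 4 signatures active, then switch one signature
# off in roughly half of the samples, giving 3-4 active signatures per sample.
E = rng.gamma(2.0, 500.0, size=(n_sig, n_samples))
drop = rng.random(n_samples) < 0.5
E[rng.integers(n_sig, size=n_samples)[drop], np.flatnonzero(drop)] = 0.0

# Simulated as PE ° O: analyse this one together with O.
M_with = rng.poisson((P @ E) * O)

# Simulated as PE: analyse this one without opportunities.
M_without = rng.poisson(P @ E)
```

The point of the two matrices is that the opportunity passed to the tool should match the generative model: supplying O when analysing M_without, or omitting it for M_with, would mean comparing the results against a truth set that the data were not drawn from.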

Regards, signeR team

TojalLab / signeR