cancer-genomics / gemini_wflow

How to define the fixed bin set? #3

Open IamksGEEK opened 1 year ago

IamksGEEK commented 1 year ago

Hi, could you please guide me on obtaining the fixed bin set? It is a critical component for validating the GEMINI model on an independent cohort, but I couldn't find this information in either your code or your paper. While reviewing the paper associated with this code, I came across the caption of Supplementary Figure 9a, which seems to hint at what the fixed bin set used in the fixed GEMINI model means. I've quoted the relevant portion here:

Supplementary Figure 9. Genome-wide fixed bins utilized for analysis of single molecule mutation frequencies and detection of lung cancer in cfDNA. a, Percent similarity of bins identified as being enriched for mutations in lung cancer and non-cancer samples in each training fold compared to the sets of bins utilized in the fixed model that were identified from analyses of all samples.

Does this imply that the fixed bin set consists of the bins with the largest regional difference across all samples in the LUCAS cohort (n=365)? I would appreciate any assistance you can provide. Thank you for your help!

Best regards, Kongshuang

yeyup commented 12 months ago

Hi, IamksGEEK. I believe that the fixed bin sets obtained from the LUCAS cohort were generated using all training samples (n=110), without the leave-one-out validation strategy. I tried this method on my independent dataset, but it yielded poor results. I am still investigating whether other factors might be affecting the outcome.
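
To make the idea of deriving the bin sets from all training samples at once concrete, here is a minimal sketch. It is not the GEMINI implementation: the function name `select_fixed_bins`, the input layout, the use of a simple difference of group means as the ranking score, and the cutoff `k` are all assumptions for illustration. It only shows the general shape of selecting one set of bins enriched for mutations in cancer samples and one enriched in non-cancer samples, without any cross-validation folds.

```python
from statistics import mean

def select_fixed_bins(freqs, labels, k):
    """Pick bin indices using ALL samples at once (no leave-one-out folds).

    freqs  : dict mapping sample id -> list of per-bin mutation frequencies
    labels : dict mapping sample id -> 1 (cancer) or 0 (non-cancer)
    k      : number of bins to keep in each set (hypothetical parameter)
    """
    n_bins = len(next(iter(freqs.values())))
    cancer = [s for s in freqs if labels[s] == 1]
    control = [s for s in freqs if labels[s] == 0]

    # Assumed score: mean frequency in cancer minus mean frequency in non-cancer.
    diffs = []
    for b in range(n_bins):
        d = mean(freqs[s][b] for s in cancer) - mean(freqs[s][b] for s in control)
        diffs.append((d, b))

    ranked = sorted(diffs, key=lambda t: t[0], reverse=True)
    enriched_in_cancer = sorted(b for _, b in ranked[:k])
    enriched_in_control = sorted(b for _, b in ranked[-k:])
    return enriched_in_cancer, enriched_in_control

# Toy data: 2 cancer and 2 non-cancer samples, 5 bins each.
freqs = {
    "A": [5, 1, 3, 2, 2],
    "B": [7, 1, 3, 2, 4],
    "C": [1, 6, 3, 2, 2],
    "D": [1, 8, 3, 2, 2],
}
labels = {"A": 1, "B": 1, "C": 0, "D": 0}
cancer_bins, control_bins = select_fixed_bins(freqs, labels, k=1)
```

In a leave-one-out setting the same selection would be repeated inside each fold on the training samples only; the fixed variant runs it once and reuses the resulting bins for every new sample.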

IamksGEEK commented 12 months ago

Hi, yeyup. I am very glad to hear from you! I also used all the training samples to build the model, but the accuracy I obtained was very poor as well, which is disappointing :(. I also found that the number of mutations is influenced by factors such as sequencing depth, which causes the mutation counts to fluctuate significantly between samples. I have now given up analyzing mutations from a quantitative perspective and started analyzing mutation signatures (mutsignatures), which perform better on our data. I hope you get the results you want in your experiments.
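
As an illustration of the sequencing-depth effect described above, the sketch below divides per-bin mutation counts by the number of bases evaluated in each bin, so that two samples differing only in depth give the same values. The function name `mutation_frequency` and the input layout are assumptions for illustration; nothing in this thread confirms that this normalization resolves the poor performance reported here.

```python
def mutation_frequency(mutation_counts, evaluated_bases):
    """Convert raw per-bin mutation counts into per-base frequencies.

    mutation_counts : list of mutation counts, one per bin
    evaluated_bases : list of bases evaluated in each bin (depth-dependent)
    Bins with no evaluated bases get a frequency of 0.0.
    """
    return [m / e if e > 0 else 0.0
            for m, e in zip(mutation_counts, evaluated_bases)]

# Same hypothetical sample sequenced at 1x and at 2x depth:
# raw counts double, but the per-base frequency stays the same.
low_depth = mutation_frequency([10, 20], [1000, 2000])
high_depth = mutation_frequency([20, 40], [2000, 4000])
```

Raw counts from the two runs would look very different, while the frequencies agree, which is why a count-based comparison across samples with uneven depth can be misleading.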