PalamaraLab / PrepareDecoding

Tool to compute decoding quantities
GNU General Public License v3.0

Understanding CSFS Samples and Discretization in ASMC's Prepare Decoding Tool #13

Open 7JVST opened 8 months ago

7JVST commented 8 months ago

Hi there,

I'm using the compiled C++ version of Prepare Decoding to create decoding quantities files for fastSMC, with a focus on analyzing IBD segments. I have the demo files from ASMC_data and frequency files made from my own dataset, which includes 1600 samples and around 500,000 variants. For the discretization, I used the disc file included in the package, "input30-100-2000.disc".

When I tried setting 'CSFSsamples=1600' to match the sample count, I ran into a memory issue causing a core dump. However, lowering 'CSFSsamples' to 300 fixed the problem.

I'm curious about what the 'CSFS samples' count actually means. Does it need to match the sample count in the frequency file, or in the '.haps', '.samples', and '.map' files that will be used in the fastSMC analysis later (n = 1600)? Also, is there a maximum value for 'CSFS samples'?

Additionally, I'd like to know how to define my own number of quantiles for discretization in the C++ version. I noticed that the Python version allows users to define the discretization like this: discretization=[[30.0, 15], [100.0, 15], 39]. Can you tell me how to do this in the C++ version?

fcooper8472 commented 8 months ago

Hi,

The CSFS samples parameter is used in the model to compute some probabilities in the HMM and there isn't much of a benefit to setting it to more than 300. Larger values also lead to higher computational costs, so best to just set it to 300.

Re: discretization, that syntax is only available through the preparedecoding Python tool: https://pypi.org/project/asmc-preparedecoding/

See example here: https://github.com/PalamaraLab/ASMC_dev/blob/main/notebooks/asmc_w_decodingquant.ipynb

There currently isn't a C++ implementation available.
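
For readers trying to see what a spec such as [[30.0, 15], [100.0, 15], 39] describes, here is a minimal standalone sketch in plain Python. It does not call the preparedecoding package. It assumes that each inner pair means [step size, number of steps], that the boundaries start at 0, and that the trailing integer is the number of additional quantile-based points, which are not computed here. The helper name expand_fixed_steps is hypothetical, and the tool's actual behaviour should be checked against its documentation.

```python
def expand_fixed_steps(spec):
    """Expand the fixed-step part of a discretization spec.

    ASSUMPTION: each inner pair is [step_size, number_of_steps], boundaries
    start at 0.0, and the trailing integer is the number of additional
    quantile-based points. This is an illustrative reading of the syntax,
    not the preparedecoding implementation.
    """
    *pairs, n_additional = spec
    boundaries = [0.0]
    for step, count in pairs:
        for _ in range(int(count)):
            # Each new boundary is the previous one plus the current step.
            boundaries.append(boundaries[-1] + float(step))
    return boundaries, int(n_additional)


boundaries, n_additional = expand_fixed_steps([[30.0, 15], [100.0, 15], 39])
print(boundaries[:4])      # first few fixed-step boundaries
print(boundaries[-1])      # last fixed-step boundary
print(n_additional)        # points left to be placed by quantiles
```

Under these assumptions, the fixed-step part ends at 15 * 30 + 15 * 100 = 1950, and the remaining 39 points would be placed by the tool itself.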