OpenMendel / SnpArrays.jl

Compressed storage for SNP data
https://openmendel.github.io/SnpArrays.jl/latest

SNP Simulation Method #126

Closed BrendonChau closed 6 months ago

BrendonChau commented 1 year ago

Implemented fast sampling of SNPs directly in the compressed format used by SnpArrays. It uses minimal allocations and should be useful for downstream simulations. Sampling is fast: on my machine, it takes around 90 seconds to simulate 100k SNPs for 500k subjects. Multithreading should be possible, but it's already fast enough for most use cases.
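
For readers unfamiliar with the compressed format, here is a minimal sketch of the idea in Python (for illustration only; the PR itself is Julia, and this is not its code). It assumes the PLINK BED 2-bit genotype codes, Hardy-Weinberg proportions, and that A2 is the minor allele; `simulate_packed_snp` and `get_code` are hypothetical names. Each genotype is written straight into a packed byte buffer, so no intermediate integer array is allocated.

```python
import random

# PLINK BED 2-bit genotype codes:
# 0b00 = homozygous A1, 0b01 = missing, 0b10 = heterozygous, 0b11 = homozygous A2
HOM_A1, MISSING, HET, HOM_A2 = 0b00, 0b01, 0b10, 0b11


def simulate_packed_snp(n_subjects, maf, rng):
    """Hypothetical helper: draw one SNP under Hardy-Weinberg proportions
    (assumption: A2 is the minor allele with frequency `maf`) and store the
    2-bit codes directly in a packed bytearray, 4 genotypes per byte."""
    packed = bytearray((n_subjects + 3) // 4)
    p_hom_a1 = (1.0 - maf) ** 2       # P(zero copies of A2)
    p_at_most_het = 1.0 - maf ** 2    # P(zero or one copy of A2)
    for i in range(n_subjects):
        u = rng.random()
        if u < p_hom_a1:
            code = HOM_A1
        elif u < p_at_most_het:
            code = HET
        else:
            code = HOM_A2
        # Subject i lives in byte i // 4; the first subject uses the low bits.
        packed[i >> 2] |= code << (2 * (i & 3))
    return packed


def get_code(packed, i):
    """Read back the 2-bit code of subject i."""
    return (packed[i >> 2] >> (2 * (i & 3))) & 0b11


rng = random.Random(2024)
snp = simulate_packed_snp(10_001, 0.3, rng)
```

At 2 bits per genotype, 10,001 subjects fit in 2,501 bytes, which is why sampling in place keeps memory use low.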

kose-y commented 1 year ago

Thanks, looks nice! Could you please add some unit tests?

codecov-commenter commented 1 year ago

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (918d294) 85.78% compared to head (706c5ef) 86.47%. Report is 4 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| src/simulation.jl | 91.26% | 20 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #126      +/-   ##
==========================================
+ Coverage   85.78%   86.47%   +0.68%
==========================================
  Files          15       16       +1
  Lines        1597     1826     +229
==========================================
+ Hits         1370     1579     +209
- Misses        227      247      +20
```


BrendonChau commented 12 months ago

I've also implemented a method for simulating missing SNP data and added some basic unit tests that check the simulated proportions are correct.
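
A proportion check of this kind could look roughly like the Python sketch below. It is an illustration, not the PR's tests: it assumes genotypes go missing independently at a fixed rate and that missing values use the PLINK BED code `0b01`; `mask_missing` and `missing_fraction` are hypothetical names.

```python
import random

MISSING = 0b01  # PLINK BED 2-bit code for a missing genotype


def mask_missing(packed, n_subjects, missing_rate, rng):
    """Hypothetical helper: overwrite each genotype in the packed buffer with
    the missing code, independently with probability `missing_rate`."""
    for i in range(n_subjects):
        if rng.random() < missing_rate:
            shift = 2 * (i & 3)
            keep = ~(0b11 << shift) & 0xFF   # clears this subject's 2 bits
            packed[i >> 2] = (packed[i >> 2] & keep) | (MISSING << shift)
    return packed


def missing_fraction(packed, n_subjects):
    """Fraction of subjects whose stored code is the missing code."""
    n_missing = sum(
        ((packed[i >> 2] >> (2 * (i & 3))) & 0b11) == MISSING
        for i in range(n_subjects)
    )
    return n_missing / n_subjects


n = 20_000
rng = random.Random(7)
packed = bytearray((n + 3) // 4)   # all zeros: every genotype homozygous A1
mask_missing(packed, n, 0.1, rng)
observed = missing_fraction(packed, n)
```

A unit test would then assert that `observed` is within a sampling-error tolerance of the requested rate.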

kose-y commented 12 months ago

Oh, one more thing. Could you please add docs for the new functions? The sources are Jupyter notebooks in the docs directory.

BrendonChau commented 7 months ago

I've implemented a way to simulate SNPs assuming an AR(1) LD structure with constant ρ. To my knowledge, there is no existing scalable method for doing this with tens of thousands of SNPs. On my machine, simulating 100_000 SNPs for 500_000 subjects takes a little over 4 minutes and can be done entirely in memory. Being able to simulate SNPs with linkage disequilibrium is important for downstream methods development.
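
To illustrate what an AR(1) LD structure means, here is one possible construction in Python. It is an assumption for illustration and not necessarily the method used in this PR: each haplotype follows a latent Gaussian AR(1) chain with correlation ρ between adjacent SNPs, and the chain is thresholded at each SNP's minor allele frequency. The function name `simulate_ar1_genotypes` is hypothetical. Note that thresholding attenuates the correlation, so the correlation between the resulting genotypes is smaller than ρ.

```python
import random
from statistics import NormalDist


def simulate_ar1_genotypes(n_subjects, mafs, rho, rng):
    """Hypothetical sketch: minor-allele counts (0, 1, 2) whose LD decays with
    distance, built from two independent latent AR(1) haplotypes per subject."""
    # A standard normal falls below inv_cdf(maf) with probability maf.
    thresholds = [NormalDist().inv_cdf(m) for m in mafs]
    innov_sd = (1.0 - rho * rho) ** 0.5   # keeps every latent value at unit variance
    genotypes = [[0] * len(mafs) for _ in range(n_subjects)]
    for row in genotypes:
        for _ in range(2):                # two haplotypes per subject
            z = rng.gauss(0.0, 1.0)
            for j, t in enumerate(thresholds):
                if j > 0:
                    # AR(1) step: latent correlation rho with the previous SNP
                    z = rho * z + innov_sd * rng.gauss(0.0, 1.0)
                if z < t:
                    row[j] += 1           # this haplotype carries the minor allele
    return genotypes


rng = random.Random(11)
geno = simulate_ar1_genotypes(4_000, [0.3, 0.3, 0.3], 0.8, rng)
```

Only the previous latent value is carried along each chain, so the cost is linear in the number of SNPs and no SNP-by-SNP correlation matrix is ever formed.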

I also updated the docs to include examples, but there are some problems with the sections related to ADMIXTURE. Could someone take a look?

kose-y commented 6 months ago

Hi @BrendonChau, I will review it next week.