Closed jaamarks closed 2 weeks ago
--snps-only
Comparing results of the GwasQcPipeline with and without the --snps-only
flag for the IBD calculation. We looked at the memory usage when running the pipeline on an example dataset with N=500 and another with N=90K (roughly).
--snps-only
JobID MaxRSS Elapsed
------------ ---------- ----------
1797915.bat+ 3760K 00:00:15
36407 variants loaded from .bim file.
36407 variants and 451 people pass filters and QC.
Excluding 1050 variants on non-autosomes from IBD calculation.
--snps-only
JobID MaxRSS Elapsed
------------ ---------- ----------
1844898.bat+ 3464K 00:00:16
36407 variants loaded from .bim file.
36407 variants and 451 people pass filters and QC.
Excluding 1050 variants on non-autosomes from IBD calculation.
Note: MaxRSS can vary slightly between runs with the same data due to a variety of factors. These small variations are normal.
--snps-only
JobID MaxRSS Elapsed
------------ ---------- ----------
1765211.bat+ 97326996K 05:54:05
37706 variants loaded from .bim file.
37706 variants and 90989 people pass filters and QC.
Excluding 1055 variants on non-autosomes from IBD calculation.
--snps-only
JobID MaxRSS Elapsed
------------ ---------- ----------
1844962.bat+ 97336296K 05:52:49
37709 variants loaded from .bim file.
37709 variants and 90988 people pass filters and QC.
Excluding 1073 variants on non-autosomes from IBD calculation.
@jaamarks
:bulb: I just got curious so tested out a few things. If you mask all IDs in the bim file with a serial numeric IDs 1..n variants, it possibly could reduce the memory usage by ~50%.
Let samples.bim.default
be the standard version containing all IDs and indels:
1 rs9651229 0 632287 T C 1 rs9701872 0 632828 C T 1 rs11497407 0 633147 A G
And samples.bim
a pseudo version where IDs are replaced with simple numeric numbers:
1 1 0 632287 T C 1 2 0 632828 C T 1 3 0 633147 A G
The bed and fam file is unmodified constant.
Here are some test results with following command:
plink --bed samples.bed --bim <bim file> --fam samples.fam --genome full --min 0.12 --max 1 --threads 8 --out samples.genome
With samples.bim.default
, you get 77580maxresident k
With samples.bim
, you get 34756maxresident K
The output file samples.genome.genome is identical by md5sum regardless of the ID format.
The --snps-only
flag increases the memory usage if anything in my observation.
If you get a chance, take a look at this approach and see if it helps reduces the memory usage. Thanks
I experimented with this approach and found just a small change in memory usage, though not enough to justify making any changes to the code base.
I used a 90k sample as my test case. I ran the job with rsID and without rsID (substituted with serial integers starting at 1).
97113172K
and a time of 03:12:16
.97103408K
and a time of 03:12:03
So only a 9,764Kb
difference in max memory usage and no appreciable change in compute time.
Thanks @jaamarks for giving it a shot.
The IBD step, particularly the
sample_level_ibd
rule, consumes a lot of memory. @shukwong found a discussion in the PLINK user group that highlights a key issue:Given this info, we would like to experiment with removing the indels in the
sample_level_ibd
rule and assess its impact on memory consumption.Current IBD calculation code:
proposed code changes: Add --snps-only flag:
How to compare memory usage
To evalulate the effect of the
--snps-only
flag on memory consumption, we can record the jobid for thesample_level_ibd
rule with and without the flag. Then run the following command to compare the max memory usage:sacct -j <job_id> --format=JobID,MaxRSS