NCI-CGR / GwasQcPipeline

The CGR GWAS QC processing workflow.
https://nci-cgr.github.io/GwasQcPipeline/
Other
0 stars 3 forks source link

Optimize Memory Usage in IBD Calculation by Implementing `--snps-only` Flag #347

Closed jaamarks closed 2 weeks ago

jaamarks commented 1 month ago

The IBD step, particularly the sample_level_ibd rule, consumes a lot of memory. @shukwong found a discussion in the PLINK user group that highlights a key issue:

plink 1.9's main memory-consumption Achilles heel is long variant IDs: the data structure it uses to store variant IDs is essentially a matrix with <# of variants> on one side, and <length of longest variant ID + 1> on the other.

Given this info, we would like to experiment with removing the indels in the sample_level_ibd rule and assess its impact on memory consumption.

Current IBD calculation code:

    shell:
        "sleep 10 && plink "
        "--bed {input.bed} "
        "--bim {input.bim} "
        "--fam {input.fam} "
        "--genome full "
        "--min {params.ibd_min} "
        "--max {params.ibd_max} "
        "--threads {threads} "
        "--memory {resources.mem_mb} "
        "--out {params.out_prefix}"

proposed code changes: Add --snps-only flag:

    shell:
        "sleep 10 && plink "
        "--bed {input.bed} "
        "--bim {input.bim} "
        "--fam {input.fam} "
        "--genome full "
        "--min {params.ibd_min} "
        "--max {params.ibd_max} "
        "--threads {threads} "
        "--memory {resources.mem_mb} "
        "--snps-only"
        "--out {params.out_prefix}"

How to compare memory usage

To evalulate the effect of the --snps-only flag on memory consumption, we can record the jobid for the sample_level_ibd rule with and without the flag. Then run the following command to compare the max memory usage: sacct -j <job_id> --format=JobID,MaxRSS

jaamarks commented 1 month ago

Evaluating --snps-only

Comparing results of the GwasQcPipeline with and without the --snps-only flag for the IBD calculation. We looked at the memory usage when running the pipeline on an example dataset with N=500 and another with N=90K (roughly).


N=500 sample

Without --snps-only

JobID            MaxRSS    Elapsed
------------ ---------- ----------
1797915.bat+      3760K   00:00:15
36407 variants loaded from .bim file.
36407 variants and 451 people pass filters and QC.
Excluding 1050 variants on non-autosomes from IBD calculation.

With --snps-only

JobID            MaxRSS    Elapsed
------------ ---------- ----------
1844898.bat+      3464K   00:00:16
36407 variants loaded from .bim file.
36407 variants and 451 people pass filters and QC.
Excluding 1050 variants on non-autosomes from IBD calculation.

Note: MaxRSS can vary slightly between runs with the same data due to a variety of factors. These small variations are normal.




N=90k sample

Without --snps-only

JobID            MaxRSS    Elapsed
------------ ---------- ----------
1765211.bat+  97326996K   05:54:05
37706 variants loaded from .bim file.
37706 variants and 90989 people pass filters and QC.
Excluding 1055 variants on non-autosomes from IBD calculation.

With --snps-only

JobID            MaxRSS    Elapsed
------------ ---------- ----------
1844962.bat+  97336296K   05:52:49
37709 variants loaded from .bim file.
37709 variants and 90988 people pass filters and QC.
Excluding 1073 variants on non-autosomes from IBD calculation.
rajwanir commented 1 month ago

@jaamarks

:bulb: I just got curious so tested out a few things. If you mask all IDs in the bim file with a serial numeric IDs 1..n variants, it possibly could reduce the memory usage by ~50%.

Let samples.bim.default be the standard version containing all IDs and indels:

1 rs9651229 0 632287 T C 1 rs9701872 0 632828 C T 1 rs11497407 0 633147 A G

And samples.bim a pseudo version where IDs are replaced with simple numeric numbers:

1 1 0 632287 T C 1 2 0 632828 C T 1 3 0 633147 A G

The bed and fam file is unmodified constant.

Here are some test results with following command: plink --bed samples.bed --bim <bim file> --fam samples.fam --genome full --min 0.12 --max 1 --threads 8 --out samples.genome

With samples.bim.default , you get 77580maxresident k

With samples.bim , you get 34756maxresident K

The output file samples.genome.genome is identical by md5sum regardless of the ID format.

The --snps-only flag increases the memory usage if anything in my observation.

If you get a chance, take a look at this approach and see if it helps reduces the memory usage. Thanks

jaamarks commented 1 month ago

I experimented with this approach and found just a small change in memory usage, though not enough to justify making any changes to the code base.

I used a 90k sample as my test case. I ran the job with rsID and without rsID (substituted with serial integers starting at 1).

So only a 9,764Kb difference in max memory usage and no appreciable change in compute time.

rajwanir commented 1 month ago

Thanks @jaamarks for giving it a shot.