chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.
https://www.cog-genomics.org/plink/2.0/

Feature request: check for presence of e.g. --keep-fam file before converting to temporary psam/pvar/pgen #221

Closed sahwa closed 2 years ago

sahwa commented 2 years ago

When running the command:

plink2 \
        --bgen /well/ukbb-wtchg/v3/imputation/ukb_imp_chr${chr}_v3.bgen ref-unknown \
        --sample /well/ckb/shared/ukb_dataset_210403/ukb22828_c1_b0_v3_s487276.sample \
        --keep-fam ukb_eur_inds.txt \
        --remove-fam not_used_in_PCs_calc_UKB.txt \
        --freq \
        --out ukb_imp_chr${chr}_v3.unrelated

returns:

PLINK v2.00a3.1LM AVX2 Intel (19 May 2022)     www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ukb_imp_chr1_v3.unrelated.log.
Options in effect:
  --bgen /well/ukbb-wtchg/v3/imputation/ukb_imp_chr1_v3.bgen ref-unknown
  --freq
  --keep-fam ukb_eur_inds.txt
  --out ukb_imp_chr1_v3.unrelated
  --remove-fam not_used_in_PCs_calc_UKB.txt
  --sample /well/ckb/shared/ukb_dataset_210403/ukb22828_c1_b0_v3_s487276.sample

Start time: Wed Aug 24 14:41:09 2022
773704 MiB RAM detected; reserving 386852 MiB for main workspace.
Using up to 48 threads (change this with --threads).
--bgen: 7402791 variants detected, format v1.2.
487409 samples imported from .sample file to
ukb_imp_chr1_v3.unrelated-temporary.psam .
--bgen: ukb_imp_chr1_v3.unrelated-temporary.pgen +
ukb_imp_chr1_v3.unrelated-temporary.pvar written.
487409 samples (264290 females, 222986 males, 133 ambiguous; 487409 founders)
loaded from ukb_imp_chr1_v3.unrelated-temporary.psam.
7402791 variants loaded from ukb_imp_chr1_v3.unrelated-temporary.pvar.
Note: No phenotype data present.
Error: Failed to open ukb_eur_inds.txt : No such file or directory.
End time: Wed Aug 24 15:10:41 2022

plink2 converts the .bgen file to temporary .psam/.pvar/.pgen files first, and only then checks whether the --keep-fam file exists. For large datasets like the UKB imputed dataset, the conversion takes a while (~20 min), so the run fails only after all of that work is done.

I know it's my own silly fault for not checking first, but is it possible to change the order of precedence so that the faster operations like checking whether a file exists happen first and the program fails earlier?

Thanks.
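
One way to get the fail-fast behaviour without any change to plink2 is a small pre-flight check in the calling script. The sketch below is only an illustration: `require_files` is a hypothetical helper written for this example, not a plink2 feature, and the file names are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of plink2): succeeds only if every named
# file exists and is readable; reports each unreadable file on stderr.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "Error: cannot read $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage: start the slow import only if both filter files are present, e.g.
#   require_files ukb_eur_inds.txt not_used_in_PCs_calc_UKB.txt \
#       && plink2 --bgen ... (rest of the original command)
```

Because the check runs before plink2 starts, a missing --keep-fam or --remove-fam file is reported in well under a second instead of after the conversion.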

chrchang commented 2 years ago

No, this would be too disruptive for others. Also, if you care even a little bit about speed, you should convert to .pgen first and then perform other operations on the .pgen; if you do that, --keep-fam will fail quickly enough.
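
The two-step workflow suggested above might look roughly like the following sketch. The flags (`--make-pgen`, `--pfile`) are standard plink2 options; the function names, the `PLINK2` variable, the default `chr=1`, and the intermediate prefix `ukb_imp_chr${chr}_v3` are assumptions made for this example.

```shell
#!/bin/sh
# Assumptions for illustration: plink2 is on PATH, and chr is normally set
# by the calling loop (defaulting to 1 here).
PLINK2=${PLINK2:-plink2}
chr=${chr:-1}

# Step 1 (slow, done once per chromosome): import the .bgen into a
# permanent .pgen/.pvar/.psam fileset.
import_bgen() {
    "$PLINK2" \
        --bgen "/well/ukbb-wtchg/v3/imputation/ukb_imp_chr${chr}_v3.bgen" ref-unknown \
        --sample /well/ckb/shared/ukb_dataset_210403/ukb22828_c1_b0_v3_s487276.sample \
        --make-pgen \
        --out "ukb_imp_chr${chr}_v3"
}

# Step 2 (fast, repeatable): filter and compute frequencies from the .pgen.
# A missing --keep-fam file now fails shortly after the fileset is loaded.
freq_unrelated() {
    "$PLINK2" \
        --pfile "ukb_imp_chr${chr}_v3" \
        --keep-fam ukb_eur_inds.txt \
        --remove-fam not_used_in_PCs_calc_UKB.txt \
        --freq \
        --out "ukb_imp_chr${chr}_v3.unrelated"
}

# Typical use: import_bgen && freq_unrelated
```

With this split, the expensive conversion is paid once, and any later mistake in a filter file name costs only the time needed to reload the .psam/.pvar files.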