chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.
https://www.cog-genomics.org/plink/2.0/

`--keep` does not handle IID-only files properly when the main dataset contains FIDs #275

Closed by dvg-p4 4 months ago

dvg-p4 commented 4 months ago

Conditions

With a main dataset that contains FID information, run `--keep ids.txt`, where `ids.txt` contains a single column of individual IDs, optionally with a `#IID` header.

Expected behavior

The dataset will be filtered to only those IIDs, without regard to FID.

> If the first line starts with '#FID' or '#IID', it will be treated as a header line. As long as the first columns are "#FID IID", "#FID IID SID", "#IID", or "#IID SID", PLINK 2 will do the right thing.

Observed behavior

No samples are matched, and plink2 errors out with "Error: No samples remaining after main filters."

Full reprex

plink --dummy 10 10 --out input
# [...]
# Dummy data (10 people, 10 SNPs) written to input.bed + input.bim + input.fam .

head -n3 input.fam
# per0 per0 0 0 2 1
# per1 per1 0 0 2 2
# per2 per2 0 0 2 1

echo $'#IID\nper0\nper2\nper4' > ids.txt
cat ids.txt
# #IID
# per0
# per2
# per4

plink2 --bfile input --keep ids.txt --make-pgen --out output
# PLINK v2.00a5.12LM AVX2 Intel (25 Jun 2024)    www.cog-genomics.org/plink/2.0/
# (C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
# Logging to output.log.
# Options in effect:
#   --bfile input
#   --keep ids.txt
#   --make-pgen
#   --out output
# 
# Start time: Thu Jul 18 21:04:31 2024
# 380297 MiB RAM detected, ~273638 available; reserving 190148 MiB for main
# workspace.
# Using up to 96 threads (change this with --threads).
# 10 samples (10 females, 0 males; 10 founders) loaded from input.fam.
# 10 variants loaded from input.bim.
# 1 binary phenotype loaded (5 cases, 5 controls).
# --keep: 0 samples remaining.
# Error: No samples remaining after main filters.
# End time: Thu Jul 18 21:04:31 2024

dvg-p4 commented 4 months ago

Same issue if the input is a plink2 fileset:

$ plink2 --bfile input --make-pgen --out p2_input
[...]
Writing p2_input.psam ... done.
Writing p2_input.pvar ... done.
Writing p2_input.pgen ... done.
End time: Thu Jul 18 21:06:36 2024

$ plink2 --pfile p2_input --keep ids.txt --make-pgen --out output
PLINK v2.00a5.12LM AVX2 Intel (25 Jun 2024)    www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to output.log.
Options in effect:
  --keep ids.txt
  --make-pgen
  --out output
  --pfile p2_input

Start time: Thu Jul 18 21:07:04 2024
380297 MiB RAM detected, ~273639 available; reserving 190148 MiB for main
workspace.
Using up to 96 threads (change this with --threads).
10 samples (10 females, 0 males; 10 founders) loaded from p2_input.psam.
10 variants loaded from p2_input.pvar.
1 binary phenotype loaded (5 cases, 5 controls).
--keep: 0 samples remaining.
Error: No samples remaining after main filters.
End time: Thu Jul 18 21:07:04 2024

dvg-p4 commented 4 months ago

...it DOES work if there is no FID column in the main dataset input, though:

$ cp p2_input.pvar no_FID.pvar
$ cp p2_input.pgen no_FID.pgen
$ awk 'BEGIN {FS = "\t"; OFS = "\t"; printf "#"}; {print $2, $3, $4}' p2_input.psam > no_FID.psam
$ head -n3 no_FID.psam
#IID    SEX PHENO1
per0    2   1
per1    2   2

$ plink2 --pfile no_FID --keep ids.txt --make-pgen --out output
[...]
--keep: 3 samples remaining.
3 samples (3 females, 0 males; 3 founders) remaining after main filters.
1 case and 2 controls remaining after main filters.
Writing output.psam ... done.
Writing output.pvar ... done.
Writing output.pgen ... done.

chrchang commented 4 months ago

https://www.cog-genomics.org/plink/2.0/input#sample_id_convert

IID-only means FID is treated as 0.
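
Given that rule, the IID-only entries in ids.txt are read as FID 0 plus the IID, so they cannot match samples whose FID is `per0`, `per2`, and so on. One possible workaround, sketched below, is to look up the FID for each wanted IID in the .fam file and write a keep file with a `#FID IID` header, which is one of the header forms quoted earlier in this thread. The function name `iid_to_fid_iid` and the output file name `ids_fid.txt` are hypothetical, not part of PLINK, and the sketch assumes a whitespace-delimited .fam with FID in column 1 and IID in column 2.

```shell
# Hypothetical helper (not part of PLINK): expand a one-column IID list into
# "#FID IID" rows by looking each IID up in a .fam file.
iid_to_fid_iid() {
  ids_file=$1
  fam_file=$2
  # Header form taken from the documentation excerpt quoted in this thread.
  printf '#FID\tIID\n'
  awk -v OFS='\t' '
    # First file (ID list): remember each IID, skipping header and blank lines.
    NR == FNR { if (NF && $1 !~ /^#/) want[$1] = 1; next }
    # Second file (.fam): column 1 is FID, column 2 is IID.
    ($2 in want) { print $1, $2 }
  ' "$ids_file" "$fam_file"
}

# Intended usage with the files from the reprex (left commented out here):
#   iid_to_fid_iid ids.txt input.fam > ids_fid.txt
#   plink2 --bfile input --keep ids_fid.txt --make-pgen --out output
```

If the same IID occurs under several FIDs, this keeps every matching sample, which corresponds to the "without regard to FID" behavior the issue originally expected.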