how to subset the bed file based on individuals ID in Jupyter

Uljibuh commented 3 years ago

Hi, I am working with PLINK files using the SnpArrays.jl package. Here is what my PLINK data and DataFrame (A) look like:

```
SnpData(people: 28960, snps: 45807,
snp_info:
│ Row │ chromosome │ snpid                  │ genetic_distance │ position │ allele1 │ allele2 │
│     │ String     │ String                 │ Float64          │ Int64    │ String  │ String  │
├─────┼────────────┼────────────────────────┼──────────────────┼──────────┼─────────┼─────────┤
│ 1   │ 1          │ BovineHD0100000015     │ 0.0              │ 36337    │ G       │ A       │
│ 2   │ 1          │ Hapmap43437-BTA-101873 │ 0.0              │ 135098   │ G       │ A       │
│ 3   │ 1          │ BovineHD0100000062     │ 0.0              │ 206470   │ C       │ T       │
│ 4   │ 1          │ ARS-BFGL-NGS-16466     │ 0.0              │ 267940   │ T       │ C       │
│ 5   │ 1          │ BTA-34880              │ 0.0              │ 347418   │ T       │ C       │
│ 6   │ 1          │ BovineHD0100000096     │ 0.0              │ 348331   │ C       │ A       │
…,
person_info:
│ Row │ fid      │ iid       │ father    │ mother    │ sex      │ phenotype │
│     │ Abstrac… │ Abstract… │ Abstract… │ Abstract… │ Abstrac… │ Abstract… │
├─────┼──────────┼───────────┼───────────┼───────────┼──────────┼───────────┤
│ 1   │ 0        │ 409859435 │ 400005850 │ 411102034 │ 2        │ -9        │
│ 2   │ 0        │ 409922125 │ 400005850 │ 411657369 │ 2        │ -9        │
│ 3   │ 0        │ 411075330 │ 400005356 │ 407723032 │ 2        │ -9        │
│ 4   │ 0        │ 412057132 │ 400005972 │ 410308103 │ 2        │ -9        │
│ 5   │ 0        │ 404693736 │ 400003797 │ 404050484 │ 2        │ -9        │
│ 6   │ 0        │ 404880845 │ 400004013 │ 403616839 │ 2        │ -9        │
…,
srcbed: C:\Users\wubu.julia\packages\SnpArrays\CL3iQ\src…\data\genotype99.bed
srcbim: C:\Users\wubu.julia\packages\SnpArrays\CL3iQ\src…\data\genotype99.bim
srcfam: C:\Users\wubu.julia\packages\SnpArrays\CL3iQ\src…\data\genotype99.fam
)
```

DataFrame (A):

```
     iid
     Int64?
1    409202388
2    412463440
3    412444675
4    412402431
5    410364148
6    410578480
7    410039003
8    412342024
9    408654254
10   412703861
11   409542275
12   408954632
13   412670401
14   410573099
15   410540339
16   412396052
17   412331677
18   412070775
19   412325434
20   412544272
21   412438935
22   411558820
23   412726302
24   409063372
25   412494701
26   411869659
27   410179751
28   409018969
29   412706174
30   409633183
⋮    ⋮
```

Now I want to exclude the individuals in A from the PLINK files (cowdata) based on their IDs and save the result as a subset bed file. How can I do this? Thanks.

kose-y commented 3 years ago

The function SnpArrays.filter has the keyword argument f_person for exactly this purpose. Assuming cowdata is the variable holding your SnpData, you can do

SnpArrays.filter(cowdata; des="cowdata_in_A", f_person = x -> parse(Int, x[:iid]) in A.iid)

for selecting iids in the DataFrame A, and

SnpArrays.filter(cowdata; des="cowdata_not_in_A", f_person = x -> !(parse(Int, x[:iid]) in A.iid))

for selecting iids not in A.
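Since the iid column of A is shown as `Int64?`, it may contain `missing` values, and in Julia a membership test against a vector containing `missing` can return `missing` rather than `false`. The sketch below, in plain Julia without SnpArrays, shows one way to write the same predicates against a `Set` of the non-missing IDs. The names `A_iid`, `people`, `in_A` and `not_in_A` are hypothetical stand-ins for illustration, not part of the package.

```julia
# Stand-in for A.iid: a nullable integer column like the Int64? one shown above.
A_iid = Union{Missing, Int64}[409202388, 412463440, missing, 412444675]

# Collect the non-missing IDs into a Set so each lookup is a hash lookup
# and never returns `missing`.
ids_in_A = Set(skipmissing(A_iid))

# Hypothetical stand-ins for rows of cowdata.person_info; iid is a String there.
people = [
    (fid = "0", iid = "409859435"),
    (fid = "0", iid = "409202388"),
    (fid = "0", iid = "412444675"),
]

# Predicates with the same shape as the f_person arguments above.
in_A(row) = parse(Int, row[:iid]) in ids_in_A
not_in_A(row) = !in_A(row)

# Apply the predicate by hand to see which individuals would be kept.
kept = [p.iid for p in people if not_in_A(p)]
println(kept)
```

With SnpArrays loaded, such a predicate would be passed as `f_person = not_in_A` in the call shown above. A `Set` also avoids rescanning the whole ID column for each of the 28960 individuals.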

kose-y commented 3 years ago

parse is used because A.iid is Int64, while the iid and fid fields in fam files are read as Strings.
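A small illustration of the type mismatch, using an ID from the person_info table above. The variable names are made up for the example, and the `tryparse` line is only a suggestion for fam files whose IDs might not all be numeric.

```julia
iid_from_fam = "409859435"   # IDs read from a fam file are Strings
iid_in_A = 409859435         # IDs in A.iid are integers

# A String never compares equal to an Int, so an unparsed ID matches nothing.
println(iid_from_fam == iid_in_A)               # false

# Parsing first makes the two sides comparable.
println(parse(Int, iid_from_fam) == iid_in_A)   # true

# tryparse returns nothing instead of throwing when an ID is not numeric.
println(tryparse(Int, "cow_17"))                # nothing
```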

Uljibuh commented 3 years ago

Thank you, it worked :)

OpenMendel / SnpArrays.jl

how to subset the bed file based on individuals ID in Jupyter #85