OpenMendel / SnpArrays.jl

Compressed storage for SNP data
https://openmendel.github.io/SnpArrays.jl/latest

how to subset the bed file based on individuals ID in Jupyter #85

Closed Uljibuh closed 3 years ago

Uljibuh commented 3 years ago

Hi, I am working with a PLINK file using the SnpArrays.jl package. Here is what my PLINK data and DataFrame (A) look like:

`SnpData(people: 28960, snps: 45807,
snp_info:
│ Row │ chromosome │ snpid                  │ genetic_distance │ position │ allele1 │ allele2 │
│     │ String     │ String                 │ Float64          │ Int64    │ String  │ String  │
├─────┼────────────┼────────────────────────┼──────────────────┼──────────┼─────────┼─────────┤
│ 1   │ 1          │ BovineHD0100000015     │ 0.0              │ 36337    │ G       │ A       │
│ 2   │ 1          │ Hapmap43437-BTA-101873 │ 0.0              │ 135098   │ G       │ A       │
│ 3   │ 1          │ BovineHD0100000062     │ 0.0              │ 206470   │ C       │ T       │
│ 4   │ 1          │ ARS-BFGL-NGS-16466     │ 0.0              │ 267940   │ T       │ C       │
│ 5   │ 1          │ BTA-34880              │ 0.0              │ 347418   │ T       │ C       │
│ 6   │ 1          │ BovineHD0100000096     │ 0.0              │ 348331   │ C       │ A       │
…,
person_info:
│ Row │ fid      │ iid       │ father    │ mother    │ sex      │ phenotype │
│     │ Abstrac… │ Abstract… │ Abstract… │ Abstract… │ Abstrac… │ Abstract… │
├─────┼──────────┼───────────┼───────────┼───────────┼──────────┼───────────┤
│ 1   │ 0        │ 409859435 │ 400005850 │ 411102034 │ 2        │ -9        │
│ 2   │ 0        │ 409922125 │ 400005850 │ 411657369 │ 2        │ -9        │
│ 3   │ 0        │ 411075330 │ 400005356 │ 407723032 │ 2        │ -9        │
│ 4   │ 0        │ 412057132 │ 400005972 │ 410308103 │ 2        │ -9        │
│ 5   │ 0        │ 404693736 │ 400003797 │ 404050484 │ 2        │ -9        │
│ 6   │ 0        │ 404880845 │ 400004013 │ 403616839 │ 2        │ -9        │
…,
srcbed: C:\Users\wubu.julia\packages\SnpArrays\CL3iQ\src…\data\genotype99.bed
srcbim: C:\Users\wubu.julia\packages\SnpArrays\CL3iQ\src…\data\genotype99.bim
srcfam: C:\Users\wubu.julia\packages\SnpArrays\CL3iQ\src…\data\genotype99.fam
)`

dataframe (A) `

  iid
  Int64?
1 409202388
2 412463440
3 412444675
4 412402431
5 410364148
6 410578480
7 410039003
8 412342024
9 408654254
10 412703861
11 409542275
12 408954632
13 412670401
14 410573099
15 410540339
16 412396052
17 412331677
18 412070775
19 412325434
20 412544272
21 412438935
22 411558820
23 412726302
24 409063372
25 412494701
26 411869659
27 410179751
28 409018969
29 412706174
30 409633183

`

Now I want to exclude the individuals in (A) from the PLINK files (cowdata) based on their IDs and save the result as a subset bed file. How can I do it? Thanks.

kose-y commented 3 years ago

We have the keyword argument f_person in the function SnpArrays.filter for exactly this purpose. Assuming cowdata is the variable for SnpData, you can do

`SnpArrays.filter(cowdata; des="cowdata_in_A", f_person = x -> parse(Int, x[:iid]) in A.iid)`

for selecting iids in the DataFrame A, and

`SnpArrays.filter(cowdata; des="cowdata_not_in_A", f_person = x -> !(parse(Int, x[:iid]) in A.iid))`

for selecting iids not in A.

kose-y commented 3 years ago

parse is used because A.iid holds Int64 values, while iid and fid are read from fam files as Strings, so the two cannot be compared directly.
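To make the type mismatch concrete, here is a minimal sketch in plain Julia that does not touch SnpArrays at all. The names `a_iid`, `fam_rows`, `exclude_ids`, and `keep_person` are hypothetical stand-ins, and the values are a small sample, not the real data. Since the column in A is shown as `Int64?`, the sketch also assumes it may contain `missing` and drops those values before the lookup.

```julia
# Hypothetical stand-in for A.iid: an Int64? column that may hold missing values.
a_iid = Union{Int64, Missing}[409202388, 412463440, missing]

# Hypothetical stand-in for rows of person_info: iid and fid are Strings.
fam_rows = [(fid = "0", iid = "409859435"),
            (fid = "0", iid = "409202388"),
            (fid = "0", iid = "412463440")]

# Drop missing values and collect the IDs into a Set for fast membership tests.
exclude_ids = Set(skipmissing(a_iid))

# A String is never equal to an Int, so the unparsed ID is not found.
unparsed_found = "409202388" in exclude_ids

# After parsing, the same ID is found.
parsed_found = parse(Int, "409202388") in exclude_ids

# Predicate with the same shape as the f_person argument: true means "keep".
keep_person(row) = !(parse(Int, row[:iid]) in exclude_ids)

# IDs of the stand-in rows that survive the exclusion.
kept = [row[:iid] for row in fam_rows if keep_person(row)]
```

With the real data, the same predicate could presumably be passed as `f_person = keep_person` after building `exclude_ids` from `A.iid`. This is an untested assumption about how the row object behaves, not something confirmed in this thread beyond the `x[:iid]` usage above.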

Uljibuh commented 3 years ago

Thank you, it worked :)