Closed kose-y closed 5 years ago
May I suggest using CSV.jl or IndexedTables.jl to read the .bim and .fam files? Both of these are well-maintained packages that allow more flexible specifications for the files to be read than does DelimitedFiles. In this case CSV may be more appropriate if you want to produce a DataFrame.
As an example
julia> using CSV, DataFrames
julia> const SNP_INFO_KEYS = [:chromosome, :snpid, :genetic_distance, :position, :allele1, :allele2]
julia> snp_info = categorical!(CSV.read("data/EUR_subset.bim", delim='\t', header=SNP_INFO_KEYS, types=[Int8,String,Float64,Int,String,String]), [:allele1, :allele2])
54051×6 DataFrame
│ Row │ chromosome │ snpid │ genetic_distance │ position │ allele1 │ allele2 │
│ │ Int8 │ String │ Float64 │ Int64 │ Categorical… │ Categorical… │
├───────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1 │ 17 │ rs34151105 │ 0.0 │ 1665 │ T │ C │
│ 2 │ 17 │ rs143500173 │ 0.0 │ 2748 │ T │ A │
│ 3 │ 17 │ rs113560219 │ 0.0 │ 4702 │ T │ C │
│ 4 │ 17 │ rs1882989 │ 5.6e-5 │ 15222 │ G │ A │
│ 5 │ 17 │ rs8069133 │ 0.000499 │ 32311 │ G │ A │
│ 6 │ 17 │ rs112221137 │ 0.000605 │ 36405 │ G │ T │
│ 7 │ 17 │ rs34889101 │ 0.00062 │ 36975 │ A │ C │
│ 8 │ 17 │ rs35840960 │ 0.000668 │ 38827 │ T │ A │
│ 9 │ 17 │ rs144918387 │ 0.000775 │ 42965 │ C │ T │
│ 10 │ 17 │ rs62057022 │ 0.000948 │ 49640 │ G │ A │
│ 11 │ 17 │ rs4890182 │ 0.000949 │ 49663 │ C │ T │
│ 12 │ 17 │ rs1882990 │ 0.001001 │ 51696 │ C │ T │
│ 13 │ 17 │ rs62057050 │ 0.001573 │ 65610 │ G │ T │
│ 14 │ 17 │ rs8081881 │ 0.002141 │ 78176 │ A │ G │
│ 15 │ 17 │ rs11150892 │ 0.002271 │ 80772 │ C │ T │
│ 16 │ 17 │ rs34314694 │ 0.002351 │ 82381 │ C │ T │
│ 17 │ 17 │ rs4890198 │ 0.002392 │ 83196 │ C │ G │
│ 18 │ 17 │ rs182915197 │ 0.002506 │ 85472 │ T │ C │
│ 19 │ 17 │ rs148130198 │ 0.002508 │ 85522 │ C │ T │
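The same approach should carry over to the .fam file. Below is a minimal sketch, assuming a recent CSV.jl where the sink type (DataFrame) is passed as the second argument to CSV.read. The column names in PERSON_INFO_KEYS are hypothetical labels for the six standard PLINK .fam fields, not names taken from SnpArrays.jl, and the two-line file is made-up data written to a temporary directory so the snippet is self-contained.

```julia
using CSV, DataFrames

# Hypothetical column names for the six PLINK .fam fields
# (family ID, individual ID, father ID, mother ID, sex, phenotype).
const PERSON_INFO_KEYS = [:fid, :iid, :father, :mother, :sex, :phenotype]

# Write a tiny made-up .fam file so the sketch runs on its own.
fam_path = joinpath(mktempdir(), "toy.fam")
write(fam_path, "F1 I1 0 0 1 -9\nF1 I2 0 0 2 -9\n")

# .fam files are whitespace-delimited; ignorerepeated=true tolerates runs of spaces.
# Assumption: the phenotype column is numeric (e.g. -9 for missing).
person_info = CSV.read(fam_path, DataFrame;
    delim=' ', ignorerepeated=true,
    header=PERSON_INFO_KEYS,
    types=[String, String, String, String, Int8, Float64])
```

If the phenotype column contains non-numeric codes such as NA, reading it as String (or passing a missingstring) would probably be safer.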
@dmbates Thanks for the comment. This looks better. I will update it tomorrow.
@kose-y @dmbates, thanks for the nice contribution!
@kose-y Can you add documentation for the SnpData type and split-merge functionalities to /docs/SnpArrays.ipynb? Package documentation will be generated from this notebook.
Further thought: When working with Biobank data, the sample size is up to 1 million. Analysts are splitting the Plink file into regions much smaller than a chromosome, because for large chromosomes like 1 and 21 the Plink files can still be too large for current analysis software to handle. We may need to think about a unified interface for subsetting, splitting, and merging SnpData along either SNPs or individuals.
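To make the idea concrete, here is a rough sketch of splitting and merging along either dimension. It uses a plain Matrix{UInt8} as a stand-in for genotype data and a hypothetical helper, split_ranges; neither is part of SnpArrays.jl, and a real interface would also have to carry the matching SNP and person tables along with each block.

```julia
# Stand-in genotype matrix: rows are individuals, columns are SNPs (made-up values).
G = UInt8[0 1 2 0 1;
          2 2 0 1 0;
          1 0 0 2 2]

# Hypothetical helper: contiguous index blocks of at most `blocksize` along one dimension.
split_ranges(n::Integer, blocksize::Integer) =
    [i:min(i + blocksize - 1, n) for i in 1:blocksize:n]

# Split along SNPs (columns) as views, so no genotype data is copied.
snp_blocks = [view(G, :, r) for r in split_ranges(size(G, 2), 2)]

# Merge the column blocks back into one matrix.
merged_snps = reduce(hcat, snp_blocks)

# The same helper splits along individuals (rows); vcat merges them back.
row_blocks = [view(G, r, :) for r in split_ranges(size(G, 1), 2)]
merged_rows = reduce(vcat, row_blocks)
```

Using one index-based helper for both dimensions is only one possible design; splitting by genomic position rather than by block size would need the position column from the .bim table.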
@Hua-Zhou Sure. I will update it sometime next week.
@kose-y I understand packages do not need to include the Manifest.toml file. Julia should be able to figure out the manifest from Project.toml alone. Let me know if I'm wrong.
I tried to implement #24.
I'm new to Julia, so I might have done something considered unconventional or inefficient in Julia.