OpenMendel / SnpArrays.jl

Compressed storage for SNP data
https://openmendel.github.io/SnpArrays.jl/latest

added snpdata, splitting and merging #25

Closed. kose-y closed this 5 years ago

kose-y commented 5 years ago

I tried to implement #24.

I'm new to Julia, so I might have done something considered unconventional or inefficient in Julia.

dmbates commented 5 years ago

May I suggest using CSV.jl or IndexedTables.jl to read the .bim and .fam files? Both are well-maintained packages that allow more flexible file specifications than DelimitedFiles does. In this case, CSV.jl may be more appropriate if you want to produce a DataFrame.

As an example:

julia> using CSV, DataFrames

julia> const SNP_INFO_KEYS = [:chromosome, :snpid, :genetic_distance, :position, :allele1, :allele2]

julia> snp_info = categorical!(CSV.read("data/EUR_subset.bim", delim='\t', header=SNP_INFO_KEYS, types=[Int8,String,Float64,Int,String,String]), [:allele1, :allele2])
54051×6 DataFrame
│ Row   │ chromosome │ snpid       │ genetic_distance │ position │ allele1      │ allele2      │
│       │ Int8       │ String      │ Float64          │ Int64    │ Categorical… │ Categorical… │
├───────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1     │ 17         │ rs34151105  │ 0.0              │ 1665     │ T            │ C            │
│ 2     │ 17         │ rs143500173 │ 0.0              │ 2748     │ T            │ A            │
│ 3     │ 17         │ rs113560219 │ 0.0              │ 4702     │ T            │ C            │
│ 4     │ 17         │ rs1882989   │ 5.6e-5           │ 15222    │ G            │ A            │
│ 5     │ 17         │ rs8069133   │ 0.000499         │ 32311    │ G            │ A            │
│ 6     │ 17         │ rs112221137 │ 0.000605         │ 36405    │ G            │ T            │
│ 7     │ 17         │ rs34889101  │ 0.00062          │ 36975    │ A            │ C            │
│ 8     │ 17         │ rs35840960  │ 0.000668         │ 38827    │ T            │ A            │
│ 9     │ 17         │ rs144918387 │ 0.000775         │ 42965    │ C            │ T            │
│ 10    │ 17         │ rs62057022  │ 0.000948         │ 49640    │ G            │ A            │
│ 11    │ 17         │ rs4890182   │ 0.000949         │ 49663    │ C            │ T            │
│ 12    │ 17         │ rs1882990   │ 0.001001         │ 51696    │ C            │ T            │
│ 13    │ 17         │ rs62057050  │ 0.001573         │ 65610    │ G            │ T            │
│ 14    │ 17         │ rs8081881   │ 0.002141         │ 78176    │ A            │ G            │
│ 15    │ 17         │ rs11150892  │ 0.002271         │ 80772    │ C            │ T            │
│ 16    │ 17         │ rs34314694  │ 0.002351         │ 82381    │ C            │ T            │
│ 17    │ 17         │ rs4890198   │ 0.002392         │ 83196    │ C            │ G            │
│ 18    │ 17         │ rs182915197 │ 0.002506         │ 85472    │ T            │ C            │
│ 19    │ 17         │ rs148130198 │ 0.002508         │ 85522    │ C            │ T            │

kose-y commented 5 years ago

@dmbates Thanks for the comment. This looks better. I will update it tomorrow.

Hua-Zhou commented 5 years ago

@kose-y @dmbates, thanks for the nice contribution!

@kose-y Can you add documentation for the SnpData type and the split and merge functionality to /docs/SnpArrays.ipynb? Package documentation will be generated from this notebook.

Further thought: when working with Biobank data, the sample size can be up to 1 million. Analysts split the Plink files into regions much smaller than a chromosome, because for large chromosomes like 1 and 21 the per-chromosome Plink files can still be too large for current analysis software to handle. We may need to think about a unified interface for subsetting, splitting, and merging SnpData along either SNPs or individuals.
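
To make the idea of one interface covering both axes concrete, here is a minimal sketch in plain Julia. Everything in it is hypothetical: `ToySnpData`, `subset_data`, `split_by_chromosome`, `merge_snps`, and `merge_individuals` are illustrative names, not the SnpData type or the functions in this PR, and an uncompressed `Matrix{UInt8}` stands in for the compressed genotype storage.

```julia
# Hypothetical toy container; NOT the SnpData type from this PR.
struct ToySnpData
    genotypes::Matrix{UInt8}      # individuals × SNPs (uncompressed stand-in)
    person_ids::Vector{String}    # one entry per row
    chromosomes::Vector{Int}      # one entry per column
end

# Subset along individuals (rows), SNPs (cols), or both.
# `rows` and `cols` are assumed to be ranges, index vectors, Boolean masks, or `:`.
function subset_data(d::ToySnpData; rows=Colon(), cols=Colon())
    ToySnpData(d.genotypes[rows, cols], d.person_ids[rows], d.chromosomes[cols])
end

# Split along SNPs: one piece per chromosome, keyed by chromosome number.
function split_by_chromosome(d::ToySnpData)
    Dict(c => subset_data(d; cols = d.chromosomes .== c) for c in unique(d.chromosomes))
end

# Merge along SNPs: requires the same individuals in the same order.
function merge_snps(a::ToySnpData, b::ToySnpData)
    a.person_ids == b.person_ids || throw(ArgumentError("individuals differ"))
    ToySnpData(hcat(a.genotypes, b.genotypes), a.person_ids,
               vcat(a.chromosomes, b.chromosomes))
end

# Merge along individuals: requires the same SNP columns in the same order.
function merge_individuals(a::ToySnpData, b::ToySnpData)
    a.chromosomes == b.chromosomes || throw(ArgumentError("SNP sets differ"))
    ToySnpData(vcat(a.genotypes, b.genotypes), vcat(a.person_ids, b.person_ids),
               a.chromosomes)
end

# Usage: split by chromosome, then merge the pieces back together.
d = ToySnpData(UInt8[0 1 2; 2 1 0], ["a", "b"], [1, 1, 2])
parts = split_by_chromosome(d)
restored = merge_snps(parts[1], parts[2])
```

In this sketch a single keyword-based `subset_data` serves both axes, and splitting is just repeated subsetting, so a region-level split (for example by position range rather than chromosome) would only need a different column mask. A real implementation would also have to carry the full .bim and .fam metadata and operate on the compressed representation.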

kose-y commented 5 years ago

@Hua-Zhou Sure. I will update it sometime next week.

Hua-Zhou commented 5 years ago

@kose-y As I understand it, packages do not need to include the Manifest.toml file; Julia should be able to generate the manifest from Project.toml alone. Let me know if I'm wrong.