Open kose-y opened 3 years ago
Hi,
Thank you for your work around here!
Since this issue dates back a bit, I wanted to know about the plans for the package, I would be particularly interested in write/filtering functions for BGEN files. Something in the spirit of https://openmendel.github.io/SnpArrays.jl/latest/. Is that something you would have time to do?
@olivierlabayle The write/filtering features will be much welcomed.
@kose-y You may give an update of the status and plan for these two features?
These features are to be implemented when the need arises internally or externally. Among writing/filtering features, filtering variants is the simplest because the file is variant-by-variant compression. Filtering samples is trickier and may need stats to be computed online. It cannot be as efficient as what we do in SnpArrays.jl
and needs more consideration into it.
I will prioritize filtering variants, and start implementing it this week.
@olivierlabayle If you need to filter out samples, what would be the criteria? Directly providing a binary mask vector or using genotyping success rate is the simplest, and I can implement it. Please let me know if you need some other criteria. Also, please let me know if you need other writing features.
@kose-y Thank you for your quick reply!
Eventually I would like to filter variants and samples and then write to disk. I think filtering by providing a mask or list is a nice feature because in theory I you can do anything with it with some user defined logic.
@olivierlabayle I think that kind of feature can be completed in about 2-3 weeks.
@olivierlabayle filter function is added with v0.1.9.
@kose-y amazing thank you, will try that soon!
Hi, I have thus generated some filtered bgen file using the new feature and next tried to use plink2 (v2.00a3.7LM 64-bit Intel (24 Oct 2022)) on those files. It seems the compression is not supported by the tool, is that linked to the fact that BGEN.jl uses ZSTD? What would be the way forward?
plink2 --bgen filtered.ukb_53116_chr12.bgen ref-unknown --sample filtered.ukb_53116_chr12.sample --indep-pairwise 1000 50 0.05 --exclude range exclusion_regions_hg19.txt
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink2.log.
Options in effect:
--bgen filtered.ukb_53116_chr12.bgen ref-first
--exclude range exclusion_regions_hg19.txt
--indep-pairwise 1000 50 0.05
--sample filtered.ukb_53116_chr12.sample
Start time: Wed Nov 2 15:35:04 2022
385228 MiB RAM detected; reserving 192614 MiB for main workspace.
Using up to 32 threads (change this with --threads).
--bgen: 28312 variants detected, format v1.3.
487207 samples imported from .sample file to plink2-temporary.psam .
Error: Invalid compressed SNP block in .bgen file.
I'm not sure at this point. Plink2 sometimes does nonstandard things with compression BGEN format. Could you please provide a working example with a public small dataset? I think one possibility is adding Gzip compression and seeing if the file is read correctly with plink2. I will add the option this week.
Sorry, I realize my previous comment doesn't help much. So I've tried with some examples provided in this repo and they seem to work fine. I don't know if the issue is related to plink2, I have also tried with qctoolv2 for the conversion to bed and I get a core dumped. Do you know if there is anything related to uk-biobank files that could corrupt the filtering process? Happy to try with another public dataset if you can give me other pointers.
Do you mean that the filtered files with the example data in this repo work OK with plink2 and qctoolv2?
Yes, this is right, I've done it with plink2 after filtering (example of the 10 first variants in the doc) and the program completes without error. I've done in for a variety of example files: 8, 16, 20, 32, 24, 32 bits.
I have finally suceeded in producing a minimal example using a file provided in this repo, you will need plink2 (version above) and Julia 1.8.
using BGEN
using Random
rng = MersenneTwister(0)
bgen = Bgen(BGEN.datadir("example.8bits.bgen"))
### Rand example
vidx = BitVector(rand(rng, Bool, bgen.header.n_variants))
sidx = BitVector(rand(rng, Bool, bgen.header.n_samples))
BGEN.filter("test.bgen", bgen, vidx, sidx)
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)
rm("test.bgen")
rm("test.sample")
### Even simpler example
vidx = zeros(Bool, bgen.header.n_variants)
vidx[[10, 20, 30]] .= 1
sidx = zeros(Bool, bgen.header.n_samples)
sidx[[10, 20, 30, 40, 50, 100]] .= 1
BGEN.filter("test.bgen", bgen, BitVector(vidx), BitVector(sidx))
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)
### Reading the filtered output with BGEN.jl seems to work fine
filtered_bgen = Bgen("test.bgen")
for v in iterator(filtered_bgen)
p = probabilities!(filtered_bgen, v)
println(size(p))
end
rm("test.bgen")
rm("test.sample")
So it seems the problem may occur if the masked SNPs/samples are not contiguous?
Note: that using the filtering examples provided in the documentation works fine.
@olivierlabayle Thanks for the reproducible example! I will look into it.
Did you manage to make progress on that by any chance?
Sorry for the delay. I will look into it this weekend.
@olivierlabayle I was able to reproduce the issue, but I think it will take a while to fix it. I will need to dig into Plink 2's source code to figure out where the current compression procedure and Plink's expectation for the Bgen format deviate.
I cannot find what I did wrong just based on the BGEN file format documentation. Plink might have some other constraints not listed in the format documentation.
@olivierlabayle The bug in the filter function is now fixed. Plink2 had a more rigid check on decompressed data length.
Thank you that's great, I will try again soon!
The following are the items raised so far.
[ ] select a variant by rsid or index within the region selected
[ ] getindex function
[ ] major/minor alleles for haplotypes
[x] ref allele dosage (avoid confusion with minor allele)
[x] minor allele frequency
[x] HWE test
[ ] comparison: parse timing vs. VCF format
[ ] import values in a matrix (#7)
[ ] write functions
[ ] expose interface for non-allocating probability reading functions
[ ] documentation: define RSID
[ ] documentation/code: pos -> position
[ ] allow integer arguments for chromosome selection
[ ]
show()
function forBgen
,Variant
,Genotypes
[ ] documentation: slash (
/
) for unphased, bar (|
) for phased genotypes and haplotypes[ ] documentation: clarify the explanation of
probabilities!(b, v)[i, j]
[ ] documentation: biallelic diploid examples then polyploid exceptions
[ ] same name for similar functionalities to
SnpArrays.jl