Features to add - Githubissues

kose-y commented 3 years ago

The following are the items raised so far.

[ ] select a variant by rsid or index within the region selected
[ ] getindex function
[ ] major/minor alleles for haplotypes
[x] ref allele dosage (avoid confusion with minor allele)
[x] minor allele frequency
[x] HWE test
[ ] comparison: parse timing vs. VCF format
[ ] import values in a matrix (#7)
[ ] write functions
[ ] expose interface for non-allocating probability reading functions
[ ] documentation: define RSID
[ ] documentation/code: pos -> position
[ ] allow integer arguments for chromosome selection
[ ] show() function for Bgen, Variant, Genotypes
[ ] documentation: slash (/) for unphased, bar (|) for phased genotypes and haplotypes
[ ] documentation: clarify the explanation of probabilities!(b, v)[i, j]
[ ] documentation: biallelic diploid examples then polyploid exceptions
[ ] same name for similar functionalities to SnpArrays.jl

olivierlabayle commented 2 years ago

Hi,

Thank you for your work around here!

Since this issue dates back a bit, I wanted to know about the plans for the package, I would be particularly interested in write/filtering functions for BGEN files. Something in the spirit of https://openmendel.github.io/SnpArrays.jl/latest/. Is that something you would have time to do?

Hua-Zhou commented 2 years ago

@olivierlabayle The write/filtering features will be much welcomed.

@kose-y You may give an update of the status and plan for these two features?

kose-y commented 2 years ago

These features are to be implemented when the need arises internally or externally. Among writing/filtering features, filtering variants is the simplest because the file is variant-by-variant compression. Filtering samples is trickier and may need stats to be computed online. It cannot be as efficient as what we do in SnpArrays.jl and needs more consideration into it.

I will prioritize filtering variants, and start implementing it this week.

@olivierlabayle If you need to filter out samples, what would be the criteria? Directly providing a binary mask vector or using genotyping success rate is the simplest, and I can implement it. Please let me know if you need some other criteria. Also, please let me know if you need other writing features.

olivierlabayle commented 2 years ago

@kose-y Thank you for your quick reply!

Eventually I would like to filter variants and samples and then write to disk. I think filtering by providing a mask or list is a nice feature because in theory I you can do anything with it with some user defined logic.

kose-y commented 2 years ago

@olivierlabayle I think that kind of feature can be completed in about 2-3 weeks.

kose-y commented 1 year ago

@olivierlabayle filter function is added with v0.1.9.

olivierlabayle commented 1 year ago

@kose-y amazing thank you, will try that soon!

olivierlabayle commented 1 year ago

Hi, I have thus generated some filtered bgen file using the new feature and next tried to use plink2 (v2.00a3.7LM 64-bit Intel (24 Oct 2022)) on those files. It seems the compression is not supported by the tool, is that linked to the fact that BGEN.jl uses ZSTD? What would be the way forward?

plink2 --bgen filtered.ukb_53116_chr12.bgen ref-unknown --sample filtered.ukb_53116_chr12.sample --indep-pairwise 1000 50 0.05 --exclude range exclusion_regions_hg19.txt
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --bgen filtered.ukb_53116_chr12.bgen ref-first
  --exclude range exclusion_regions_hg19.txt
  --indep-pairwise 1000 50 0.05
  --sample filtered.ukb_53116_chr12.sample

Start time: Wed Nov  2 15:35:04 2022
385228 MiB RAM detected; reserving 192614 MiB for main workspace.
Using up to 32 threads (change this with --threads).
--bgen: 28312 variants detected, format v1.3.
487207 samples imported from .sample file to plink2-temporary.psam .

Error: Invalid compressed SNP block in .bgen file.

kose-y commented 1 year ago

I'm not sure at this point. Plink2 sometimes does nonstandard things with compression BGEN format. Could you please provide a working example with a public small dataset? I think one possibility is adding Gzip compression and seeing if the file is read correctly with plink2. I will add the option this week.

olivierlabayle commented 1 year ago

Sorry, I realize my previous comment doesn't help much. So I've tried with some examples provided in this repo and they seem to work fine. I don't know if the issue is related to plink2, I have also tried with qctoolv2 for the conversion to bed and I get a core dumped. Do you know if there is anything related to uk-biobank files that could corrupt the filtering process? Happy to try with another public dataset if you can give me other pointers.

kose-y commented 1 year ago

Do you mean that the filtered files with the example data in this repo work OK with plink2 and qctoolv2?

olivierlabayle commented 1 year ago

Yes, this is right, I've done it with plink2 after filtering (example of the 10 first variants in the doc) and the program completes without error. I've done in for a variety of example files: 8, 16, 20, 32, 24, 32 bits.

olivierlabayle commented 1 year ago

I have finally suceeded in producing a minimal example using a file provided in this repo, you will need plink2 (version above) and Julia 1.8.

using BGEN
using Random

rng = MersenneTwister(0)
bgen = Bgen(BGEN.datadir("example.8bits.bgen"))

### Rand example

vidx = BitVector(rand(rng, Bool, bgen.header.n_variants))
sidx = BitVector(rand(rng, Bool, bgen.header.n_samples))
BGEN.filter("test.bgen", bgen, vidx, sidx)
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)

rm("test.bgen")
rm("test.sample")

### Even simpler example

vidx = zeros(Bool, bgen.header.n_variants)
vidx[[10, 20, 30]] .= 1

sidx = zeros(Bool, bgen.header.n_samples)
sidx[[10, 20, 30, 40, 50, 100]] .= 1

BGEN.filter("test.bgen", bgen, BitVector(vidx), BitVector(sidx))
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)

### Reading the filtered output with BGEN.jl seems to work fine

filtered_bgen = Bgen("test.bgen")

for v in iterator(filtered_bgen)
    p = probabilities!(filtered_bgen, v)
    println(size(p))
end

rm("test.bgen")
rm("test.sample")

So it seems the problem may occur if the masked SNPs/samples are not contiguous?

Note: that using the filtering examples provided in the documentation works fine.

kose-y commented 1 year ago

@olivierlabayle Thanks for the reproducible example! I will look into it.

olivierlabayle commented 1 year ago

Did you manage to make progress on that by any chance?

kose-y commented 1 year ago

Sorry for the delay. I will look into it this weekend.

kose-y commented 1 year ago

@olivierlabayle I was able to reproduce the issue, but I think it will take a while to fix it. I will need to dig into Plink 2's source code to figure out where the current compression procedure and Plink's expectation for the Bgen format deviate.

kose-y commented 1 year ago

I cannot find what I did wrong just based on the BGEN file format documentation. Plink might have some other constraints not listed in the format documentation.

kose-y commented 1 year ago

@olivierlabayle The bug in the filter function is now fixed. Plink2 had a more rigid check on decompressed data length.

olivierlabayle commented 1 year ago

Thank you that's great, I will try again soon!

OpenMendel / BGEN.jl

Features to add #8