OpenMendel / SnpArrays.jl

Compressed storage for SNP data
https://openmendel.github.io/SnpArrays.jl/latest
Other
44 stars 9 forks source link

`SnpArrays.filter` `min_maf` keyword argument not working? #116

Closed mmkim1210 closed 2 years ago

mmkim1210 commented 2 years ago

In the below example, there are three SNPs that are non-polymorphic in the given set of samples, but these SNPs do not seem to be filtered out by the min_maf = 0.05 argument in SnpArrays.filter. Is this a bug?

using SnpArrays

kgp = SnpData(joinpath(pwd(), "kgp")) # 4,966 SNPs
colinds = SnpArrays.filter(kgp.snparray; min_maf = 0.05)[2]
sum(colinds) # 4,966

n_unique = Int[]
for i in 1:size(kgp.snparray, 2)
    push!(n_unique, length(unique(view(kgp.snparray, :, i))))
end
count(x -> x == 1, n_unique) # 3

kgp.zip

biona001 commented 2 years ago

It seems like all SNPs have MAF > 0.05 in your dataset

using SnpArrays
kgp = SnpData(joinpath(pwd(), "kgp")) # 4,966 SNPs
count(x -> x ≥ 0.05, maf(kgp.snparray)) # returns 4966
biona001 commented 2 years ago

Oh, it seems like this might be a bug. There are 3 SNPs that are all 1s but their maf is somehow 0.5.

using SnpArrays
kgp = SnpData(joinpath(pwd(), "kgp")) # 4,966 SNPs
x = convert(Matrix{Float64}, kgp.snparray)
idx = Int[]
for i in 1:size(kgp.snparray, 2)
    if length(unique(view(x, :, i))) == 1
        push!(idx, i)
    end
end
idx # 1229, 1230, 1231

mafs = maf(kgp.snparray)
mafs[idx] # all 0.5
mmkim1210 commented 2 years ago

yes, exactly!

kose-y commented 2 years ago

It sounds more like systematic variant calling problem (issue with the data) than a bug with the maf function.

julia> all(view(kgp.snparray, :, 1229) .== 0x02)
true

All the genotype values are 0x02, and it means that both alleles are equally represented for the variant. Both alleles have allele frequency 0.5, so MAF is also 0.5.

I have read from the PGEN format draft that it could be a sign of systematic variant calling problem if most of the calls are heterogeneous. Variants like these can be filtered based on "info score" or variance ratio (numeric variance divided by expected binomial variance p(1-p), where p is the allele frequency).

Hua-Zhou commented 2 years ago

It's an interesting case. I agree with @kose-y that the MAF (=0.5) is calculated correctly, thus shouldn't be filtered by min_maf = 0.05. As a temporary solution, users can create a mask vector manually and filter by the mask vector.