OpenMendel / SnpArrays.jl

Compressed storage for SNP data
https://openmendel.github.io/SnpArrays.jl/latest

Edge case for function summarize() #4

Closed ericsobel closed 8 years ago

ericsobel commented 8 years ago

A trivial edge case issue: In either summarize() function if m = 0 (i.e., the SnpArray is empty), then each calculation of maf includes a divide by zero. I suggest simply making those statements conditional on m > 0.
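A minimal sketch of the suggested guard, in Julia. The helper name `guarded_maf`, its arguments, and the `fallback` keyword are hypothetical and not part of SnpArrays.jl. The sketch only assumes that a frequency is an allele count divided by twice the number of non-missing genotypes.

```julia
# Hypothetical helper, not the SnpArrays.jl implementation.
# allele_count: copies of one allele among the observed genotypes
# n_observed:   number of non-missing genotypes in the column
function guarded_maf(allele_count::Integer, n_observed::Integer; fallback::Float64 = NaN)
    # Only divide when at least one genotype was observed.
    n_observed > 0 || return fallback
    freq = allele_count / (2 * n_observed)   # each genotype carries two alleles
    return min(freq, 1 - freq)               # frequency of the rarer allele
end

guarded_maf(3, 10)                  # 0.15
guarded_maf(0, 0)                   # NaN
guarded_maf(0, 0; fallback = 0.0)   # 0.0
```

The `fallback` keyword leaves open which value an empty column should report, which is the design question this issue raises.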

Hua-Zhou commented 8 years ago

A good point. I’ll fix it.


Hua-Zhou commented 8 years ago

On second thought, when m = 0 or a column has all genotypes missing, the current code already produces NaN for maf. Try

s = SnpArray(0, 5)
summarize(s)

gives

([NaN,NaN,NaN,NaN,NaN],Bool[false,false,false,false,false],[0,0,0,0,0],Int64[])

This seems a sensible answer to me: maf cannot be calculated in these cases. Would it be better to keep the current code?
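If the NaN convention is kept, callers can screen out undefined frequencies themselves. A small sketch, assuming Julia 1.x syntax and a made-up `maf` vector shaped like the first element of the tuple shown above (the values are illustrative, not real data):

```julia
# Hypothetical maf vector, one entry per SNP column; NaN marks columns
# with no observed genotypes.
maf = [0.12, NaN, 0.31, NaN, 0.05]

defined = .!isnan.(maf)             # true where a frequency could be computed
usable_columns = findall(defined)   # indices of columns with a defined maf
```

Here `usable_columns` holds the indices 1, 3, and 5, so downstream code can skip the columns whose maf is undefined.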

ericsobel commented 8 years ago

I see your point. Of course, if there are no genotypes, then which allele is the minor one is also unknown (rather than the "allele2" implied by maf == NaN). I don't have a strong feeling about it, but I'd think that with no genotypes, the minor (and the major) allele frequency should be 0.0 (since the count of alleles is zero), and the minor_allele boolean could be true or false. Edge cases certainly can be ambiguous. If you want to leave the code as is, I'm OK with that. (I've now rewritten Ken's code where he used summarize with a possibly empty SnpArray.)

Hua-Zhou commented 8 years ago

I would follow the convention for regular array:

a = randn(0, 5)
mean(a, 1)  # column-wise mean; mean(a) alone would return a single value

produces

 NaN  NaN  NaN  NaN  NaN

That means no change to the current summarize functions, which output NaN for maf if the input SnpArray has 0 rows or if some columns have all genotypes missing.
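The arithmetic behind this convention can be sketched by hand. The snippet assumes Julia 1.x keyword syntax (`dims = 1`), which differs from the 2016-era syntax used in this thread, and computes the column means explicitly rather than through any library routine:

```julia
a = randn(0, 5)                  # 0 rows, 5 columns

col_sums  = sum(a, dims = 1)     # 1×5 matrix of 0.0: an empty sum is zero
col_means = col_sums ./ size(a, 1)   # 0.0 / 0 in every column

all(isnan, col_means)            # true
```

Because `/` on numbers in Julia is floating-point division, dividing a zero sum by a zero count yields NaN instead of raising an error, which is why an empty SnpArray ends up with NaN for maf.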