cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

RuntimeWarning in Weir & Cockerham Fst #309

Open alimanfoo opened 4 years ago

alimanfoo commented 4 years ago

Originally reported in the scikit-allel forum:


when trying to calculate FST between two populations from a VCF, for one or two of the chromosomes I get RuntimeWarnings like the following:

/opt/pyenv/versions/3.6.6/lib/python3.6/site-packages/allel/stats/fst.py:228: RuntimeWarning: divide by zero encounte
red in true_divide
  a = ((n_bar / n_C) *
/opt/pyenv/versions/3.6.6/lib/python3.6/site-packages/allel/stats/fst.py:233: RuntimeWarning: invalid value encounter
ed in multiply
  (h_bar / 4)))))
/opt/pyenv/versions/3.6.6/lib/python3.6/site-packages/allel/stats/fst.py:182: RuntimeWarning: invalid value encounter
ed in true_divide
  n_C = (n_total - (np.sum(n**2, axis=1) / n_total)) / (r - 1)
/opt/pyenv/versions/3.6.6/lib/python3.6/site-packages/allel/stats/fst.py:187: RuntimeWarning: invalid value encounter
ed in true_divide
  p = ac / an[:, np.newaxis, :]
/opt/pyenv/versions/3.6.6/lib/python3.6/site-packages/allel/stats/fst.py:220: RuntimeWarning: divide by zero encountered in true_divide
  for allele in range(n_alleles)]
/opt/pyenv/versions/3.6.6/lib/python3.6/site-packages/allel/stats/fst.py:412: RuntimeWarning: invalid value encountered in double_scalars
  return np.nansum(wa) / (np.nansum(wa) + np.nansum(wb) + np.nansum(wc))

this only happens when using the Weir & Cockerham method, but not when using the Hudson method.

The call I'm using is:

allel.windowed_weir_cockerham_fst(
 pos, g, subpops=[pop1, pop2],
 size=size, step=step, start=1, max_allele=1)

The output seems to be fine and I don't readily see noticeable differences between the results of allel.windowed_weir_cockerham_fst and allel.windowed_hudson_fst (apart from the result values...).

I'd guess it would be safe to ignore these warngins but just wanted to check here if this is an expected behaviour or that something may be wrong with the VCF.


OK, I figured it might have to do with max_allele. Setting it to 2 does not throw any warnings, and parsing the VCF there are indeed some heterozygous sites for REF and a second alternative allele (genotypes were called in a large population set and subset to in smaller groups for analyses), for example:

Chr01 1703 2/2 2/2 2/2 0/2 2/2 2/2 0/2 0/2 0/2 0/2 0/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2

Subseting the VCF to only include biallelic sites solves this, although maybe would be worth to standardise the behaviour between allel.windowed_weir_cockerham_fst and allel.windowed_hudson_fst (if that makes sense)?