mufernando opened this issue 4 years ago
> One could argue that fully missing sites should be included in the `is_accessible` mask, but I can see users not being aware of that.
Hi @mufernando, thanks for raising this. I think you're right: in general the `is_accessible` mask ought to deal with this problem, in the sense that it should be extremely rare for a site to be classified as accessible but be fully missing (otherwise it's a pretty poor accessibility mask :-). However, there is a chance that fully missing sites might not be accounted for in the `is_accessible` mask, particularly if the accessibility mask was designed on one cohort of samples and you are now measuring diversity in a different cohort. In that case a correction may be needed, and I think it would be as simple as you propose.
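For concreteness, the correction being discussed could be sketched like this (the array names and toy values are made up for illustration, not scikit-allel internals):

```python
import numpy as np

# Hypothetical per-site flags for one window of 5 sites.
is_accessible = np.array([True, True, True, False, True])
is_fully_missing = np.array([False, True, False, False, False])

# The correction: count only accessible sites that also have at
# least one non-missing genotype in the cohort being analysed.
n_bases = np.count_nonzero(is_accessible & ~is_fully_missing)
print(n_bases)  # 3
```

The accessibility mask alone would count 4 bases here; intersecting it with the per-cohort missingness flags drops the accessible-but-fully-missing site.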
I still haven't figured out the bias with respect to the proportion of missing genotypes. In Figure 3 they tested what would happen if all sites had x% missing genotypes. It is not clear why their approach is unbiased in that treatment while scikit-allel's is not.
any insights @petrelharp or @alimanfoo?
Sorry, nothing immediately obvious. Perhaps as you approach high levels of genotype missingness, you also start to get some site missingness? Paper says it's simulated data though, so doesn't fully make sense. Code doesn't do anything special as far as I can see, just counts number of 0s and 1s in each row of genotype calls, treating any other allele call (<0 or >1) as missing.
At first I thought that was the case, but they say that they simulate all sites with a fixed x% of genotypes missing, so it can't be that. I would love to see the simulation code, but it doesn't seem to be available.
> At first I thought that was the case, but they say that they simulate all sites with a fixed x% of genotypes missing, so it can't be that.
Right, it's odd.
I don't see what would be off other than when sites start to drop out entirely. I think your fix is correct, @mufernando, and it should create a pattern like in their Figure 3. However, at the sample sizes they have (50 diploids), I think the downward bias should start a lot later, so I don't know what's up with that. I'm also confused about why the methods are not identical in Figure 2.
Oh, never mind - Figure 3 makes sense: it's actually 25 diploids in each population, and problems occur when a site is entirely absent in either population (explaining why the dropoff starts earlier for dxy).
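Under an assumed model of independent per-genotype missingness at rate m, a quick back-of-the-envelope calculation shows why the dropoff should start earlier for dxy than for pi:

```python
# Probability that a site is fully missing in a population of n
# genotypes, assuming (hypothetically) that each genotype at the
# site is missing independently at rate m.
def p_fully_missing(m, n=25):
    return m ** n

m = 0.7
p_one_pop = p_fully_missing(m)
# pi uses one population; dxy needs data in BOTH populations, so a
# site drops out of dxy if it is fully missing in EITHER one:
p_either = 1 - (1 - p_one_pop) ** 2
# p_either is roughly twice p_one_pop when p_one_pop is small, so
# dxy starts losing sites (and picking up bias) at lower
# missingness levels than pi does.
```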
Hi all! Just chiming in here as @mufernando and @petrelharp contacted my co-author and me on Twitter.
I've tracked down what I think is the source of the issue here. In short, the missing genotypes were not being simulated in the way we thought: the missingness level was in fact being enforced across the whole genotype matrix, not at the level of individual sites. This means the stated missingness level is the average missingness across sites, so at higher levels of missingness some sites can end up with 100% missing data. The source of the bias is therefore likely what you suspected: sites with 100% missing data.
We had actually chosen to simulate missing data in this way because we felt it was more similar to how missingness would manifest in a real dataset. It does result in a blurring of the lines between missing genotypes and sites, however. I misread some of my notes in my code (both methods are there) and that resulted in the wrong explanation making its way into the manuscript.
We're going to fix the explanation of how the missing genotype data are simulated in the next revision of the manuscript, and include an explanation of why the bias appears at high levels of missingness. Sorry for the confusion here, and we appreciate you guys flagging this for us.
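This is a hedged sketch (toy matrix sizes and a made-up masking function, not the paper's actual simulation code) of why matrix-wide missingness produces fully missing sites while strict per-site missingness would not:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy genotype matrix: 1000 sites x 25 genotypes (sizes are made up).
n_sites, n_genos = 1000, 25
gt = rng.integers(0, 2, size=(n_sites, n_genos))

def mask_matrix_wide(gt, rate, rng):
    """Enforce `rate` missingness across the WHOLE matrix (the
    scheme described above), encoding missing calls as -1."""
    gt = gt.copy()
    gt[rng.random(gt.shape) < rate] = -1
    return gt

gt90 = mask_matrix_wide(gt, 0.9, rng)
# Because missingness is only enforced on average, some sites end up
# with 100% missing data (expected fraction: 0.9**25, ~7% of sites):
n_fully_missing = int((gt90 == -1).all(axis=1).sum())
print(n_fully_missing)
# Per-site masking would instead drop exactly 90% of the calls at
# every site, so no site would ever be fully missing.
```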
As an aside, it is not clear to me how scikit-allel's diversity functions are handling sites that are represented in the VCF, but have 100% missing genotypes. I suppose if those could be flagged with an accessibility mask, it might not matter.
I am still working on getting all the simulated data online, but in the meantime, I am attaching a link to a zipped folder containing three simulated VCFs: one with no missing data, one with 50% and one with 90%. The latter two are derived from the first. I also included my R script for creating the sparse missing data if you are interested.
Link is here: https://www.dropbox.com/s/0bn7e8k0s5o8tmw/simulated_missing_genotypes.zip?dl
Thanks again for flagging this and apologies again for the mix-up. I should also say that we both really like scikit-allel and really appreciate all the work put into maintaining it and making it a great toolkit!
Cheers,
Kieran
Just to be clear, I do not think there is evidence of a bug in how scikit-allel is computing diversity stats per se. However, the use of an accessibility mask is extremely important for producing unbiased estimates of pi and dxy (this is basically the main message of our paper, and one of the main things that our utility provides for the user).
Oh, good! Glad to hear we know what's going on.
> it is not clear to me how scikit-allel's diversity functions are handling sites that are represented in the VCF, but have 100% missing genotypes.
It's just that scikit-allel doesn't deal with this properly, as you have pointed out! But, it'd be very easy to fix that - see @mufernando's proposed fix above. Perhaps the fix can be included in your revised manuscript, somehow? I don't know who or how, but open to suggestions.
Thanks for tracking this down @ksamuk! I'd be happy to put in a PR for this. Should be real quick.
My fix was incomplete. We need to discount the fully missing sites from the denominator AND change the `fill=0` to `fill=np.nan` in the calls to `mpd`:

https://github.com/cggh/scikit-allel/blob/821997d35fc3b991ae4f6dedc4afd0033cc45ad1/allel/stats/diversity.py#L276-L277
An alternative way of fixing this would be to change the `fill=0` to `fill=np.nan` in the calls to `mpd` and discount `np.count_nonzero(np.isnan(mpd))` from `n_bases`. A little more elegant, IMO.
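A minimal sketch of this alternative (the function name and toy numbers here are mine, not scikit-allel's API):

```python
import numpy as np

def windowed_pi_sketch(mpd, n_bases):
    """Hypothetical sketch of the alternative fix: `mpd` holds the
    per-site mean pairwise differences computed with fill=np.nan
    (so fully missing sites are NaN), and `n_bases` is the number
    of accessible bases in the window."""
    n_fully_missing = np.count_nonzero(np.isnan(mpd))
    # Discount fully missing sites from the denominator.
    return np.nansum(mpd) / (n_bases - n_fully_missing)

# Toy window: 12 accessible sites, 3 of them fully missing.
mpd = np.array([0.5] * 9 + [np.nan] * 3)
pi = windowed_pi_sketch(mpd, n_bases=12)
print(pi)  # 4.5 / 9 = 0.5, not 4.5 / 12 = 0.375
```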
Ah, it would also be good to add some tests that cover fully missing sites!
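A hedged sketch of what such a test might look like, using a toy stand-in for the per-site computation (the function and names here are illustrative, not scikit-allel's actual test suite):

```python
import numpy as np

def mean_pairwise_difference_sketch(site):
    """Toy stand-in for a per-site mpd computation on biallelic
    haploid calls: returns np.nan when a site is effectively
    missing (fewer than 2 non-missing calls, encoded as < 0)."""
    calls = site[site >= 0]
    n = calls.size
    if n < 2:
        return np.nan
    n_alt = np.count_nonzero(calls)
    n_ref = n - n_alt
    # Fraction of the n*(n-1)/2 pairs that differ.
    return (2 * n_ref * n_alt) / (n * (n - 1))

# A fully missing site should yield NaN, never 0.
assert np.isnan(mean_pairwise_difference_sketch(np.array([-1, -1, -1, -1])))
# A segregating site behaves as usual: 4 differing pairs out of 6.
assert abs(mean_pairwise_difference_sketch(np.array([0, 0, 1, 1])) - 2 / 3) < 1e-12
```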
A new paper came out describing a new tool, pixy. The main goal of this new tool is to produce unbiased estimates of pi and dxy.
scikit-allel is included in their comparison, and they show that measures of pi and dxy produced by scikit-allel are downwardly biased with a high rate of missing data.
Below is a figure from their paper describing the source of the biases. As far as I understand, scikit-allel does NOT fail to properly account for missing sites in the mean pairwise difference step (see here). For fully missing sites, it would return `NaN` and not `0`.

The problem lies in how scikit-allel computes the per-site metrics of diversity. scikit-allel divides `mpd` by the number of accessible sites in a window, but does not discount fully missing sites from the denominator, i.e., as in Case 2 of the figure above, it would divide `mpd` by 12 and not 9.

This seems easily fixed: when counting the number of bases `n_bases`, we simply subtract the number of fully missing sites. Something like this would do:

Did I miss something? Does the proposed correction sound reasonable?