Suggestion to improve handling/io of mixed ploidy genotypes

timothymillar commented 5 years ago

This is a suggestion for a potential way to improve handling of mixed ploidy data in GenotypeArray.

Currently read_vcf pads all GenotypeArray rows to the length specified by numbers={'GT': ...} with -1 . This can lead to ambiguity when reading mixed ploidy VCF files because it's not clear if -1 indicates a partial genotype or a complete genotype of lower ploidy.

For example, if a VCF is read with numbers={'GT': 4} then the array [0,1,-1,-1] could be the partial tetraploid genotype '0/1/./.' or the complete diploid genotype '0/1'

A potential solution to this would be to pad the arrays with a different value i.e. -2 to indicate the ploidy of the genotype so that:

tetraploid '0/1/./.' = [0,1,-1,-1]
diploid '0/1' = [0,1,-2,-2]
triploid '0/1/.' = [0,1,-1,-2]

I'm not sure of the full implications of this suggestion but most of the core methods are easy to re-implement with this change for example an array with two tetraploids and a diploid:

>>> g = GenotypeArray([
...    [[1,0,0,0], [0,0,-2,-2], [0,1,-1,-1]],
...    [[0,0,0,0], [0,0,-2,-2], [0,0,0,0]],
...    [[0,0,0,0], [0,-1,-2,-2], [0,0,1,1]],
...    [[0,1,2,2], [0,0,-2,-2], [0,2,2,2]],
...])

is_called:

>>> np.all(g.values != -1, axis=-1)
array([[ True,  True, False],
       [ True,  True,  True],
       [ True, False,  True],
       [ True,  True,  True]])

is_missing:

>>> np.any(g.values == -1, axis=-1)
array([[False, False,  True],
       [False, False, False],
       [False,  True, False],
       [False, False, False]])

loci-wise ploidy of each sample

>>> np.sum(g.values != -2, axis=-1)
array([[4, 2, 4],
       [4, 2, 4],
       [4, 2, 4],
       [4, 2, 4]])

timothymillar commented 5 years ago

I also wanted to mention that GenotypeAlleleCountsArray is good for working with mixed ploidy data but doesn't indicate partial genotypes (or doesn't differentiate them from a genotype of lower ploidy).

This can be worked around by masking out partial genotypes in a GenotypeArray before converting to GenotypeAlleleCountsArray but only if all genotypes in the GenotypeArray are the same ploidy.

alimanfoo commented 5 years ago

Hi Tim, I'm currently away from the office but just wanted so say this is a cool suggestion. I'll follow up when I'm back in a week or so.

On Thu, 22 Aug 2019, 11:35 Tim Millar, notifications@github.com wrote:

I also wanted to mention that GenotypeAlleleCountsArray is good for working with mixed ploidy data but doesn't indicate partial genotypes (or doesn't differentiate them from a genotype of lower ploidy).

This can be worked around by masking out partial genotypes in a GenotypeArray before converting to GenotypeAlleleCountsArray but only if all genotypes in the GenotypeArray are the same ploidy.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/287?email_source=notifications&email_token=AAFLYQW7YUXBYRVPASK5YDLQFZTW3A5CNFSM4IOTNFG2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD44UVMI#issuecomment-523848369, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFLYQTNTWNNO6OLJ3GZNFLQFZTW3ANCNFSM4IOTNFGQ .

timothymillar commented 4 years ago

Some more thoughts on this:

Using a magic number like -2 creates additional overhead on most functionality. Even if a function doesn't support mixed ploidy arrays, it would have to check the entire array for -2 in order to raise an error.

An alternative approach is to use an (optional) mask-like array containing the ploidy of each genotype call. A property e.g. sitewise_ploidy (default = None) could be set to an array of shape (n_variants, n_samples). This array would contain positive integers with maximum value equal to the size of the ploidy dimension of the genotype array. This would avoid overhead for functions that don't support mixed ploidy arrays as they would only need to check that g.sitewise_ploidy is None.

Mixed ploidy functionality can be built directly on the sitewise_ploidy array or by converting this array to an allele-call level boolean mask:

>>> g = GenotypeArray([
...    [[1,0,0,0], [0,0,-1,-1], [0,1,-1,-1]],
...    [[0,0,0,0], [0,0,-1,-1], [0,0,0,0]],
...    [[0,0,0,0], [0,-1,-1,-1], [0,0,1,1]],
...    [[0,1,2,2], [0,0,-1,-1], [0,2,2,2]],
...])
>>> g.sitewise_ploidy = np.array([
...    [4, 2, 4],
...    [4, 2, 4],
...    [4, 2, 4],
...    [4, 2, 4],
... ])
>>> g.ploidy_padding()  # needs a better name

returns:

array([[[False, False, False, False], [False, False, True, True], [False, False, False, False]],
       [[False, False, False, False], [False, False, True, True], [False, False, False, False]],
       [[False, False, False, False], [False, False, True, True], [False, False, False, False]],
       [[False, False, False, False], [False, False, True, True], [False, False, False, False]]])

This mask contains the same information that the use of-2 in the array would convey. It also has the advantage of ensuring padding is right-justified.

The implementation is something like:

def ploidy_padding(self):
    # handle fixed ploidy
    if self.sitewise_ploidy is None:
        return np.zeros(self.shape, dtype=np.bool)
    else:
        max_ploidy = self.shape[-1]
        allele_idx = np.tile(np.arange(max_ploidy), self.sitewise_ploidy.shape + (1,))
        return allele_idx >= self.sitewise_ploidy

Ideally the ploidy property would be re-used for the API with appropriate setter and getter methods. in this case if self._sitewise_ploidy is None then self.ploidy would return the size of the ploidy dimension as it currently does. Otherwise it returns self._sitewise_ploidy. External functions can the check the type or shape of self.ploidy with little overhead or perhaps use an additional .is_mixed_ploidy() method.

>>> g = GenotypeArray([
...    [[1,0,0,0], [0,0,-1,-1], [0,1,-1,-1]],
...    [[0,0,0,0], [0,0,-1,-1], [0,0,0,0]],
...    [[0,0,0,0], [0,-1,-1,-1], [0,0,1,1]],
...    [[0,1,2,2], [0,0,-1,-1], [0,2,2,2]],
...])
>>> g.ploidy
4
>>> g.ploidy = np.array([
...    [4, 2, 4],
...    [4, 2, 4],
...    [4, 2, 4],
...    [4, 2, 4],
... ])
>>> g.ploidy
np.array([
   [4, 2, 4],
   [4, 2, 4],
   [4, 2, 4],
   [4, 2, 4],
])

The main drawback of this approach is the added complexity of an additional array. However it restricts this complexity to methods/functions that explicitly support mixed-ploidy genotypes.

timothymillar commented 4 years ago

@alimanfoo can you please clarify the intention of the n_allele_calls and related properties of Genotypes? n_allele_calls and n_calls ignore values of -1 or masked genotypes and so they are essentially descriptions of the array shape? I'm unsure if n_allele_calls should take into account variable ploidy levels or simply describe the array shape.

timothymillar commented 4 years ago

@alimanfoo after having a go at both of the implementations I suggested above I think neither of them are ideal:

Both implementations will require most/all existing functions to be modified to either handle mixed ploidy data or raise an error
The -2 based method is simpler but creates additional overhead everywhere because all genotypes need to be checked for values of -2
The mask-like approach has less overhead but introduces a lot of complexity to the model

It's not worth introducing this much complexity into the core model/methods for a relatively niche use-case.

I think a better way forward is for the user to explicitly pass an array of site-wise ploidy values to the subset of functions which explicitly support mixed-ploidy data. read_vcf could read sitewise-ploidy of genotypes into an additional array called 'calldata/ploidy' of shape (variant, sample)

cggh / scikit-allel

Suggestion to improve handling/io of mixed ploidy genotypes #287