Open timothymillar opened 5 years ago
I also wanted to mention that GenotypeAlleleCountsArray
is good for working with mixed ploidy data but doesn't indicate partial genotypes (or doesn't differentiate them from a genotype of lower ploidy).
This can be worked around by masking out partial genotypes in a GenotypeArray
before converting to GenotypeAlleleCountsArray
but only if all genotypes in the GenotypeArray
are the same ploidy.
Hi Tim, I'm currently away from the office but just wanted so say this is a cool suggestion. I'll follow up when I'm back in a week or so.
On Thu, 22 Aug 2019, 11:35 Tim Millar, notifications@github.com wrote:
I also wanted to mention that GenotypeAlleleCountsArray is good for working with mixed ploidy data but doesn't indicate partial genotypes (or doesn't differentiate them from a genotype of lower ploidy).
This can be worked around by masking out partial genotypes in a GenotypeArray before converting to GenotypeAlleleCountsArray but only if all genotypes in the GenotypeArray are the same ploidy.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/287?email_source=notifications&email_token=AAFLYQW7YUXBYRVPASK5YDLQFZTW3A5CNFSM4IOTNFG2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD44UVMI#issuecomment-523848369, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFLYQTNTWNNO6OLJ3GZNFLQFZTW3ANCNFSM4IOTNFGQ .
Some more thoughts on this:
Using a magic number like -2
creates additional overhead on most functionality.
Even if a function doesn't support mixed ploidy arrays, it would have to check the entire array for -2
in order to raise an error.
An alternative approach is to use an (optional) mask-like array containing the ploidy of each genotype call.
A property e.g. sitewise_ploidy
(default = None
) could be set to an array of shape (n_variants, n_samples).
This array would contain positive integers with maximum value equal to the size of the ploidy dimension of the genotype array.
This would avoid overhead for functions that don't support mixed ploidy arrays as they would only need to check that g.sitewise_ploidy is None
.
Mixed ploidy functionality can be built directly on the sitewise_ploidy
array or by converting this array to an allele-call level boolean mask:
>>> g = GenotypeArray([
... [[1,0,0,0], [0,0,-1,-1], [0,1,-1,-1]],
... [[0,0,0,0], [0,0,-1,-1], [0,0,0,0]],
... [[0,0,0,0], [0,-1,-1,-1], [0,0,1,1]],
... [[0,1,2,2], [0,0,-1,-1], [0,2,2,2]],
...])
>>> g.sitewise_ploidy = np.array([
... [4, 2, 4],
... [4, 2, 4],
... [4, 2, 4],
... [4, 2, 4],
... ])
>>> g.ploidy_padding() # needs a better name
returns:
array([[[False, False, False, False], [False, False, True, True], [False, False, False, False]],
[[False, False, False, False], [False, False, True, True], [False, False, False, False]],
[[False, False, False, False], [False, False, True, True], [False, False, False, False]],
[[False, False, False, False], [False, False, True, True], [False, False, False, False]]])
This mask contains the same information that the use of-2
in the array would convey.
It also has the advantage of ensuring padding is right-justified.
The implementation is something like:
def ploidy_padding(self):
# handle fixed ploidy
if self.sitewise_ploidy is None:
return np.zeros(self.shape, dtype=np.bool)
else:
max_ploidy = self.shape[-1]
allele_idx = np.tile(np.arange(max_ploidy), self.sitewise_ploidy.shape + (1,))
return allele_idx >= self.sitewise_ploidy
Ideally the ploidy
property would be re-used for the API with appropriate setter and getter methods.
in this case if self._sitewise_ploidy is None
then self.ploidy
would return the size of the ploidy dimension as it currently does.
Otherwise it returns self._sitewise_ploidy
.
External functions can the check the type or shape of self.ploidy
with little overhead or perhaps use an additional .is_mixed_ploidy()
method.
>>> g = GenotypeArray([
... [[1,0,0,0], [0,0,-1,-1], [0,1,-1,-1]],
... [[0,0,0,0], [0,0,-1,-1], [0,0,0,0]],
... [[0,0,0,0], [0,-1,-1,-1], [0,0,1,1]],
... [[0,1,2,2], [0,0,-1,-1], [0,2,2,2]],
...])
>>> g.ploidy
4
>>> g.ploidy = np.array([
... [4, 2, 4],
... [4, 2, 4],
... [4, 2, 4],
... [4, 2, 4],
... ])
>>> g.ploidy
np.array([
[4, 2, 4],
[4, 2, 4],
[4, 2, 4],
[4, 2, 4],
])
The main drawback of this approach is the added complexity of an additional array. However it restricts this complexity to methods/functions that explicitly support mixed-ploidy genotypes.
@alimanfoo can you please clarify the intention of the n_allele_calls
and related properties of Genotypes?
n_allele_calls
and n_calls
ignore values of -1
or masked genotypes and so they are essentially descriptions of the array shape?
I'm unsure if n_allele_calls
should take into account variable ploidy levels or simply describe the array shape.
@alimanfoo after having a go at both of the implementations I suggested above I think neither of them are ideal:
It's not worth introducing this much complexity into the core model/methods for a relatively niche use-case.
I think a better way forward is for the user to explicitly pass an array of site-wise ploidy values to the subset of functions which explicitly support mixed-ploidy data.
read_vcf
could read sitewise-ploidy of genotypes into an additional array called 'calldata/ploidy' of shape (variant, sample)
This is a suggestion for a potential way to improve handling of mixed ploidy data in
GenotypeArray
.Currently
read_vcf
pads allGenotypeArray
rows to the length specified bynumbers={'GT': ...}
with-1
. This can lead to ambiguity when reading mixed ploidy VCF files because it's not clear if-1
indicates a partial genotype or a complete genotype of lower ploidy.For example, if a VCF is read with
numbers={'GT': 4}
then the array[0,1,-1,-1]
could be the partial tetraploid genotype'0/1/./.'
or the complete diploid genotype'0/1'
A potential solution to this would be to pad the arrays with a different value i.e.
-2
to indicate the ploidy of the genotype so that:'0/1/./.'
=[0,1,-1,-1]
'0/1'
=[0,1,-2,-2]
'0/1/.'
=[0,1,-1,-2]
I'm not sure of the full implications of this suggestion but most of the core methods are easy to re-implement with this change for example an array with two tetraploids and a diploid:
is_called:
is_missing:
loci-wise ploidy of each sample