cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
288 stars 50 forks source link

joining genotype arrays #278

Closed mufernando closed 5 years ago

mufernando commented 5 years ago

Hi,

I was wondering if you have an effective way of joining two genotype arrays that do not have necessarily the same positions genotyped.

Thank you!

alimanfoo commented 5 years ago

Hi Murillo, would you like to join them on the intersection of the sites? That is possible. Also, will they have the same alleles at each site, or might the alleles be different (that complicates things).

On Mon, 15 Jul 2019, 21:05 Murillo Fernando Rodrigues, < notifications@github.com> wrote:

Hi,

I was wondering if you have an effective way of joining two genotype arrays that do not have necessarily the same positions genotyped.

Thank you!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/278?email_source=notifications&email_token=AAFLYQRW4TOUOEUG6ACVWNDP7TKBBA5CNFSM4ID2KLZKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G7J3NFA, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFLYQT7MPJKCXT35Q6XDMLP7TKBBANCNFSM4ID2KLZA .

mufernando commented 5 years ago

Hi Alistair,

Not necessarily the same alleles. And I was thinking more like a full outer join (i.e., keeping all sites but changing the ones not present in a few of the individuals as missing).

Thanks!

mufernando commented 5 years ago

Sorry, Alistair, they have the same alleles.

alimanfoo commented 5 years ago

Hi Murillo, thanks, yes you could join two genotype arrays like this. The workflow would look something like this...

# obtain variant positions for first callset somehow
pos1 = allel.SortedIndex(callset1[chrom]['variants']['POS'])
# obtain variant positions for second callset somehow
pos2 = allel.SortedIndex(callset2[chrom]['variants']['POS'])
# find the intersection of the positions
loc1, loc2 = pos1.locate_intersection(pos2)
# obtain genotypes for first callset somehow
gt1 = allel.GenotypeArray(callset1[chrom]['variants']['GT'])
# select intersection sites
gt1_isec = gt1.compress(loc1, axis=0)
# obtain genotypes for second callset somehow
gt2 = allel.GenotypeArray(callset1[chrom]['variants']['GT'])
# select intersection sites
gt2_isec = gt2.compress(loc2, axis=0)
# sanity checks
assert gt1_isec.shape[0] == gt2_isec.shape[0]
# concatenate along samples dimension
gt_concat = gt1_isec.concatenate([gt2_isec], axis=1)

Hth.