cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 50 forks source link

Fst input format #336

Open NTNguyen13 opened 4 years ago

NTNguyen13 commented 4 years ago

Hi, I'm currently calculating Fst using scikit-allel module.

I tried to use a pandas dataframe with format (n_variants, n_samples), each cell is a list of genotype, like this:


site                  HG00096   HG00097  HG00099  HG00100  HG00101                                                       
1:20915411-["G","A"]  [0, 0]  [0, 0]  [0, 0]  [0, 0]  [0, 0]
1:20915418-["C","G"]  [0, 0]  [0, 0]  [1, 0]  [0, 1]  [1, 0]
1:20915441-["G","A"]  [0, 0]  [0, 0]  [1, 0]  [0, 0]  [0, 0]

but scikit allele does not accept this format, and return the error:

a, b, c = allel.weir_cockerham_fst(test2, subpops)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Untitled-2 in 
----> 39 a, b, c = allel.weir_cockerham_fst(test2, subpops)

~/anaconda3/envs/hail/lib/python3.7/site-packages/allel/stats/fst.py in weir_cockerham_fst(g, subpops, max_allele, blen)
    107         g = GenotypeArray(g, copy=False)
    108     if g.ndim != 3:
--> 109         raise ValueError('g must have three dimensions')
    110     if g.shape[2] != 2:
    111         raise NotImplementedError('only diploid genotypes are supported')

ValueError: g must have three dimensions

Please advice me on how to process the right format for this function. At this moment I extract the genotype from VCF files, then I split the genotype by '|' and converting it to int, I wonder if there are native methods to read vcf to genotype array exists

hardingnj commented 4 years ago

Hi @NTNguyen13

You are looking for: https://scikit-allel.readthedocs.io/en/stable/io.html

If your VCF contains multiple chromosomes, you will need to supply the region=?? argument to select a single chromosome.

Thanks for your question, but generally, this issue tracker is for potential bugs. User queries should be addressed to https://groups.google.com/g/scikit-allel