cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

allel.pca should throw more informative error when invariant rows are passed #307

Open schimar opened 4 years ago

schimar commented 4 years ago

Hi,

With allel version 1.2.1 and python3.7, I am trying to perform a PCA (using SVD) with my data, and I pretty much followed the "Fast-PCA" post on alimanfoo.github.io, where I do:

coords1, model1 = al.pca(gnu, n_components=10, scaler='patterson')

This returns (in short):

/home/../.local/lib/python3.7/site-packages/numpy/lib/function_base.py in asarray_chkfinite(a, dtype, order)
    496     if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
    497         raise ValueError(
--> 498             "array must not contain infs or NaNs")
    499     return a

ValueError: array must not contain infs or NaNs`

I checked, and gnu seems to be of type integer:

isinstance(gnu.flat[0], np.integer)  
Out[..]: True

np.isfinite(gnu).all() returns True and np.isnan(gnu).any() returns False


When converting the contents of gnu to float, with:

gnufl = np.empty(list(gnu.shape))
for i in range(gnu.shape[1]):
    gnufl[:,i] = gnu[:,i].astype(float)

Now, both isinstance(gnufl.flat[0], np.float) and np.isfinite(gnufl).all() return True, and np.isnan(gnufl).any() returns 'False'

So, no NaNs and no infs in there, however, the error is the same as before.

What am I missing?

I'd really appreciate your help, thanks a lot & all the best, martin

PS: Just in case, here's the output of a (what I think should do a) check for tyoe of all values in gnufl

def chkFloat(x):
    outls = list()
    for i in range(len(x.flat)):
        y = isinstance(x.flat[i], np.floating)
        outls.append(y)
    return outls

x = chkFloat(gnufl)
np.array(x).all()
Out[..]: True
hardingnj commented 4 years ago

Hi Martin.

This does seem a bit strange. Might be worth investigating a little further what exactly gnu is.

It is a numpy array it seems. What's gnu.dtype and gnu.shape?

Can you share the snippet where gnu is created?

hardingnj commented 4 years ago

Ah- please ignore what I wrote above. This error seems to be thrown when a variant is not segregating. ie all values in a row are identical. I appreciate the error could be clearer here, but removing invariant rows should fix your problem.

schimar commented 4 years ago

of course, that makes total sense! for future reference (if needed) the following does the trick:

is_segAll = ac_subpops['all'].is_segregating()[:]
# with ac_subpops['all'] being the allele counts
gtsubseg = gtsub.compress(is_segAll, axis=0)

def ld_prune(gn, size, step, threshold=.1, n_iter=1):
    for i in range(n_iter):
        loc_unlinked = al.locate_unlinked(gn, size=size, step=step, threshold=threshold)
        n = np.count_nonzero(loc_unlinked)
        n_remove = gn.shape[0] - n
        print('iteration', i+1, 'retaining', n, 'removing', n_remove, 'variants')
        gn = gn.compress(loc_unlinked, axis=0)
    return gn

gnu = ld_prune(nAltSub, size=200, step=50, threshold=.1, n_iter=5)

coords1, model1 = al.pca(gnu, n_components=10, scaler='patterson')

Now, the pca runs fine. thanks so much, martin

alimanfoo commented 4 years ago

Just to add, there is a rare edge case where this error can still arise even for a segregating variant, if the variant has a heterozygous genotype in all individuals. In this case, although the variant is segregating, all individuals have the same genotype, and so after converting the data via GenotypeArray.to_n_alt() all values in the corresponding row are the same.

A workaround for this is something like this, where you compare all values to the first value in each row:

gt = ...  # some genotype array
gn = gt.to_n_alt()
is_informative = np.any(gn[:, 0, np.newaxis] != gn, axis=1)
gn_informative = gn.compress(is_informative, axis=0)