cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
288 stars 50 forks source link

Selecting a specific chromosome in a multi chromosome H5 file #303

Closed neobernad closed 4 years ago

neobernad commented 4 years ago

Hi,

I have a VariantChunkedTable object named as variants that looks like follows:

CHROM POS
0 chr01 405
1 chr01 1560
... ... ...
1426758 chr12 27563076
1426759 chr12 27563077

I would like to extract a list of indexes of the variants that belongs to a certain chromosome (e.g. chr01), so that later on I could use those indexes to subset the positions from a GenotypeChunkedArray object.

So far I could get up to this point:

filter_expression = '(CHROM == "chr01")'
variant_selection = variants.eval(filter_expression)[:]
np.count_nonzero(variant_selection)

The previous instructions output 200102, that is to say, there are variants belonging to chr01, however, the following instruction crashes:

variants_pass = variants.compress(variant_selection)

ValueErrorTraceback (most recent call last)
<ipython-input-316-2440de6fdad6> in <module>
      5 np.count_nonzero(variant_selection) # How many variants do we keep
      6 #np.count_nonzero(~variant_selection) # How many variants do we filter out
----> 7 variants_pass = vt.compress(variant_selection)
...
/opt/conda/lib/python3.7/site-packages/zarr/storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    386             if not filters:
    387                 # there are no filters so we can be sure there is no object codec
--> 388                 raise ValueError('missing object_codec for object array')
    389             else:
    390                 # one of the filters may be an object codec, issue a warning rather

ValueError: missing object_codec for object array

This error seems to be related to #169 and unfortunately it is up to zarr to take care of this.

That said, is there any other way to subset genotypes depending on what chromsome they are?

Thanks, José Antonio

neobernad commented 4 years ago

I could solve it!

The VariantChunkedTable can be transformed into a numpy array by slicing:

vt_np_array = variants[:] # THIS! vt_np_array is a VariantTable object
filter_expression = '(CHROM == \'chr01\')'
variant_selection = vt_np_array.eval(filter_expression)[:]
variants_pass = vt_np_array.compress(variant_selection)