Selecting a specific chromosome in a multi chromosome H5 file

Hi,

I have a VariantChunkedTable object named as variants that looks like follows:

	CHROM	POS
0	chr01	405
1	chr01	1560
...	...	...
1426758	chr12	27563076
1426759	chr12	27563077

I would like to extract a list of indexes of the variants that belongs to a certain chromosome (e.g. chr01), so that later on I could use those indexes to subset the positions from a GenotypeChunkedArray object.

So far I could get up to this point:

filter_expression = '(CHROM == "chr01")'
variant_selection = variants.eval(filter_expression)[:]
np.count_nonzero(variant_selection)

The previous instructions output 200102, that is to say, there are variants belonging to chr01, however, the following instruction crashes:

variants_pass = variants.compress(variant_selection)

ValueErrorTraceback (most recent call last)
<ipython-input-316-2440de6fdad6> in <module>
      5 np.count_nonzero(variant_selection) # How many variants do we keep
      6 #np.count_nonzero(~variant_selection) # How many variants do we filter out
----> 7 variants_pass = vt.compress(variant_selection)
...
/opt/conda/lib/python3.7/site-packages/zarr/storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    386             if not filters:
    387                 # there are no filters so we can be sure there is no object codec
--> 388                 raise ValueError('missing object_codec for object array')
    389             else:
    390                 # one of the filters may be an object codec, issue a warning rather

ValueError: missing object_codec for object array

This error seems to be related to #169 and unfortunately it is up to zarr to take care of this.

That said, is there any other way to subset genotypes depending on what chromsome they are?

Thanks, José Antonio

cggh / scikit-allel

Selecting a specific chromosome in a multi chromosome H5 file #303