cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 50 forks source link

vcf_to_zarr problem on macOS and windows #215

Closed v-a-s-a closed 5 years ago

v-a-s-a commented 6 years ago

I'm encountering the same error as #202, just on macOS.

Installation: conda install -c conda-forge scikit-allel

Code:

!wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi --no-clobber ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
import allel as ska
import numcodecs
vcf_file = 'ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'
ska.vcf_to_zarr(vcf_file, vcf_file.replace('.vcf.gz', '.zarr'),
                  fields='*', alt_number=8, overwrite=True,
                  compressor=numcodecs.Blosc(cname='zstd', clevel=1, shuffle=False))

Error/trace:

Traceback (most recent call last):
  File "./ska_ld_matrix.py", line 13, in <module>
    compressor=numcodecs.Blosc(cname='zstd', clevel=1, shuffle=False))
  File "/Users/vasya/anaconda3/envs/ld_mat/lib/python3.6/site-packages/allel/io/vcf_read.py", line 802, in vcf_to_zarr
    _zarr_store_chunk(root, keys, chunk)
  File "/Users/vasya/anaconda3/envs/ld_mat/lib/python3.6/site-packages/allel/io/vcf_read.py", line 692, in _zarr_store_chunk
    root[k].append(chunk[k], axis=0)
  File "/Users/vasya/anaconda3/envs/ld_mat/lib/python3.6/site-packages/zarr/core.py", line 2049, in append
    return self._write_op(self._append_nosync, data, axis=axis)
  File "/Users/vasya/anaconda3/envs/ld_mat/lib/python3.6/site-packages/zarr/core.py", line 1955, in _write_op
    return self._synchronized_op(f, *args, **kwargs)
  File "/Users/vasya/anaconda3/envs/ld_mat/lib/python3.6/site-packages/zarr/core.py", line 1945, in _synchronized_op
    result = f(*args, **kwargs)
  File "/Users/vasya/anaconda3/envs/ld_mat/lib/python3.6/site-packages/zarr/core.py", line 2063, in _append_nosync
    raise ValueError('shape of data to append is not compatible with the array; '
ValueError: shape of data to append is not compatible with the array; all dimensions must match except for the dimension being appended

Environment:

v-a-s-a commented 6 years ago

Happy to poke around, just let me know if you have any ideas about where to start.

alimanfoo commented 6 years ago

Hi Vassily, thanks a lot for reporting. It might help if we could narrow down which field is causing this error. Easiest way to do that would be to run for one field at a time, i.e., instead of fields='*' use fields='variants/X' where X is any of the fixed fields (CHROM, POS, REF, ALT, etc.) or any INFO subfield or fields='calldata/X' where X is GT or any FORMAT field.

On Sat, 27 Oct 2018, 21:34 Vassily Trubetskoy, notifications@github.com wrote:

Happy to poke around, just let me know if you have any ideas about where to start.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/215#issuecomment-433653207, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QqXNLZd2J3wXkHNeWosZbmT9lsKtks5upMNqgaJpZM4X9ocs .

v-a-s-a commented 6 years ago

It appears to be the 'variants/svlen' that gets added here.

Interestingly, I do havescikit-allel installed on our linux cluster, and this field does not appear to present any issues there. Though that is not such a rigorous comparison.

v-a-s-a commented 6 years ago

That is not quite right. The bug appears with a certain configuration of fields is present. The minimum being: ['variants/ALT', 'variants/INFO', 'variants/svlen']. Where 'variants/svlen' is explicitly added to the list of fields when the function is called.

Removing any one of these, and adding any number of other fields has the function run without error.

alimanfoo commented 6 years ago

Thanks Vassily, that's really helpful. Could you tell me which version of numpy you have installed on your Mac and on your Linux cluster?

On Sun, 28 Oct 2018, 13:17 Vassily Trubetskoy, notifications@github.com wrote:

That is not quite right. The bug appears with a certain configuration of fields is present. The minimum being: ['variants/ALT', 'variants/INFO', 'variants/svlen']. Where 'variants/svlen' is explicitly added to the list of fields when the function is called.

Removing any one of these, and adding any number of other fields has the function run without error.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/215#issuecomment-433723800, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QuatLYroDtDT1WP_fWSj5ROjwm_Nks5upeaUgaJpZM4X9ocs .

v-a-s-a commented 6 years ago

I am using Anaconda to manage environments on both. The numpy versions were different initially: v1.13.1 on the mac, and v1.15.2 on the cluster.

I've just upgraded my Mac numpy to v1.15.3, but I still encounter the same error.

Its a really great framework, I'm enjoying playing around with it!

alimanfoo commented 6 years ago

Well I'm glad you're enjoying it in spite of the mysterious errors!

Thanks for confirming the numpy versions, good to rule that out.

I don't have any immediate hunch as to where to go next investigating this bug, I'll need to do some digging. Something apparently OS-specific like this is unusual, especially if it affects both Windows and Mac but not Linux. I'm at a conference this week but will try to grab some time asap.

On Sun, 28 Oct 2018 at 21:25, Vassily Trubetskoy notifications@github.com wrote:

I am using Anaconda to manage environments on both. The numpy versions were different initially: v1.13.1 on the mac, and v1.15.2 on the cluster.

I've just upgraded my Mac numpy to v1.15.3, but I still encounter the same error.

Its a really great framework, I'm enjoying playing around with it!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/215#issuecomment-433742924, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8Qp-CDgTU9aSDpxGKY5Yw5TbWl2_Kks5upiDggaJpZM4X9ocs .

-- Please feel free to resend your email and/or contact me by other means if you need an urgent reply.

Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health Big Data Institute Li Ka Shing Centre for Health Information and Discovery Old Road Campus Headington Oxford OX3 7LF United Kingdom Phone: +44 (0)1865 743596 or +44 (0)7866 541624 Email: alimanfoo@googlemail.com Web: http://a http://purl.org/net/alimanlimanfoo.github.io/ Twitter: @alimanfoo https://twitter.com/alimanfoo

alimanfoo commented 6 years ago

OK, I think I have an explanation for this.

This VCF file has an INFO field named 'SVLEN'. scikit-allel will also add a computed field called 'svlen' if you explicitly ask for it by including 'variants/svlen' among the requested fields or by providing fields='*'.

If you use the read_vcf() function then you will get back two separate arrays named 'variants/SVLEN' and 'variants/svlen'. All good.

If you are using vcf_to_zarr(), and you are on a system with a case-sensitive file system, then you will also get two separate arrays in the zarr store, named 'variants/SVLEN' and 'variants/svlen'. Again, good.

If you are using vcf_to_zarr() on a system with a case-insensitive file system, which I believe will be the default on mac and windows, then problems can occur because these two arrays will get confused, and an error can arise e.g. if an attempt is made to append data from the svlen data from the second chunk to the SVLEN array created from the first chunk. Whether this error arises depends indirectly on what other fields have been requested, I suspect because of subtle ordering effects.

Bottom line, this is not a bug per se, but a name clash between two fields that occurs on case-insensitive file systems. Probably the best we could do here is to try to mitigate the risk of this occurring in some way, and/or detect and raise a specific error if it does occur. Giving that some thought...

travc commented 5 years ago

Just an FYI... A huge vcf I have triggered this exception when I tried fields='*' with vcf_to_zarr. Said I had variants/NUMALT and variants/numalt, despite the fact that the string numalt (lower-case) does not occur anywhere in the vcf file.

I'm too busy at the moment to track it down and the associated file is way too big to share... so just a heads-up in case (pun intended) someone else hits the same bug.

alimanfoo commented 5 years ago

Thanks @travc.

FWIW if providing fields='*' then scikit-allel will add a variants/numalt field which it computes on the fly. If the VCF also has a NUMALT field in the INFO field then there will be a clash on case-insensitive file systems, because scikit-allel will try to create two datasets, one called variants/numalt (computed by scikit-allel) and one called variants/NUMALT (from the INFO field). In scikit-allel 1.1 this would generate a cryptic error. In scikit-allel 1.2 this will hopefully generate a more informative error, because it specifically checks for this type of name clash.

The easiest workaround is probably to use the exclude_fields argument added in scikit-allel 1.2. E.g., something like this should stop scikit-allel trying to add the computed variants/numalt field:

vcf_path = '/path/to/data.vcf'
zarr_path = '/path/to/data.zarr'
allel.vcf_to_zarr(vcf_path, zarr_path, fields='*', exclude_fields='variants/numalt')

There is also a rename_fields argument that can be used to rename one of the clashing fields, in the case where you want both.

Further details on changes in scikit-allel 1.2 to mitigate this issue are in #216.

Hth.