Closed v-a-s-a closed 5 years ago
Happy to poke around, just let me know if you have any ideas about where to start.
Hi Vassily, thanks a lot for reporting. It might help if we could narrow down which field is causing this error. Easiest way to do that would be to run for one field at a time, i.e., instead of fields='*' use fields='variants/X' where X is any of the fixed fields (CHROM, POS, REF, ALT, etc.) or any INFO subfield or fields='calldata/X' where X is GT or any FORMAT field.
On Sat, 27 Oct 2018, 21:34 Vassily Trubetskoy, notifications@github.com wrote:
Happy to poke around, just let me know if you have any ideas about where to start.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/215#issuecomment-433653207, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QqXNLZd2J3wXkHNeWosZbmT9lsKtks5upMNqgaJpZM4X9ocs .
It appears to be the 'variants/svlen' that gets added here.
Interestingly, I do havescikit-allel
installed on our linux cluster, and this field does not appear to present any issues there. Though that is not such a rigorous comparison.
That is not quite right. The bug appears with a certain configuration of fields is present. The minimum being: ['variants/ALT', 'variants/INFO', 'variants/svlen']
. Where 'variants/svlen' is explicitly added to the list of fields when the function is called.
Removing any one of these, and adding any number of other fields has the function run without error.
Thanks Vassily, that's really helpful. Could you tell me which version of numpy you have installed on your Mac and on your Linux cluster?
On Sun, 28 Oct 2018, 13:17 Vassily Trubetskoy, notifications@github.com wrote:
That is not quite right. The bug appears with a certain configuration of fields is present. The minimum being: ['variants/ALT', 'variants/INFO', 'variants/svlen']. Where 'variants/svlen' is explicitly added to the list of fields when the function is called.
Removing any one of these, and adding any number of other fields has the function run without error.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/215#issuecomment-433723800, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QuatLYroDtDT1WP_fWSj5ROjwm_Nks5upeaUgaJpZM4X9ocs .
I am using Anaconda to manage environments on both. The numpy versions were different initially: v1.13.1 on the mac, and v1.15.2 on the cluster.
I've just upgraded my Mac numpy to v1.15.3, but I still encounter the same error.
Its a really great framework, I'm enjoying playing around with it!
Well I'm glad you're enjoying it in spite of the mysterious errors!
Thanks for confirming the numpy versions, good to rule that out.
I don't have any immediate hunch as to where to go next investigating this bug, I'll need to do some digging. Something apparently OS-specific like this is unusual, especially if it affects both Windows and Mac but not Linux. I'm at a conference this week but will try to grab some time asap.
On Sun, 28 Oct 2018 at 21:25, Vassily Trubetskoy notifications@github.com wrote:
I am using Anaconda to manage environments on both. The numpy versions were different initially: v1.13.1 on the mac, and v1.15.2 on the cluster.
I've just upgraded my Mac numpy to v1.15.3, but I still encounter the same error.
Its a really great framework, I'm enjoying playing around with it!
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/215#issuecomment-433742924, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8Qp-CDgTU9aSDpxGKY5Yw5TbWl2_Kks5upiDggaJpZM4X9ocs .
-- Please feel free to resend your email and/or contact me by other means if you need an urgent reply.
Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health Big Data Institute Li Ka Shing Centre for Health Information and Discovery Old Road Campus Headington Oxford OX3 7LF United Kingdom Phone: +44 (0)1865 743596 or +44 (0)7866 541624 Email: alimanfoo@googlemail.com Web: http://a http://purl.org/net/alimanlimanfoo.github.io/ Twitter: @alimanfoo https://twitter.com/alimanfoo
OK, I think I have an explanation for this.
This VCF file has an INFO field named 'SVLEN'. scikit-allel will also add a computed field called 'svlen' if you explicitly ask for it by including 'variants/svlen' among the requested fields or by providing fields='*'.
If you use the read_vcf() function then you will get back two separate arrays named 'variants/SVLEN' and 'variants/svlen'. All good.
If you are using vcf_to_zarr(), and you are on a system with a case-sensitive file system, then you will also get two separate arrays in the zarr store, named 'variants/SVLEN' and 'variants/svlen'. Again, good.
If you are using vcf_to_zarr() on a system with a case-insensitive file system, which I believe will be the default on mac and windows, then problems can occur because these two arrays will get confused, and an error can arise e.g. if an attempt is made to append data from the svlen
data from the second chunk to the SVLEN
array created from the first chunk. Whether this error arises depends indirectly on what other fields have been requested, I suspect because of subtle ordering effects.
Bottom line, this is not a bug per se, but a name clash between two fields that occurs on case-insensitive file systems. Probably the best we could do here is to try to mitigate the risk of this occurring in some way, and/or detect and raise a specific error if it does occur. Giving that some thought...
Just an FYI... A huge vcf I have triggered this exception when I tried fields='*'
with vcf_to_zarr.
Said I had variants/NUMALT
and variants/numalt
, despite the fact that the string numalt
(lower-case) does not occur anywhere in the vcf file.
I'm too busy at the moment to track it down and the associated file is way too big to share... so just a heads-up in case (pun intended) someone else hits the same bug.
Thanks @travc.
FWIW if providing fields='*'
then scikit-allel will add a variants/numalt
field which it computes on the fly. If the VCF also has a NUMALT
field in the INFO field then there will be a clash on case-insensitive file systems, because scikit-allel will try to create two datasets, one called variants/numalt
(computed by scikit-allel) and one called variants/NUMALT
(from the INFO field). In scikit-allel 1.1 this would generate a cryptic error. In scikit-allel 1.2 this will hopefully generate a more informative error, because it specifically checks for this type of name clash.
The easiest workaround is probably to use the exclude_fields
argument added in scikit-allel 1.2. E.g., something like this should stop scikit-allel trying to add the computed variants/numalt
field:
vcf_path = '/path/to/data.vcf'
zarr_path = '/path/to/data.zarr'
allel.vcf_to_zarr(vcf_path, zarr_path, fields='*', exclude_fields='variants/numalt')
There is also a rename_fields
argument that can be used to rename one of the clashing fields, in the case where you want both.
Further details on changes in scikit-allel 1.2 to mitigate this issue are in #216.
Hth.
I'm encountering the same error as #202, just on macOS.
Installation:
conda install -c conda-forge scikit-allel
Code:
Error/trace:
Environment: