Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0
41 stars 30 forks source link

3 things: iteritems() vs items(); rect to polar; sorting errors #26

Closed dmcgoldrick closed 5 years ago

dmcgoldrick commented 5 years ago

1) using python 3.5+ we get an error because python likes the use of ".items()" not the old ".iteritems()" from the 2.7ish syntax? Is this an issue for others using 3+?

LocusEntryFactory.py: for _, value in position2record.iteritems():

2) These two sort lines are crashing saying they cannot sort string() < int() or some type conflict

2.1)LocusEntryFactory.py: return sorted(result, key=lambda entry: (self._chrom_sort_function(entry.vcf_record.CHROM), entry.vcf_record.POS, entry.vcf_record.REF)) 2.2) ReaderTemplateFactory.py: sorted(contigs.items(), key=lambda x: self._chrom_sort_function(x[0])))

e.g.: Traceback (most recent call last): if limit >= 0: TypeError: unorderable types: AttributeError() >= int()

3) IlluminaBeadArrayFiles.py", line 600 def rect_to_polar((x,y)): this should be def rect_to_polar(x,y):? does the code even use this function ? Anyhow it's a syntax error for me

KelleyRyanM commented 5 years ago

@dmcgoldrick , At the moment, this tool is only compatible with python 2.7.x. The three issues you've noted above are caused by Python 2 vs 3 differences. There is an open pull request (https://github.com/Illumina/GTCtoVCF/pull/24) with fixes for python 3 compatibility; however, I was still following up on the syntax error for the rect_to_polar call. As currently, this is a function that takes a single argument, which is a tuple.

dmcgoldrick commented 5 years ago

Thanks Kelly --

The current sorting of the data is not working for us ether :-/ i.e. when you return sorted(foo...). Also is this code robust to encoding for latin-1 vs utf-8? We get issues with that in our pipelines. So we just had a conversion failure with a batch of gtc's that completed with the gcf2vcf tool (which then can't handle indels and made outdated vcf). 1300 samples.

https://github.com/freeseek/gtc2vcf

We ran yours and found that it was having issues with modern gtc data using AutoConvert directly from Illumina. We don't like miniconda. Also version 3+ is fairly standard now - time has come - 2.7+ has problems with interfaces/syntax/encodings and interactions with htslib, pysam, pyvcf...

Weird but the python imterpreter doesn't like your tuple syntax but it should be able to take a tuple :-)

Your competitor tool gtc2vcf makes vcf version 3 - (yes yuck). Then the output says run vcf-convert to get to the current 4.0 spec which is using perl module Vcf.pm so then we need to put in a perl dependency... geesh

We need a simple and robust backward compatibile and up-to-date solution we can rely on.

So none of these solutions are quite there for us for a real genome facility process? If we were processing 100,000 genomes and chips both code bases would be a nightmare for our pipeline with all the post processing, dependency webs and lack of informative errors.

The latest issue was the vcf Writer class failed and did not dump to vcf with a new batch of gtc files and the entry objects are not correct for the gtc we have - i put in some code to show me what it made and one batch is parsing and the other batch of gtc is not. So it is unstable depending on the manifest and gtc but it does not help us find or tell us the issues.

I will try to look at the pull request if you like?

So.... we don't yet have a complete up-to-date solution implemented.

best, Daniel

On Tue, Nov 27, 2018 at 8:04 PM KelleyRyanM notifications@github.com wrote:

@dmcgoldrick https://github.com/dmcgoldrick , At the moment, this tool is only compatible with python 2.7.x. The three issues you've noted above are caused by Python 2 vs 3 differences. There is an open pull request (#24 https://github.com/Illumina/GTCtoVCF/pull/24) with fixes for python 3 compatibility; however, I was still following up on the syntax error for the rect_to_polar call. As currently, this is a function that takes a single argument, which is a tuple.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Illumina/GTCtoVCF/issues/26#issuecomment-442311825, or mute the thread https://github.com/notifications/unsubscribe-auth/ARa6I4nUc__uaLH3aZvA77bNZW3E7mTgks5uzgsYgaJpZM4YeX7h .

-- Daniel J McGoldrick Ph.D

CMG and GRC UW Genome Sciences Center

Box 355065Seattle, WA 98195(206) 685-7342

KelleyRyanM commented 5 years ago

If python 3 compatibility is a critical need, it should be possible to support. I created a new branch at https://github.com/Illumina/GTCtoVCF/tree/python3_compatibility and currently all regression and unit tests are passing with an install of python 3.5.6.

It may be worthwhile to see if the changes in that branch address your issues before it is merged back into develop. This tool supports GTC files written by the latest version of AutoConvert. If have some files where that is not the case, I could try to help troubleshoot.

dmcgoldrick commented 5 years ago

Got it ! --

I really appreciate that new branch - great news - I will check it out immediately and get back to you with what we are getting :-).

best, Daniel

On Wed, Nov 28, 2018 at 8:16 PM KelleyRyanM notifications@github.com wrote:

If python 3 compatibility is a critical need, it should be possible to support. I created a new branch at https://github.com/Illumina/GTCtoVCF/tree/python3_compatibility and currently all regression and unit tests are passing with an install of python 3.5.6.

It may be worthwhile to see if the changes in that branch address your issues before it is merged back into develop. This tool supports GTC files written by the latest version of AutoConvert. If have some files where that is not the case, I could try to help troubleshoot.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Illumina/GTCtoVCF/issues/26#issuecomment-442700551, or mute the thread https://github.com/notifications/unsubscribe-auth/ARa6I3bmiaY0fukBqzMyhihxKWoRtPwoks5uz19fgaJpZM4YeX7h .

-- Daniel J McGoldrick Ph.D

CMG and GRC UW Genome Sciences Center

Box 355065Seattle, WA 98195(206) 685-7342

dmcgoldrick commented 5 years ago

Hi Ryan --

This new update on your branch ran to completion on a repeat of your initial regression test, and 1300 samples and the output is being evaluated against a control. We are not getting the errors we had before! :-) and so far so good.

Outstanding work! I think the issue we had in not getting a parse at all has been fixed with the update to python 3.5.6.

We will let you know if we discover any other issues - but I think you slam-dunked this one.

best, Daniel

On Wed, Nov 28, 2018 at 8:16 PM KelleyRyanM notifications@github.com wrote:

If python 3 compatibility is a critical need, it should be possible to support. I created a new branch at https://github.com/Illumina/GTCtoVCF/tree/python3_compatibility and currently all regression and unit tests are passing with an install of python 3.5.6.

It may be worthwhile to see if the changes in that branch address your issues before it is merged back into develop. This tool supports GTC files written by the latest version of AutoConvert. If have some files where that is not the case, I could try to help troubleshoot.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Illumina/GTCtoVCF/issues/26#issuecomment-442700551, or mute the thread https://github.com/notifications/unsubscribe-auth/ARa6I3bmiaY0fukBqzMyhihxKWoRtPwoks5uz19fgaJpZM4YeX7h .

-- Daniel J McGoldrick Ph.D

CMG and GRC UW Genome Sciences Center

Box 355065Seattle, WA 98195(206) 685-7342

KelleyRyanM commented 5 years ago

Thank you for your feedback in improving the tool.