Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0

VCF REF column outputting in bytes instead of string #69

Closed rarsenal closed 2 years ago

rarsenal commented 2 years ago

Hello, we are trying to apply GTCtoVCF on Illumina's iScan data with Global Diversity Array. We've converted from IDAT to GTC via iaap-cli, but we noticed the VCF output from GTCtoVCF has a few formatting issues that hopefully you could help resolve.

  1. In the REF column, instead of a base C, we have b'C', which suggests that the Python script is writing bytes instead of strings.
  2. In the ALT column, we have not only the alternative allele but also the reference allele. Could this be related to the REF column using bytes instead of strings?
  3. In the GT field, genotypes are encoded as 1 and 2, never 0. Not sure if this is also related to the bytes format?

Example line from our output: 1 762320 JHU_1.762319,exm2268640 b'C' T,C . PASS . GT:GQ 2/2:6

Are there environment variables that we should set to prevent this behavior?

For reference, we used the manifest from https://support.illumina.com/downloads/infinium-global-diversity-array-v1-product-files.html and the references were built using the provided download_reference.sh

jjzieve commented 2 years ago

@rarsenal Thanks for bringing this to our attention. I will try to reproduce this. What version of python are you running? Also, did any errors occur when building the reference genome fasta? This issue seems like it could be related to https://github.com/Illumina/GTCtoVCF/issues/64 but that only occurred with a custom fasta file.

rarsenal commented 2 years ago

Hi jjzieve,

Thanks for the fast reply! Actually, I just found the source of the error. I containerized the various tools for the pipeline we are building, so the container had both python2 and python3 environments built in. After I separated the GTCtoVCF component into a standalone container with only the miniconda2 base environment, everything is working as expected. I suppose that GTCtoVCF was inadvertently running on python3, and while it produced no runtime errors, its bytes/string decoding functions are not compatible with python3? Anyhow, thanks again for your attention, and you can close the issue when you see fit.
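For anyone landing here later, a minimal sketch of how one bytes value could explain all three symptoms under python3. This is not GTCtoVCF's actual code: `format_record` and its arguments are hypothetical, and the sketch only assumes the reference base arrives as bytes while the alleles are ordinary strings.

```python
def format_record(ref, alleles):
    """Hypothetical VCF-style formatter; NOT GTCtoVCF's real implementation."""
    # On python3, str(b"C") is the text "b'C'"; on python2 it is just "C".
    ref_text = str(ref)
    # Any allele that does not equal the formatted REF is treated as ALT.
    alts = [allele for allele in alleles if allele != ref_text]
    # Genotype index 0 is REF; ALT alleles are numbered from 1.
    index = {ref_text: 0}
    for i, alt in enumerate(alts, start=1):
        index[alt] = i
    return ref_text, alts, index


# Assumed inputs: REF read as bytes, manifest alleles as plain strings.
ref_text, alts, index = format_record(b"C", ["T", "C"])
gt = "{0}/{0}".format(index["C"])  # a homozygous C/C call
print(ref_text, ",".join(alts), gt)
```

Under python3 the reference base never matches the formatted REF, so it is listed again in ALT and every genotype index shifts up by one. Under python2 the same call gives REF C, ALT T, and a 0/0 genotype.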

jjzieve commented 2 years ago

Glad you found the issue! In hindsight, I should've known the byte vs. string issue would have a python2 vs. python3 root cause.
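Since the script produced no runtime error under python3, a small interpreter check at the start of a pipeline step could catch this earlier. The following is only a sketch: `interpreter_warning` is a hypothetical helper, not something GTCtoVCF provides, and it assumes the tool is meant to run on python2, as the fix above suggests.

```python
import sys


def interpreter_warning(version_info=sys.version_info):
    """Return a warning string if the major version is not 2, else None.

    Hypothetical helper for a wrapper script; not part of GTCtoVCF.
    """
    major = version_info[0]
    if major != 2:
        return ("Running on Python {0}; this pipeline step assumes Python 2, "
                "and bytes/str handling may differ.".format(major))
    return None


message = interpreter_warning()
if message is not None:
    # Write to stderr so the warning does not end up in redirected VCF output.
    sys.stderr.write(message + "\n")
```

In a container with both interpreters installed, running the same check with the exact interpreter that launches the conversion step shows which one is actually being picked up.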