freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License

Failed to read 1359180426 bytes when converting GTC files #28

Closed whecrane closed 3 years ago

whecrane commented 3 years ago

Dear Giulio, thank you for developing such a good tool for working with IDAT files. Following your suggestion, I successfully converted my IDAT files to GTC files. However, when I ran the command from the guide, an error occurred. I saw that someone reported a similar issue (https://github.com/freeseek/gtc2vcf/issues/13), but the solution there does not apply to my case. I used the --gtcs option; the folder contains 103 GTC files, and the same error occurs with fewer files.

$ bcftools +gtc2vcf \
  --no-version -Ou \
  --bpm $bpm_manifest_file \
  --csv $csv_manifest_file \
  --egt $egt_cluster_file \
  --gtcs $path_to_gtc_folder \
  --fasta-ref $ref \
  --extra $out_prefix.tsv

gtc2vcf 2020-08-26 https://github.com/freeseek/gtc2vcf
Reading BPM file /media/EXTend2018/Wanghe2019/GEO/GSE113093/InfiniumPsychArray-24v1-1_A1.bpm
Reading CSV file /media/EXTend2018/Wanghe2019/GEO/GSE113093/InfiniumPsychArray-24v1-1_A1.csv
Reading EGT file /media/EXTend2018/Wanghe2019/GEO/GSE113093/InfiniumPsychArray-24v1-1_A1_ClusterFile.egt
Reading GTC file /media/EXTend2018/Wanghe2019/GEO/GSE113093/GSE113093_GTC/GSM3096512_200687150051.gtc
Failed to read 1359180426 bytes from stream

Best wishes, Crane

freeseek commented 3 years ago

What happens when you run this command:

$ bcftools +gtc2vcf /media/EXTend2018/Wanghe2019/GEO/GSE113093/GSE113093_GTC/GSM3096512_200687150051.gtc

I would assume you get the same error, but that's puzzling. What system are you running this on? How did you generate the gtc file from the idat files? Can you share the gtc file if the error reproduces?

whecrane commented 3 years ago

Yes, you are right: when I run the command you gave me, I get the same error. I am running CentOS Linux release 7.5.1804. I generated the GTC file from the IDAT files. Here is my GTC file: GSM3096512_200687150051.gtc.gz

freeseek commented 3 years ago

Well, this is an incredible bug in Illumina's iaap-cli tool, caused by the use of UTF-8.

It seems that when you convert the IDAT files to GTC using the Illumina tool, the tool inserts the following string into the file:

10-29-2020 10:02 上午

Whereas if I run the same command on my computer, it inserts the following string:

10/29/2020 10:02 AM

The Illumina tool believes that to insert your string you need 19 bytes, which is its length in characters. However, since the characters "上午" are encoded in UTF-8 as \xe4\xb8\x8a\xe5\x8d\x88 (three bytes each), the string ends up being 23 bytes long instead. This shifts the location of everything that follows within the GTC file and makes the table of contents within the GTC file completely wrong, so that you cannot randomly access any data within the file.
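The 19 versus 23 byte mismatch can be checked directly in a shell. This is only a minimal sketch: the octal escapes spell out the same six UTF-8 bytes quoted above, so the result does not depend on the terminal's locale, and the variable names are made up for illustration.

```shell
# Timestamp as written on the zh_CN system. The octal escapes are the UTF-8
# bytes of "上午" (\xe4\xb8\x8a\xe5\x8d\x88 in hex).
zh_bytes=$(printf '10-29-2020 10:02 \344\270\212\345\215\210' | wc -c)

# The same timestamp with the two-byte ASCII marker "AM" instead.
ascii_bytes=$(printf '10-29-2020 10:02 AM' | wc -c)

# $(( ... )) strips any padding that some wc implementations add.
echo "zh_CN timestamp: $((zh_bytes)) bytes"
echo "ASCII timestamp: $((ascii_bytes)) bytes"
```

The first count is 23 and the second is 19: the two CJK characters cost four bytes more than the two characters the length field accounts for.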

Indeed, you can try the following, which replaces the two UTF-8 characters with the two ASCII characters "AM" so that the string is 19 bytes long again:

sed 's/\xe4\xb8\x8a\xe5\x8d\x88/AM/' GSM3096512_200687150051.gtc > 200687150051_R01C01.gtc
bcftools +gtc2vcf 200687150051_R01C01.gtc

And it works fine as the table of contents is now correctly synced with the rest of the file again.
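Since the original folder holds 103 GTC files, the same replacement could be applied in a loop. The sketch below is not part of gtc2vcf: the function name `fix_gtc_dir` and its two directory arguments are hypothetical, the \xHH escapes assume GNU sed, and the second expression assumes that files generated in the afternoon carry "下午" (\xe4\xb8\x8b\xe5\x8d\x88), which this thread did not confirm.

```shell
# Hypothetical helper: copy every .gtc file from directory $1 into directory $2,
# replacing the UTF-8 AM/PM markers with two-byte ASCII equivalents.
fix_gtc_dir() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.gtc; do
    [ -e "$f" ] || continue   # skip the literal pattern if no file matched
    # LC_ALL=C makes sed treat the file as raw bytes (GNU sed assumed).
    # First expression: "上午" -> "AM", as in the command above.
    # Second expression: "下午" -> "PM" (assumption, not verified here).
    LC_ALL=C sed -e 's/\xe4\xb8\x8a\xe5\x8d\x88/AM/' \
                 -e 's/\xe4\xb8\x8b\xe5\x8d\x88/PM/' \
                 "$f" > "$out_dir/$(basename "$f")"
  done
}
```

It would be called as `fix_gtc_dir path/to/gtc_folder path/to/fixed_folder`, after which the fixed folder can be passed to `--gtcs`. Writing to a separate directory keeps the original files untouched in case a replacement turns out to be wrong.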

whecrane commented 3 years ago

It worked, thank you very much. You are so kind. Thanks!

freeseek commented 3 years ago

@whecrane I was able to reproduce your issue with the following command:

LANG="zh_CN.UTF-8" iaap-cli [manifest] [cluster-file] . -f . -g

So my guess is that you can use this command to fix the issue on your side:

LANG="en_US.UTF-8" iaap-cli [manifest] [cluster-file] . -f . -g

I will include this in the documentation. Thank you for bringing this to my attention.