Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0

Reference not detected if lowercase characters #64

Open BiKC opened 4 years ago

BiKC commented 4 years ago

In v1.2.0, if the reference base was "a" and the GTC call was "A", this wasn't a problem. In v1.2.1 that is no longer the case, and it produces records like: 10 779284 GSA-rs2486591 g A,G . PASS

v1.2.0 would have produced: 10 779284 GSA-rs2486591 G A . PASS

jjzieve commented 4 years ago

Ok, thanks for bringing this to our attention. I will look into it.

jjzieve commented 4 years ago

I'm having trouble reproducing this. Your reference genome.fa has lowercase characters, is that correct? e.g.

>1
atcg...

Also, which product are you running? GSA version 3?

BiKC commented 4 years ago

Indeed, version 3 on GRCh37. We do use a custom genome file, however, that contains only the major chromosomes and the mitochondrion.

The problem is that the genome has both uppercase and lowercase characters. I think that when a lowercase reference base is compared to the GTC calls, which are uppercase, the two are treated as different. This causes the problem in the VCF file, which in turn causes problems further downstream in our pipeline.

agcaaaaagggcctctctgaacagattctcatgctgcctgctatgtcagg agtaagcaccttctttgtctctgactcaggagtctcaggtcatgctacca tcatttatgaagttgtgattgctgaacatgttagattgcaaacgagtaaa caggtcagaccctttacTAAGTTGATACCACTTAATTGCATTCTGAATTC CTTGTTCTGCAACACTTCAAATGACAGAGGTTTCAGCCTCCAGCTAGATA TGGACTCTTAAAAAATGTCCTAATCAGAATTCTGTAGACTCTTTTACaca gaattctgggtacaaacatcctctgtactcagaactttgaatgtacgtgt atattgtctcctggtactggtgctgaggatgaggattccagaggcttact attcttttcctgatgtcctttaggtctgtttgttaaagcttttattgttt tcctcctggatgctttctggtctcctgttttgtacgtggtcttatgcaat
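To make the suspected mechanism concrete, here is a minimal sketch of how a case-sensitive comparison could turn `G A` into `g A,G`. The function `ref_and_alts` and its `normalize_case` parameter are hypothetical names made up for illustration; this is not GTCtoVCF's actual code, only an assumption about the kind of comparison involved.

```python
def ref_and_alts(ref_base, called_alleles, normalize_case=True):
    """Return (REF, [ALT, ...]) for a single site.

    Hypothetical helper for illustration only, not GTCtoVCF code.
    ref_base comes from the FASTA and may be lowercase (soft-masked);
    called_alleles come from the array and are uppercase.
    """
    if normalize_case:
        # Uppercase the FASTA base so "g" and "G" compare as equal
        ref_base = ref_base.upper()
    alts = []
    for allele in called_alleles:
        # Any allele that does not match REF is reported as an ALT
        if allele != ref_base and allele not in alts:
            alts.append(allele)
    return ref_base, alts


# Reference base "g" from a soft-masked region, array alleles A and G
print(ref_and_alts("g", ["A", "G"], normalize_case=False))  # ('g', ['A', 'G'])
print(ref_and_alts("g", ["A", "G"], normalize_case=True))   # ('G', ['A'])
```

Without normalization the uppercase "G" allele does not match the lowercase "g" reference, so it is listed as an alternate allele, which matches the `g A,G` record reported above.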

This is a screenshot of the output we had in version 1.2.0 (left) and 1.2.1 (right) with the exact same steps. MicrosoftTeams-image (1)

Here is another example of the issue: MicrosoftTeams-image (2)

Here is an example of the error we get further downstream with an imputation tool called Beagle: MicrosoftTeams-image (3)

jjzieve commented 4 years ago

You're correct that the bug exists if the reference genome is lowercase. However, I still wasn't able to reproduce it working correctly after reverting to 1.2.0 or even older versions. Is it possible that a different FASTA file was used? I just pushed the branch https://github.com/Illumina/GTCtoVCF/tree/bug/fix-lowercase-ref-genome for this. Can you confirm whether it fixes the issue?
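One way to check whether a given build still has the problem is to scan its output for records whose REF column contains lowercase characters. The sketch below is an assumption about how such a check might look, not part of GTCtoVCF; `lowercase_ref_records` is a hypothetical helper, and it assumes standard tab-delimited VCF records with REF in the fourth column.

```python
def lowercase_ref_records(vcf_lines):
    """Collect (CHROM, POS, ID, REF) for records whose REF is not uppercase.

    Hypothetical checker for illustration; assumes tab-delimited VCF lines.
    """
    hits = []
    for line in vcf_lines:
        # Skip header/meta lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        ref = fields[3]
        if ref != ref.upper():
            hits.append((fields[0], fields[1], fields[2], ref))
    return hits


example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "10\t779284\tGSA-rs2486591\tg\tA,G\t.\tPASS\t.",
    "10\t779284\tGSA-rs2486591\tG\tA\t.\tPASS\t.",
]
print(lowercase_ref_records(example))
# [('10', '779284', 'GSA-rs2486591', 'g')]
```

To run it on a real file, pass an open file handle, e.g. `lowercase_ref_records(open("output.vcf"))`, where `output.vcf` is a placeholder path. An empty list would suggest that the lowercase REF problem is gone for that output.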