freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF

NORMX/NORMY/R/THETA missing from GenomeStudio text output #30

Closed robkar closed 3 years ago

robkar commented 3 years ago

Thanks for an excellent tool! I have been trying to use it to generate input for a CNV calling pipeline, and was pleased to discover the -Ot option for GenomeStudio text format export, which looked close enough to the format I needed. However, it seems some fields that make it to the VCF output are not exported to the text format.

Specifically, the fields I am missing are NORMX, NORMY, R, and THETA. I checked the code of gtcs_to_gs, and all of the missing fields seem to depend on BPM_LOOKUPS being set. I couldn't see why it wouldn't be set, though, so I may be on the wrong track.

Exporting the same collection of GTCs to VCF had the proper format tags included.

This call:

bcftools +${GTC2VCF} \
        -Ot \
        --bpm ${BATCH1_MFT_BPM} \
        --csv ${BATCH1_MFT_CSV} \
        --egt ${BATCH1_EGT} \
        --gtcs ${GTCDIR}/${BATCH1_NAME} \
        --fasta-ref ${REF} > ${OUT_PREFIX}.FDT.tsv

Produces output with these columns (truncated):

Index
Name
Address
Chr
Position
GenTrain Score
Frac A
Frac C
Frac G
Frac T
204379800081_R02C02.GType
204379800081_R02C02.Score
204379800081_R02C02.B Allele Freq
204379800081_R02C02.Log R Ratio
204379800081_R02C02.X Raw
204379800081_R02C02.Y Raw
204379800081_R02C02.Top Alleles
204379800081_R02C02.Plus/Minus Alleles
204379800081_R02C01.GType
204379800081_R02C01.Score
204379800081_R02C01.B Allele Freq
204379800081_R02C01.Log R Ratio
204379800081_R02C01.X Raw
204379800081_R02C01.Y Raw
...
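For anyone who wants to check their own export, here is a minimal sketch that lists the distinct per-sample field names found in the header. It assumes the header is the first line of the file, that columns are tab-separated, and that sample IDs contain no dot of their own; gs_sample_fields is a hypothetical helper name, not part of gtc2vcf.

```shell
# List the distinct per-sample field names in a GenomeStudio-style header.
# Assumptions: header on the first line, tab-separated columns, and
# per-sample columns named "<sample>.<field>" with no dot in <sample>.
gs_sample_fields() {
    head -n 1 |
        tr '\t' '\n' |                # one column name per line
        sed -n 's/^[^.]*\.//p' |      # keep only "<sample>.<field>" names, drop the sample prefix
        LC_ALL=C sort -u              # de-duplicate across samples
}
```

Usage would look like gs_sample_fields < ${OUT_PREFIX}.FDT.tsv, which makes it easy to see at a glance whether any normalized-intensity fields are present.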

While an equivalent call requesting VCF output:

bcftools +${GTC2VCF} \
        -Ou \
        --bpm ${BATCH1_MFT_BPM} \
        --csv ${BATCH1_MFT_CSV} \
        --egt ${BATCH1_EGT} \
        --gtcs ${GTCDIR}/${BATCH1_NAME} \
        --fasta-ref ${REF} \
        --extra ${OUT_PREFIX}.tsv | \
        bcftools sort -Ou -T ${TMPDIR}/bcftools-sort.XXXXXX | \
        bcftools norm -Oz -o ${OUT_PREFIX}.vcf.gz -c x -f ${REF}

produces a VCF with the expected format tags:

GT:GQ:IGC:BAF:LRR:NORMX:NORMY:R:THETA:X:Y

Tested on both the stable version from http://software.broadinstitute.org/software/gtc2vcf/ and the current GitHub version, with the same results.

I can query the VCF to get the data I need, but thought I should report this since the behavior was unexpected.
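In case it helps others who hit the same thing, this is roughly what the VCF-side workaround can look like. It is only a sketch: vcf_format_tags is a hypothetical helper, it assumes uncompressed VCF text on stdin (for example piped from bcftools view), and it prints "." for any tag a record does not carry.

```shell
# Print CHROM, POS, ID and the requested FORMAT tags for every sample.
# Reads uncompressed VCF text on stdin; tag names are given as arguments.
vcf_format_tags() {
    awk -v tags="$*" '
        BEGIN { FS = OFS = "\t"; n = split(tags, want, " ") }
        /^#/ { next }                         # skip meta and header lines
        {
            # Map each FORMAT key (column 9) to its position in the sample fields.
            split("", idx)
            m = split($9, keys, ":")
            for (i = 1; i <= m; i++) idx[keys[i]] = i

            line = $1 OFS $2 OFS $3
            # Sample columns start at column 10.
            for (s = 10; s <= NF; s++) {
                split($s, vals, ":")
                for (t = 1; t <= n; t++) {
                    v = (want[t] in idx) ? vals[idx[want[t]]] : "."
                    if (v == "") v = "."      # tag declared but value dropped
                    line = line OFS v
                }
            }
            print line
        }
    '
}
```

Usage would be along the lines of bcftools view ${OUT_PREFIX}.vcf.gz | vcf_format_tags NORMX NORMY R THETA, giving one row per variant with the four values repeated for each sample in VCF column order. bcftools query with a per-sample format string should give the same data; the awk version is shown only because it has no dependencies beyond a POSIX shell.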