Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0

BAF and LRR #14

Closed: freeseek closed this issue 5 years ago

freeseek commented 6 years ago

With GenomeStudio it is possible to output information such as BAF and LRR from the GTC files. I don't understand how it is possible to do the same using GTCtoVCF. It seems like it would be something very easy to add.

KelleyRyanM commented 6 years ago

Yes, this would be fairly straightforward. Currently, the FormatFactory is responsible for creating the list of formatters (e.g., GenotypeFormat) that will be used to generate the format field. A few questions:

In terms of configuration, would you expect this to be configured with a command-line option like "--format-field=GQ,BF,LR"?

Any thoughts about which VCF format fields would be used to convey this information?

Some VCF entries combine multiple different assays. For these, the trivial solution would be to treat the BAF and LRR as missing data. Alternatively, we could try to identify cases where reporting the average would be appropriate.
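
To make the questions above concrete, here is a minimal sketch of how a comma-separated option could select formatters, with multi-assay entries written as missing data. It is not the actual GTCtoVCF code: the class FloatFormat, the functions create_formats and parse_format_fields, and the IDs BAF and LRR are all assumptions made for illustration; only the "--format-field" option name comes from the question above.

```python
import argparse
import math


class FloatFormat:
    """Hypothetical formatter for a single float-valued FORMAT field."""

    def __init__(self, format_id, description):
        self.format_id = format_id
        self.description = description

    def format_value(self, assay_values):
        """Return the FORMAT string for one VCF record.

        assay_values holds one value per assay merged into the record.
        Records built from several assays are reported as missing (".").
        """
        if len(assay_values) != 1:
            return "."
        value = assay_values[0]
        if value is None or math.isnan(value):
            return "."
        return "{:.4f}".format(value)


# Assumed field IDs; the thread had not settled on names at this point.
AVAILABLE_FORMATS = {
    "BAF": "B Allele Frequency",
    "LRR": "Log R Ratio",
}


def create_formats(requested_ids):
    """Build one formatter per requested ID, rejecting unknown IDs."""
    unknown = [i for i in requested_ids if i not in AVAILABLE_FORMATS]
    if unknown:
        raise ValueError("Unknown format field(s): " + ",".join(unknown))
    return [FloatFormat(i, AVAILABLE_FORMATS[i]) for i in requested_ids]


def parse_format_fields(argv):
    """Parse a comma-separated --format-field option into a list of IDs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--format-field", default="",
                        help="comma-separated FORMAT field IDs")
    args = parser.parse_args(argv)
    return [f for f in args.format_field.split(",") if f]
```

For example, create_formats(parse_format_fields(["--format-field=BAF,LRR"])) would yield two formatters, each of which returns "." when given values from more than one assay.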

freeseek commented 6 years ago

I used the following conventions:

FORMAT/IGC      Number:1  Type:Float    ..  Illumina GenCall Confidence Score
FORMAT/BAF      Number:1  Type:Float    ..  B Allele Frequency
FORMAT/LRR      Number:1  Type:Float    ..  Log R Ratio
FORMAT/NORMX    Number:1  Type:Float    ..  Normalized X intensity
FORMAT/NORMY    Number:1  Type:Float    ..  Normalized Y intensity
FORMAT/R        Number:1  Type:Float    ..  Normalized R value
FORMAT/THETA    Number:1  Type:Float    ..  Normalized Theta value
FORMAT/X        Number:1  Type:Integer  ..  Raw X intensity
FORMAT/Y        Number:1  Type:Integer  ..  Raw Y intensity
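
For reference, the conventions above could be declared in a VCF header roughly as follows. This is only a sketch: the FORMAT_FIELDS list is transcribed from the table above, the ##FORMAT line layout follows the VCF specification, and the helper name format_header_line is made up for illustration.

```python
# (ID, Number, Type, Description), transcribed from the list above.
FORMAT_FIELDS = [
    ("IGC", "1", "Float", "Illumina GenCall Confidence Score"),
    ("BAF", "1", "Float", "B Allele Frequency"),
    ("LRR", "1", "Float", "Log R Ratio"),
    ("NORMX", "1", "Float", "Normalized X intensity"),
    ("NORMY", "1", "Float", "Normalized Y intensity"),
    ("R", "1", "Float", "Normalized R value"),
    ("THETA", "1", "Float", "Normalized Theta value"),
    ("X", "1", "Integer", "Raw X intensity"),
    ("Y", "1", "Integer", "Raw Y intensity"),
]


def format_header_line(field_id, number, value_type, description):
    """Render one ##FORMAT meta-information line (hypothetical helper)."""
    return '##FORMAT=<ID={},Number={},Type={},Description="{}">'.format(
        field_id, number, value_type, description)


header_lines = [format_header_line(*field) for field in FORMAT_FIELDS]

if __name__ == "__main__":
    print("\n".join(header_lines))
```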

I had not realized, though, that only GTC files generated with AutoConvert 2.0 include the BAF and LRR information. If it is not included in the GTC file to begin with, recovering it requires both the BPM and EGT files.
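
As an aside on the R and THETA fields, they are commonly derived from the normalized X and Y intensities as a polar-style transform. The sketch below assumes that convention (R = X + Y, THETA = (2/pi) * atan2(Y, X)); it is not taken from this thread, and the function name to_polar is illustrative. It does not attempt BAF or LRR, which additionally depend on cluster positions from the EGT file.

```python
import math


def to_polar(norm_x, norm_y):
    """Convert normalized X/Y intensities to (R, THETA).

    Assumed convention: R is the sum of the two intensities and THETA is
    the angle scaled to [0, 1] for non-negative intensities. Returns
    (nan, nan) when both intensities are zero.
    """
    if norm_x == 0 and norm_y == 0:
        return (math.nan, math.nan)
    r = norm_x + norm_y
    theta = math.atan2(norm_y, norm_x) * 2.0 / math.pi
    return (r, theta)
```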

I think that when multiple entries are present, treating the values as missing data is a good approach.

jjzieve commented 5 years ago

See the latest version (1.2.0); it should support BAF and LRR output.