freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License

Feature request: alternative genome reference for --genome-studio input #41

Closed robkar closed 1 year ago

robkar commented 2 years ago

Hello, and thanks for a great tool!

I am working with some older genotype data (from the PsychChip) for which the IDAT files have unfortunately been lost to time. We do still have a reasonably rich GenomeStudio text-format export and the original CSV manifest file used to generate that export. I want to combine this with newer genotyping waves for which we do have the IDATs, and would like to remap the markers with gtc2vcf so that strand and allele issues are hopefully settled once and for all. Currently, however, gtc2vcf does not permit --genome-studio to be used together with --csv and/or --sam-flank.

Would it be possible to extend gtc2vcf to this use case, or is there some vital information I am missing that makes it a bad idea or impossible?

The GS export has the following columns (columns 6-15 are repeated for each sample):

1: Index
2: Name
3: Address
4: Chr
5: Position
6: S1.GType
7: S1.Score
8: S1.Theta
9: S1.R
10: S1.X Raw
11: S1.Y Raw
12: S1.X
13: S1.Y
14: S1.B Allele Freq
15: S1.Log R Ratio
16: ...
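
To make the layout above concrete, here is a minimal Python sketch of how the per-sample columns could be grouped from the export header. It is not part of gtc2vcf. It assumes the header has already been split into a list of column names, that the five leading columns are exactly the ones listed above, and that every remaining column is named `<sample>.<field>`; the function name `group_sample_columns` is hypothetical.

```python
# Assumption: these are the fixed leading columns of the export, as listed above.
FIXED_COLUMNS = ["Index", "Name", "Address", "Chr", "Position"]


def group_sample_columns(header):
    """Map each sample ID to {field name: 0-based column index}.

    Assumes per-sample columns are named "<sample>.<field>" and that field
    names contain no dot, so the split is made on the last dot.
    """
    samples = {}
    for idx, col in enumerate(header):
        if col in FIXED_COLUMNS:
            continue
        sample, _, field = col.rpartition(".")
        if not sample:
            raise ValueError(f"unexpected column name: {col!r}")
        # dicts keep insertion order, so samples stay in file order
        samples.setdefault(sample, {})[field] = idx
    return samples


if __name__ == "__main__":
    fields = ["GType", "Score", "Theta", "R", "X Raw", "Y Raw",
              "X", "Y", "B Allele Freq", "Log R Ratio"]
    header = FIXED_COLUMNS + [f"{s}.{f}" for s in ("S1", "S2") for f in fields]
    for sample, cols in group_sample_columns(header).items():
        print(sample, cols["GType"], cols["Log R Ratio"])
```

Splitting on the last dot keeps sample IDs that themselves contain dots intact, which is why `rpartition` is used rather than `split`.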

My CSV manifest has the following columns:

1: IlmnID
2: Name
3: IlmnStrand
4: SNP
5: AddressA_ID
6: AlleleA_ProbeSeq
7: AddressB_ID
8: AlleleB_ProbeSeq
9: GenomeBuild
10: Chr
11: MapInfo
12: Ploidy
13: Species
14: Source
15: SourceVersion
16: SourceStrand
17: SourceSeq
18: TopGenomicSeq
19: BeadSetID
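
As a rough illustration of how the two files relate, the sketch below indexes the manifest rows by the `Name` column and reports which export markers have no manifest entry. It is only a sanity check one might run before attempting a remap, not how gtc2vcf handles manifests. It assumes the marker table starts at the row whose first cell is `IlmnID` and ends at the next line beginning with `[` (such as a `[Controls]` section), and the function names are hypothetical.

```python
import csv
import io


def read_manifest_assays(handle):
    """Return {Name: row dict} for the marker rows of a CSV manifest.

    Assumption: the marker table begins at the header row whose first cell
    is "IlmnID" and stops at the next section marker such as "[Controls]".
    """
    header = None
    name_col = None
    assays = {}
    for row in csv.reader(handle):
        if not row:
            continue
        first = row[0].strip()
        if header is None:
            if first == "IlmnID":
                header = row
                name_col = header.index("Name")
            continue
        if first.startswith("["):
            break
        assays[row[name_col]] = dict(zip(header, row))
    return assays


def unmatched_markers(export_names, assays):
    """List export marker names that have no entry in the manifest."""
    return [name for name in export_names if name not in assays]


if __name__ == "__main__":
    # Made-up two-line manifest, for illustration only.
    demo = io.StringIO(
        "[Assay]\n"
        "IlmnID,Name,IlmnStrand,SNP\n"
        "rs1-0_T_F_1,rs1,TOP,[A/G]\n"
        "[Controls]\n"
    )
    assays = read_manifest_assays(demo)
    print(unmatched_markers(["rs1", "rs2"], assays))
```

Matching on `Name` is a choice made here because it is the one column the export and the manifest visibly share; whether it is unique in a given manifest would need checking.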