freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License

Any suggestions on handling manifests with missing RefStrand and SourceSeq columns? #63

Closed rajwanir closed 2 months ago

rajwanir commented 3 months ago

Hi,

I am compiling a data catalog for my department where the datasets span 27+ manifests from different periods, some of them from the pre-2009 era. Many of these manifests do not have the RefStrand and SourceSeq information in either the binary .bpm or the .csv version. My goal is to use these manifests to convert GTCs to VCFs. I assume the gtc2vcf plugin absolutely requires these columns, or at least the SourceSeq column, to compute the RefStrand by aligning the flanking sequences. Could you suggest a workaround in the absence of both these columns?

I have the following columns consistently populated across the different manifests that I am working with:

address_a_id chr genome_build ilmn_id ilmn_strand map_info name ploidy snp source source_strand source_version species

And I am working with the following set of manifests for now:

HumanOmni2.5-4v1_B.csv GSAMD-24v1-0_20011747_A1.csv GSAMD-24v2-0_20024620_B1.csv Human610-Quadv1_B.csv Human660W-Quad_v1_A.csv Cardio-Metabo_Chip_11395247_C.csv Rare_Cancer_272049_A.csv HumanOmniExpress-12v1_A.csv Peguses_FU_11602373_A.csv Human1M-Duov3_B.csv HumanHap550v3_B.csv Consortium-OncoArray_15047405_A.csv Cancer_BeadChip_11459870_B.csv Immuno_BeadChip_11419691_B.csv CGEMS_P_F2_272225_A.csv Breast_Wide_Track_271628_A.csv BDCHP-1X10-HUMANHAP550_11218540_C.csv HumanOmni2.5S-8v1_B.csv HumanOmni1-Quad_v1-0_B.csv HumanExome-12v1_A.csv HumanOmni1S-8v1_A.csv GSAv3Confluence_20032937X371431_A1.csv HumanOmni25-4v1_C.csv HumanOmni2.5-8v1_A.csv HumanOmni2.5-8v1_A.csv HumanOmni5-4v1_B.csv

I would much appreciate it if you could share any possible workaround. Thank you.

freeseek commented 3 months ago

If you have both the .bpm and the .csv manifest files, you can simply realign the flanking sequences as explained here. If you provide alignments to BCFtools/gtc2vcf as an input with the option --sam-flank, you will not need the RefStrand column in the .bpm or .csv manifest files. I assume the flanking sequences should be in the .csv manifest files. I have not personally seen .csv manifest files without flanking sequences.

rajwanir commented 3 months ago

Thanks a lot for your prompt response. Here is the first record for some of the manifests I am dealing with that do not have any sequences in the .csv manifest. Do you have any pointers on how to get the RefStrand for them? These are retired chips, so they are not supported by Illumina anymore.

manifest/HumanExome-12v1_A.csv:IlmnID,Name,IlmnStrand,SNP,AssayTypeID,NormID,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,Ploidy,Species,Source,SourceVersion,SourceStrand,SourceSeq,TopGenomicSeq,CustomerStrand,GenomicStrand
manifest/HumanExome-12v1_A.csv-exm999982-0_T_R_1922542276,exm999982,TOP,[A/C],0,1,4703268,,0,,37.1,12,49445526,diploid,Homo sapiens,ExomeSNPs,0,BOT,,,BOT,

manifest/HumanOmni2.5S-8v1_B.csv:IlmnID,Name,IlmnStrand,SNP,AssayTypeID,NormID,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,Ploidy,Species,Source,SourceVersion,SourceStrand,SourceSeq,TopGenomicSeq,CustomerStrand,GenomicStrand
manifest/HumanOmni2.5S-8v1_B.csv-rs998383-131_T_F_1908644644,rs998383,TOP,[C/G],2,219,92733354,,85682455,,37.1,21,37445739,diploid,Homo sapiens,dbSNP,131,TOP,,,TOP,+

HumanOmni2.5-4v1_B.csv:IlmnID,Name,IlmnStrand,SNP,AssayTypeID,NormID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,Ploidy,Species,Source,SourceVersion,SourceStrand,SourceSeq,TopGenomicSeq,CustomerStrand,GenomicStrand,AddressA_ID
HumanOmni2.5-4v1_B.csv-VGXS35706-0_T_R_1569787094,VGXS35706,TOP,[A/G],0,20,,0,,36,X,100543394,diploid,Homo sapiens,Phencode,0,TOP,,,TOP,-,0029742399

/Human1M-Duov3_B.csv:IlmnID,Name,IlmnStrand,SNP,AssayTypeID,NormID,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,Ploidy,Species,Source,SourceVersion,SourceStrand,SourceSeq,TopGenomicSeq,CustomerStrand
/Human1M-Duov3_B.csv-cnvi0048855-0_P_F_1533402900,cnvi0048855,P,[N/A],0,3,540704401,,0,,36.2,4,97084207,diploid,Homo sapiens,Illumina assay db,0,P,,,P

/Manifests/HumanOmni2.5-8v1_A.csv:IlmnID,Name,IlmnStrand,SNP,AssayTypeID,NormID,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,Ploidy,Species,Source,SourceVersion,SourceStrand,SourceSeq,TopGenomicSeq,CustomerStrand,GenomicStrand
/Manifests/HumanOmni2.5-8v1_A.csv-rs9998545-131_B_F_1893938319,rs9998545,BOT,[T/C],0,5,35626273,,0,,37.1,4,183931994,diploid,Homo sapiens,dbSNP,131,BOT,,,BOT,-

/Manifests/HumanOmniExpress-12v1_A.csv:IlmnID,Name,IlmnStrand,SNP,AssayTypeID,NormID,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,Ploidy,Species,Source,SourceVersion,SourceStrand,SourceSeq,TopGenomicSeq,CustomerStrand,GenomicStrand
/Manifests/HumanOmniExpress-12v1_A.csv-VGXS35706-0_T_R_1569787094,VGXS35706,TOP,[A/G],0,4,29742399,,0,,36,X,100543394,diploid,Homo sapiens,Phencode,0,TOP,,,TOP,-

/HumanHap550v3_B.csv:IlmnID,Name,IlmnStrand,SNP,AssayTypeID,NormID,AddressA_ID,AlleleA_ProbeSeq,AddressB_ID,AlleleB_ProbeSeq,GenomeBuild,Chr,MapInfo,Ploidy,Species,Source,SourceVersion,SourceStrand,SourceSeq,TopGenomicSeq,CustomerStrand,GenomicStrand
/HumanHap550v3_B.csv-rs3104240-126_B_R_IFB1141262046:0,rs3104240,Bot,[T/C],0,19,803420739,,0,,36,18,24521689,2,Homo sapiens,dbSNP;refSNP,126,TOP,,,TOP,
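For anyone triaging a similar pile of manifests, a quick way to see which ones carry any sequence at all is to count non-empty values in the sequence columns. The following is only a minimal sketch and not part of gtc2vcf: the function name `sequence_coverage` is made up for illustration, and it assumes the record table starts at the row whose first field is `IlmnID` and ends at the next line beginning with `[` (such as `[Controls]`).

```python
import csv

# Columns from the headers above that may hold sequence usable for realignment.
SEQ_COLUMNS = ("SourceSeq", "TopGenomicSeq", "AlleleA_ProbeSeq")


def sequence_coverage(lines):
    """Count records with a non-empty value in each sequence column.

    `lines` is any iterable of text lines from a .csv manifest.
    Returns (number_of_records, {column_name: non_empty_count}).
    """
    reader = csv.reader(lines)

    # Skip any preamble until the column header row is found.
    header = None
    for row in reader:
        if row and row[0] == "IlmnID":
            header = row
            break
    if header is None:
        raise ValueError("no header row starting with IlmnID found")

    # Only track the sequence columns this manifest actually has.
    index = {name: header.index(name) for name in SEQ_COLUMNS if name in header}
    counts = {name: 0 for name in index}
    total = 0

    for row in reader:
        if not row or not row[0].strip():
            continue  # blank line
        if row[0].startswith("["):
            break  # assumed start of the next section, e.g. [Controls]
        total += 1
        for name, i in index.items():
            if i < len(row) and row[i].strip():
                counts[name] += 1
    return total, counts
```

Calling it as `sequence_coverage(open(path, newline=""))` on each manifest gives a per-file summary. A manifest where every count is zero, like the records shown above, has nothing to realign.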

rajwanir commented 2 months ago

Closing the issue with some final comments: For most commercial chips, Illumina has revised the manifests (e.g. HumanHap550v3_A to HumanHap550v3_B, _C or _H). The revised manifests typically have better annotations for SourceSeq, RefStrand and GenomeBuild. However, there are chips that Illumina retired a long time ago for which updated manifests aren't available. There are also custom chips, for which updated manifests with SourceSeq need to be requested from Illumina. There is no analytical solution in the absence of SourceSeq or at least some sequence column.