Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0
41 stars 31 forks source link

Any workaround for missing RefStrand and SourceSeq column in the manifest? #78

Closed rajwanir closed 3 months ago

rajwanir commented 4 months ago

Hello,

Thanks for this very useful tool. I am working on a compilation of data catalog. The dataset span many different beadpool manifests across time some of them pre-2009 or so. I have both binary bead pool manifest as well the csv version.

I wish to use the GTCtoVCF to convert the entire dataset to VCF for downstream processing. However since these beadpool manifests are older most does not have the RefStrand column and many do not have the SourceSeq column. I am fine with --skip-indels and focus on the single-nucleotide-polymorphsims, however, the RefStrand is abosolutely required and missing in both the binary and csv version.

Could you please advise on possible workaround to infer the RefStrand column? I have following columns populated consistently across chips:

address_a_id chr genome_build ilmn_id ilmn_strand map_info name ploidy snp source source_strand source_version species

I am working with the following manifests:

HumanOmni2.5-4v1_B.csv GSAMD-24v1-0_20011747_A1.csv GSAMD-24v2-0_20024620_B1.csv Human610-Quadv1_B.csv Human660W-Quad_v1_A.csv Cardio-Metabo_Chip_11395247_C.csv Rare_Cancer_272049_A.csv HumanOmniExpress-12v1_A.csv Peguses_FU_11602373_A.csv Human1M-Duov3_B.csv HumanHap550v3_B.csv Consortium-OncoArray_15047405_A.csv Cancer_BeadChip_11459870_B.csv Immuno_BeadChip_11419691_B.csv CGEMS_P_F2_272225_A.csv Breast_Wide_Track_271628_A.csv BDCHP-1X10-HUMANHAP550_11218540_C.csv HumanOmni2.5S-8v1_B.csv HumanOmni1-Quad_v1-0_B.csv HumanExome-12v1_A.csv HumanOmni1S-8v1_A.csv GSAv3Confluence_20032937X371431_A1.csv HumanOmni25-4v1_C.csv HumanOmni2.5-8v1_A.csv HumanOmni2.5-8v1_A.csv HumanOmni5-4v1_B.csv

I would appreciate any suggestion or pointers. Thank you.

jzieve commented 4 months ago

Hey @rajwanir, unfortunately the RefStrand is required to figure out the context of the SNP and what is REF/ALT w.r.t the genome. I don't see a workaround without a slight code change but I do see a potentially pretty easy way to meet your needs if you're willing. On lines https://github.com/Illumina/GTCtoVCF/blob/cc3f2c0e359e4f5d7626e56b4c1c3b8b1f2cab4f/BPMReader.py#L101 and https://github.com/Illumina/GTCtoVCF/blob/cc3f2c0e359e4f5d7626e56b4c1c3b8b1f2cab4f/BPMReader.py#L141 you can remove sourcestrand from the required columns and on line: https://github.com/Illumina/GTCtoVCF/blob/cc3f2c0e359e4f5d7626e56b4c1c3b8b1f2cab4f/BPMReader.py#L144, skip the setting of the IndelSourceSequence.

Now using the csv manifest as input, you just need to spoof the RefStrand (assuming you don't care about what the true REF/ALT is). So essentially add that column to your csv manifests and set them all to "+". If you do care about the true REF/ALT, you would need to make a script to map the SNPs to the reference genome and find out if they are on that strand or the reverse compliment ("-") strand. Let me know if that makes sense/helps.

rajwanir commented 4 months ago

Thanks you so much for your response and references to the code to skip the source strand check and indels. Spoofing RefStrand is an straightforward option, however, in that case the REF/ALT in the VCF would be incorrect, right? I think, the resulting VCF may not be much useful for downstream if the REF/ALT would be an imitation too.

I am up for mapping SNPs to the reference genome, however, not clear on how would I go about it without the sourceseq or in many cases any sequence in the csv manifest. I feel doubtful on the accuracy if I use the position to retrieve the reference genome base and then infer +/- based on the SNP. Is this how you would think too mapping SNPs to the reference genome?

Thanks

jzieve commented 4 months ago

Yes, for a "correct" VCF, the true REF/ALT would be necessary. You would need some sequence to accurately map (though even this can be error prone). I think either probe_a or address_a_id should be a sequence available in those manifests? You could use that and chr+pos/map_info to figure out the refstrand via counting the number of matched bases for that sequence and the reverse compliment of that sequence. If the former has more matches (and is maybe above some threshold like 80%) then refstrand=+, if the revcomp has more then refstrand=-.

jzieve commented 4 months ago

Sorry, just checked a manifest, the addresses wouldn't help. Is there at least a AlleleA_ProbeSeq?

rajwanir commented 4 months ago

Thanks a lot @jzieve for sharing your thoughts and prompt response. Unfortunately, No sequence column is consistently across manifests I am working with. These are pretty old ones. Roughly 50% of the manifests have the AlleleA_ProbeSeq populated so there is hope to recover the RefStrand based on the logic you just described. However, for the ones that do not have the AlleleA_ProbeSeq or any other sequence column populated, it's nearly impossible. The chr+pos/map_info alone cannot be used to infer the RefStrand I think. What do you think?

Here is the list of columns that I have populated across the manifests: address_a_id chr genome_build ilmn_id ilmn_strand map_info name ploidy snp source source_strand source_version species

jzieve commented 4 months ago

I agree without any sequence it would be nearly impossible to get an accurate mapping. I suppose you could always just map the SNP and accept that it would be pretty error prone. I would recommend when you do have the AlleleA_ProbeSeq to try the method I described. For the other manifests, at least the commercial ones like HumanExome-12v1_A.csv, you may be able to inquire at techsupport@illumina.com and they may be able to get some probe sequences for you based on the ilmn_ids. Or perhaps even get you a newer manifest with those columns populated? However I don't know the policy on custom products, you may need to verify ownership of some designs. Hope that helps.

rajwanir commented 3 months ago

Thanks a lot for the discussion @jzieve . I checked a bit. You are correct. For most commerical chips, there has been revisions to the manifest which might fix some of these missing sequence or reference strand issue. This is usually added with a letter suffix (e.g. _A, _B, _C) appended to the manifest file name. For custom chips or ones where there is no update, it will remain to be very tricky and the only feasible way to move forward seems like to obtain at least the sequence information first. Thank you.