Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0
41 stars 31 forks source link

Tool assumptions and behaviour #80

Open rajwanir opened 3 months ago

rajwanir commented 3 months ago

Hi @jzieve,

As a fix to #78, I might need to use a different csv beadpool manifest version than the orignal manifest used to generate idat/gtc. Many of new manifests will have fewer loci than the manifest. To ensure the resulting vcf is correct, may I please confirm some assumptions of the tool:

  1. Assumes snps to be in the same order between beadpool manifest and gtc.
  2. Manifest file name encoded in gtc should match the one provided as the parameter.
  3. Does not need any sequence column for SNP records. No effect or consequence.
  4. Will skip snp records that do not have ref_strand set.
  5. Sequence is required for indel records. It can be bypassed with a fake sequence. It will fail on those records and then skip.

Do you see any simple change to the code that will retrieve the genotypes from gtc based on the order in manifest (e.g. by querying through names) and not assuming the same order?

jzieve commented 3 months ago

In general, I wouldn't recommend this. I would recommend regenerating the GTCs with the new manifest. You should be able to use illumina's new tool for that, DRAGEN Array: https://support.illumina.com/array/array_software/dragen-array-secondary-analysis/downloads.html.

For the assumptions though if you proceed with using the old manifest: 1) Yes. However its a bit more complicated in that the normalization transforms are encoded in the GTC as well and different ordering will likely cause issues (partially my reasoning behind recommending just regenerating the GTCs). 2) This shouldn't matter for this tool. However DRAGEN Array's gtc-to-vcf command does check if the manifests are the same and will error out if they aren't. 3) Correct. 4) No, it needs the ref_strand to determine the plus_strand_alleles (i.e., https://github.com/Illumina/GTCtoVCF/blob/3ba17b8e80deb0a81a49b2ae5de6d78d6dfa332a/BPMRecord.py#L201). That field downstream effectively becomes the REF/ALT in the VCF when compared with the reference genome. You could hack it to just set plus_strand_alleles to what is top and bot if you want. 5) Correct. It can also be bypassed by just using the BPM manifest without the SourceSeq and providing the --skip-indels option.

Unfortunately, because the loci names are not actually encoded in the GTC (see here: https://github.com/Illumina/BeadArrayFiles/blob/develop/docs/GTC_File_Format_v5.pdf) and relies on an ordered match/lookup from the manifest, I don't see an easy way to treat the GTC like a dictionary with the loci names as keys. You could use https://github.com/Illumina/BeadArrayFiles to achieve that by parsing the GTCs and creating whatever data structure you want (or your own VCFs) . But you would need the original manifests for the old GTCs regardless. I still think it would be less work to regenerate the GTCs and then use either this or DRAGEN Array's gtc-to-vcf functionality. The IDATs should still work for the newer manifests (they are less sensitive to subsetting like the GTCs).

Hope that helps.

rajwanir commented 3 months ago

Thanks so much @jzieve for the prompt response. I agree on the issues of creating GTCs with the new manifest. My thought is to create GTCs with orignal manifest and then VCFs with the new manifest which may have fewer snps and updated annotations. I missed the point that GTCs do not have the loci_names encoded so the order is critical. So essentially, if I wish to use the new manifest to convert GTC>VCF, the new manifest should have the same snp set in the same exact order as the orignal manifest. The annotations (e.g. chrom coordinates, ref_strand etc) can be different but the snps can't be filtered out or ordering can't be changed between new and orignal manifest.

I want to use DRAGEN Array's gtc-to-vcf but it's not open source and has restrictions which cannot be bypassed. For instance, it requires bpm and csv_manifest for execution. So in my case to use a different bpm and csv manifest version would need to extensively tested. Additionally, for quite a few issues, it throws an error instead of a warning, no option to --skip-indels . This tool offers more flexibility because of the open source. I would want to switch to DRAGEN Array's gtc-to-vcf for it's updates and bug fixes if it was open source and had somewhere to discuss the issues.

Thanks a lot for the clarification. Appreciate it very much.

rajwanir commented 3 months ago

Hello @jzieve

May I please confirm if the impact of the warnings below is restricted to the said record? OR once a record fails anything downstream would be incorrect in vcf due to GTC order being highly important?

Examples of the warnings I see in the log:

  1. Failed to process entry for record
  2. Incomplete match of source sequence to genome for indel
  3. Reference allele is not queried for locus

I attach the full log for your reference. gtc_to_vcf.log

Per above discussion, I tried IDAT>GTC with orignal manifest (m1). Subseqeuntly, for GTC>VCF (using GTCtoVCF) I used a different manifest (m2). m2 has updated GenomeBuild, RefStrand, SourceSeq annotations since these are missing in the m1. I ensured both m1 and m2 contains the identical number of records and in the same order as m1. I prepared m2 by a simple left join to m1 and importing the needed annotations. m2 is almost complete for needed annotations. About ~7% of records in m2 do not have the RefStrand set or the SourceSeq is fake with an understanding these records will fail and be skipped from the output.

I ran GTCtoVCF with m2 and it executes fine as expected. The output VCF looks fine to me with those 7% records skipped as I expect. The log (attached) also indicates that these records failed. I am not concerned about these 7% records failed or skipped. However, since order of the gtc records is critical, I wanted to check would failure at any record would result in an incorrect genotype for all subsequent records in the manifest? Or is it a true skip such that both manifest and gtc handler will skip the failed record and move on to the next record such that both chr annotations and genotype still will be in sync and correct in the output vcf.

Thank you.

jzieve commented 3 months ago

Reading through your log I see a lot of Failed to process entry for record 1180: invalid literal for int() with base 10: ''. Not sure where/why that is happening. Which reference genome are you using?

Regardless, the warnings/failed records are a "true skip" (i.e., it shouldn't impact records downstream). Though again, I would caution using different manifests for GTC creation and VCF creation. If they have the exact same number of loci and same order it should work in theory but I would imagine they don't unless they are a special manifest with only the additional annotations (SourceSeq, RefStrand, etc.).

jzieve commented 3 months ago

Thanks so much @jzieve for the prompt response. I agree on the issues of creating GTCs with the new manifest. My thought is to create GTCs with orignal manifest and then VCFs with the new manifest which may have fewer snps and updated annotations. I missed the point that GTCs do not have the loci_names encoded so the order is critical. So essentially, if I wish to use the new manifest to convert GTC>VCF, the new manifest should have the same snp set in the same exact order as the orignal manifest. The annotations (e.g. chrom coordinates, ref_strand etc) can be different but the snps can't be filtered out or ordering can't be changed between new and orignal manifest.

I want to use DRAGEN Array's gtc-to-vcf but it's not open source and has restrictions which cannot be bypassed. For instance, it requires bpm and csv_manifest for execution. So in my case to use a different bpm and csv manifest version would need to extensively tested. Additionally, for quite a few issues, it throws an error instead of a warning, no option to --skip-indels . This tool offers more flexibility because of the open source. I would want to switch to DRAGEN Array's gtc-to-vcf for it's updates and bug fixes if it was open source and had somewhere to discuss the issues.

Thanks a lot for the clarification. Appreciate it very much.

w.r.t DRAGEN Array vs. this tool, I understand the desire for open source but just a FYI, this tool is currently "semi-archived" (https://github.com/Illumina/GTCtoVCF?tab=readme-ov-file#semi-archived-state). Meaning, it won't be getting new development aside from critical bug fixes and will likely be fully archived at some point. techsupport@illumina.com should be able to take feedback as well as respond to issues and we hope to improve DRAGEN Array's gtc-to-vcf to make the VCF creation more seamless.

rajwanir commented 3 months ago

Thanks @jzieve. The reference genome is build37. The issue on the Failed to process entry for record 1180: invalid literal for int() with base 10: '' aligns with specific indel records where I am using an immitation SourceSeq since I don't have have the SourceSeq information for those records. I am fine with the skipping those records as long as it doesn't impact any downstream records.

That's great that impact of warnings/failed records will be contained to the specific records and genotype calls for any subsequent records would still be accurately recorded in the vcf.

In my case, I have quite a bit of old manifests/chips to deal with which do not have the RefStrand and SourceSeq columns in the orignal manifest. Some of them are custom chips too. So this seems the only feasible way forward I can think of is to use the orignal manifest (so the number of records and order is unchanged) and import the missing column from any updated manifest or external which has this information. In essence, I am creating a custom/special manifest just for this purpose keeping all the contraints in mind. My initial estimates are that I might be able to fill in the RefStrand and SourceSeq for most but not all of the records. Hence, confirmation on the impact of these failed records was important. Completely agree though. Would need to be cautious. I will test this strategy more and let you know if I run into additional questions.

Thanks a lot for your help.