Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0

better documentation on `--unsquash-duplicates` #4

Closed stephenturner closed 6 years ago

stephenturner commented 6 years ago

Looked at the code but couldn't grok what this flag is doing.

https://github.com/Illumina/GTCtoVCF/blob/27bbb1f99d341489aac6754c31864e6333219995/LocusEntryFactory.py#L68

Add slightly more documentation in the printed help, possibly an example here?

Thanks 🙇

KelleyRyanM commented 6 years ago

Hi Stephen, I'll try to add some additional detail in this issue and then update the README.md based on any areas of insufficient detail.

In the manifest, there can be cases where the same variant is probed by multiple different assays. These assays may be the same design or alternate designs for the same locus. These types of cases are the duplicates referenced in the above option. In the default mode of operation, these duplicates will be "squashed" into a single record in the VCF. The method used to incorporate information across multiple assays is described in the README under the "Output description" heading.

When the "--unsquash-duplicates" option is provided, this "squashing" behavior is disabled, and each duplicate assay will be reported in a separate entry in the VCF file. This option is helpful when you are interested in investigating or validating the performance of individual assays, rather than trying to generate genotypes for specific variants.
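To make the difference concrete, here is a minimal Python sketch of the two modes as described above. It is not the GTCtoVCF implementation: the record layout, the `group_records` helper, and grouping by `(chrom, pos)` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical assay records: (assay_name, chrom, pos, genotype).
# Two assays probe the same locus on chromosome 1.
assays = [
    ("rs123_dup1", "1", 1000, "A/G"),
    ("rs123_dup2", "1", 1000, "A/G"),
    ("rs456", "2", 2000, "C/C"),
]


def group_records(assays, unsquash_duplicates=False):
    """Return a list of assay groups; each group would become one VCF record.

    The (chrom, pos) grouping key is an assumption for this sketch.
    """
    if unsquash_duplicates:
        # Every assay is reported on its own.
        return [[assay] for assay in assays]
    # Default: duplicates at the same locus are "squashed" together.
    by_locus = defaultdict(list)
    for assay in assays:
        by_locus[(assay[1], assay[2])].append(assay)
    return list(by_locus.values())


squashed = group_records(assays)
unsquashed = group_records(assays, unsquash_duplicates=True)
print(len(squashed), len(unsquashed))
```

In the default mode the two `rs123` assays end up in one group; with unsquashing, each of the three assays gets its own entry.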

Does that help clarify the expected behavior and use case?

stephenturner commented 6 years ago

Ah, mea culpa - I neglected to revisit the README and relied only on the printed help. This actually becomes important. I'm using a custom design that has redundancy in quadruplicate on both strands for highly important SNPs. That is, we may have 8 designs for a single SNP. However, half of these should yield a genotype on one strand, e.g., A/G for a heterozygote, while the other half might report T/C. I'm assuming this is handled properly for squashed duplicates? Finally, what happens if 8 are consistent but only 1 is inconsistent? Being a custom product, that may actually be the case: we've noticed some assays consistently underperform and have tried to clean these up, but may not have gotten them all.

Stephen

KelleyRyanM commented 6 years ago

Hi Stephen, scenarios where there are more than two designs, designs on different strands, or a combination of Infinium I and Infinium II designs are handled. The expected behavior in this case is that if there is at least one discrepant call among the replicates, then a no-call will be assigned in the VCF. The discrepancy must be a specific conflicting genotype; a replicate that is merely a no-call does not count as a discrepancy. This strategy is intended to optimize accuracy over call rate for these duplicate scenarios.
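The rule described here can be sketched in a few lines of Python. This is an illustration based only on the description in this thread, not the code in the repository: the `combine_replicates` and `to_forward` helpers, the tuple representation of genotypes, and the `"+"`/`"-"` strand labels are all hypothetical.

```python
NO_CALL = None  # stand-in for a replicate that produced no genotype

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_forward(genotype, strand):
    """Express a genotype (tuple of alleles) on the plus strand, sorted."""
    if genotype is NO_CALL:
        return NO_CALL
    if strand == "-":
        genotype = tuple(COMPLEMENT[allele] for allele in genotype)
    return tuple(sorted(genotype))


def combine_replicates(calls):
    """calls: list of (genotype, strand) pairs for one variant.

    Returns the shared genotype, or NO_CALL if two replicates report
    different specific genotypes. Replicates that are themselves
    no-calls are ignored rather than treated as discrepancies.
    """
    observed = {to_forward(genotype, strand) for genotype, strand in calls}
    observed.discard(NO_CALL)
    if len(observed) == 1:
        return observed.pop()
    return NO_CALL  # conflicting genotypes, or nothing was called


# A/G on one strand and T/C on the other agree after complementing.
agree = combine_replicates([(("A", "G"), "+"), (("T", "C"), "-")])
# A single conflicting genotype forces a no-call for the variant.
conflict = combine_replicates([(("A", "G"), "+"), (("A", "G"), "+"), (("A", "A"), "+")])
```

Under these assumptions, the eight-design case with mixed strands resolves to a single genotype, while one discordant replicate is enough to turn the combined call into a no-call.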

stephenturner commented 6 years ago

Thanks. Looks like I'll need to carefully review the high-prio SNPs we care about the most for duplicate discordance.

KelleyRyanM commented 6 years ago

For what it's worth, the previously discussed loci filtering will execute before any aggregation, so there is an opportunity to trim these out before they impact the variant call in the VCF.
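A small Python sketch of why the ordering matters, under the assumption that excluded assays are simply dropped before replicates are combined. The assay names, the `excluded` set, and the `aggregate` helper are hypothetical and only mimic the behavior described in this thread.

```python
def aggregate(calls):
    """Combine replicate genotypes: one distinct genotype wins, else no-call (None)."""
    genotypes = {genotype for genotype in calls.values() if genotype is not None}
    return genotypes.pop() if len(genotypes) == 1 else None


# Three replicates of one SNP; the third is a known underperformer.
calls = {"snp1_rep1": "A/G", "snp1_rep2": "A/G", "snp1_rep3": "A/A"}
excluded = {"snp1_rep3"}

# Without filtering, the discordant replicate forces a no-call.
unfiltered = aggregate(calls)

# Filtering first removes the bad assay, so the remaining replicates agree.
kept = {name: genotype for name, genotype in calls.items() if name not in excluded}
filtered = aggregate(kept)
```

Here `unfiltered` is a no-call while `filtered` recovers the A/G genotype, which is the opportunity to trim problem assays mentioned above.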

KelleyRyanM commented 6 years ago

https://github.com/Illumina/GTCtoVCF/commit/e66cf7e2667ae61d968076782c5ba965773dcd5a