The current allele encoding method records changes for each rare allele at every site,
allowing these alleles to be represented as integers from 1 to n in the GenotypeRecords, irrespective of the ALT/REF allele order per site.
Consider a site with REF = "T" and ALT = ["C", "A"]:
When 'C' and 'A' are rare, and 'T' is common, the record "REF T->C T->A" is stored
in the Sites struct.
If 'T' and 'A' are rare, with 'C' being common, the record "REF T->A" is
stored in Sites ('C' is not stored).
When 'T' is rare, 'C' is common, and 'A' has an allele count of 0,
only "REF" is recorded, disregarding the actual allele string/byte values of 'T' and 'C'.
In cases where two alleles are common (e.g., 'T' and 'C'), and 'A' is rare,
the record "REF T->A" is stored. A genotype with no rare allele record is then
ambiguous, because it could carry either of the common alleles.
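The four cases above appear to follow one rule: the literal token "REF" is always written, and each rare ALT allele adds one "REF->ALT" change. Below is a minimal Python sketch of that rule, inferred only from these examples. The function name `current_site_record` and its arguments are hypothetical and are not taken from the project's code.

```python
def current_site_record(ref, alts, rare):
    """Build the per-site record string described in the examples.

    ref  : REF allele string, e.g. "T"
    alts : ALT alleles in their original order, e.g. ["C", "A"]
    rare : set of alleles considered rare at this site (assumed input)
    """
    # Only rare ALT alleles leave a "REF->ALT" change in the record.
    # A rare REF, a common ALT, or an ALT with AC == 0 adds nothing.
    changes = [f"{ref}->{alt}" for alt in alts if alt in rare]
    return " ".join(["REF"] + changes)


ref, alts = "T", ["C", "A"]
print(current_site_record(ref, alts, rare={"C", "A"}))  # REF T->C T->A
print(current_site_record(ref, alts, rare={"T", "A"}))  # REF T->A
print(current_site_record(ref, alts, rare={"T"}))       # REF
print(current_site_record(ref, alts, rare={"A"}))       # REF T->A
```

Note that the second and fourth calls give the same record even though the common alleles differ, and the third call keeps neither 'T' nor 'C'.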
This method works well for current functions but presents challenges when converting data back to BCF format (see also #5):
Caveat 1: When the REF allele is the sole rare allele, the string/byte values of both the REF and common alleles are lost in tabular encoding. Although retrievable from the original VCF/BCF file, this is not convenient or ideal.
Caveat 2: With multiple common alleles, the common allele index cannot be inferred from the tabular encoding of rare genotypes.
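To make Caveat 2 concrete, the sketch below lists the genotypes a sample could carry when it has no rare allele record at a site. It assumes diploid, unphased genotypes, and the helper name `candidate_genotypes` is hypothetical.

```python
from itertools import combinations_with_replacement


def candidate_genotypes(common_alleles, ploidy=2):
    """List every unordered genotype built only from common alleles.

    These are the genotypes that remain possible for a sample with no
    rare allele record at the site (diploid by default, an assumption).
    """
    return ["/".join(g) for g in combinations_with_replacement(common_alleles, ploidy)]


print(candidate_genotypes(["C"]))       # ['C/C']
print(candidate_genotypes(["T", "C"]))  # ['T/T', 'T/C', 'C/C']
```

With a single common allele the genotype is fully determined; with two common alleles three candidates remain, which is why the common allele index cannot be recovered.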
To address these issues:
Store the REF allele and all ALT alleles with an allele count (AC) > 0. Rare alleles in GenotypeRecords would then correspond to an integer index based on the order of stored REF/ALT alleles. This approach resolves Caveat 1.
For sites with multiple common alleles, we wouldn’t explicitly indicate which alleles are common but would infer this by noting that the alleles in GenotypeRecords are rare. This approach doesn’t directly resolve Caveat 2 but signals sites with multi-common-allele issues.
One option at a multi-common-allele site is to let the first common allele stand in for all of them, and to warn users that the exported BCF may be inaccurate at such sites.
Additionally, we could offer options to filter out these sites for further clarity.
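The proposal can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the names `stored_alleles` and `export_common_allele` are hypothetical, indices are assumed to be 0-based with REF at index 0, and the allele counts are made-up example values.

```python
def stored_alleles(ref, alts, allele_counts):
    """Keep REF first, then every ALT allele with AC > 0, in ALT order."""
    return [ref] + [a for a in alts if allele_counts.get(a, 0) > 0]


def export_common_allele(stored, rare_indices):
    """Pick the allele to write for samples with no rare allele record.

    Common alleles are inferred as the stored alleles whose index never
    appears as rare. Returns (first common allele, multi_common_flag).
    """
    common = [a for i, a in enumerate(stored) if i not in rare_indices]
    if not common:
        raise ValueError("site has no common allele")
    return common[0], len(common) > 1


# Caveat 1 case: REF 'T' is the only rare allele and 'A' has AC == 0.
site1 = stored_alleles("T", ["C", "A"], {"T": 2, "C": 998, "A": 0})
print(site1)                             # ['T', 'C']
print(export_common_allele(site1, {0}))  # ('C', False)

# Caveat 2 case: 'T' and 'C' are both common and 'A' is rare.
site2 = stored_alleles("T", ["C", "A"], {"T": 500, "C": 495, "A": 5})
print(site2)                             # ['T', 'C', 'A']
print(export_common_allele(site2, {2}))  # ('T', True)
```

In the first case both allele strings survive, so the site can be written back to BCF. In the second case the returned flag marks the site as multi-common, so an exporter could emit a warning or drop the site when filtering is requested.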