bguo068 / ishare

a Rust crate designed to facilitate the analysis of rare-variant sharing and identity-by-descent (IBD) sharing.
MIT License

update allele encoding #9

Open bguo068 opened 1 year ago

bguo068 commented 1 year ago

The current allele encoding records a change for each rare allele at every site, so that rare alleles can be represented as integers from 1 to n in GenotypeRecords, regardless of the ALT/REF allele order at that site.

Consider a site with REF = "T" and ALT = ["C", "A"]:

  1. When 'C' and 'A' are rare and 'T' is common, the record "REF T->C T->A" is stored in the Sites struct.
  2. When 'T' and 'A' are rare and 'C' is common, the record "REF T->A" is stored in Sites ('C' is not stored).
  3. When 'T' is rare, 'C' is common, and 'A' has an allele count of 0, only "REF" is recorded, disregarding the actual allele string/byte values of 'T' and 'C'.
  4. When two alleles are common (e.g., 'T' and 'C') and 'A' is rare, the record "REF T->A" is stored. A genotype with no rare-allele record is then ambiguous, because it could carry either of the common alleles.
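The four cases above can be reproduced with a small sketch. This is not the code in ishare: `encode_site`, the `max_rare_count` threshold, and the allele counts are all hypothetical, and the sketch assumes the record always starts with "REF" followed by one "REF->ALT" entry per rare ALT allele that has a nonzero count.

```rust
/// Hypothetical helper (not part of ishare's API): build the per-site record
/// described above. `counts[0]` is the REF allele count and `counts[i + 1]`
/// is the count of `alts[i]`.
fn encode_site(
    ref_allele: &str,
    alts: &[&str],
    counts: &[u32],
    max_rare_count: u32, // assumed definition of "rare": 0 < count <= max_rare_count
) -> Vec<String> {
    assert_eq!(counts.len(), alts.len() + 1, "need one count per allele");
    // Assumption: "REF" is always the first entry, whether REF is rare or common.
    let mut record = vec!["REF".to_string()];
    for (alt, &count) in alts.iter().zip(&counts[1..]) {
        // Common ALT alleles and ALT alleles with a count of 0 are not stored.
        if count > 0 && count <= max_rare_count {
            record.push(format!("{}->{}", ref_allele, alt));
        }
    }
    record
}

fn main() {
    // Made-up allele counts in the order [T, C, A], one row per case above.
    let cases: [[u32; 3]; 4] = [[90, 6, 4], [3, 95, 2], [3, 97, 0], [50, 48, 2]];
    for counts in cases.iter() {
        let record = encode_site("T", &["C", "A"], counts, 10);
        println!("{:?} => {}", counts, record.join(" "));
    }
}
```

In this sketch, cases 2 and 4 produce the same record, "REF T->A", even though the common alleles differ. The record alone therefore does not say which common allele a genotype without a rare-allele entry carries.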

This method works well for current functions but presents challenges when converting data back to BCF format (see also #5):

To address these issues:

bguo068 commented 1 year ago

Updated allele encoding in https://github.com/bguo068/ishare/commit/4a6b17f6683df81cb2b251a1c64cbd5cd9283c8b