HallLab / pandas-genomics

Pandas ExtensionDtypes for dealing with genomics data
BSD 3-Clause "New" or "Revised" License

Add EDGE encoding #3

Closed jrm5100 closed 3 years ago

jrm5100 commented 4 years ago

From PLATO docs:

Elastic Data-driven Genetic Encoding (EDGE)

This weighted encoding is a hybrid between the traditional encodings and the codominant encoding. For each marker, the result from a univariate model (with appropriate covariates) is used to determine a heterozygous value from marker state to the set {0, x, 1}, where x is chosen such that the model with the encoded allele is identical to the codominant model. Then, this encoded allele is used in the multivariate models. Note that in the univariate and non-interaction case, this encoding is identical to the codominant encoding, but in the case of interactions, incurs fewer degrees of freedom.
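A minimal sketch of how x could be derived, assuming a linear model with no covariates (the docs call for "appropriate covariates", which are omitted here for brevity). Under that assumption the codominant coefficients are just differences of genotype-group means, and matching the two models gives x = beta_het / beta_hom. The function names `edge_alpha` and `edge_encode` are hypothetical and are not part of PLATO or pandas-genomics.

```python
from statistics import mean


def edge_alpha(genotypes, phenotype):
    """Heterozygote weight x for one marker (hypothetical helper).

    genotypes: alt-allele counts per sample (0, 1 or 2).
    phenotype: numeric outcome per sample.
    Assumes a linear codominant model without covariates, so each
    coefficient is a genotype-group mean minus the reference-group mean.
    """
    groups = {0: [], 1: [], 2: []}
    for g, y in zip(genotypes, phenotype):
        groups[g].append(y)
    ref_mean = mean(groups[0])
    beta_het = mean(groups[1]) - ref_mean  # heterozygous effect
    beta_hom = mean(groups[2]) - ref_mean  # homozygous-alt effect
    # The encoded model y = b0 + b * E with E in {0, x, 1} reproduces the
    # codominant fit when b = beta_hom and b * x = beta_het.
    return beta_het / beta_hom


def edge_encode(genotypes, alpha):
    """Map genotype states 0/1/2 onto the set {0, alpha, 1}."""
    weights = {0: 0.0, 1: alpha, 2: 1.0}
    return [weights[g] for g in genotypes]


genotypes = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 1.5, 1.7, 3.0, 3.2]
alpha = edge_alpha(genotypes, phenotype)  # close to 0.25 for this toy data
encoded = edge_encode(genotypes, alpha)
```

The encoded column uses a single degree of freedom per marker, which is where the savings in interaction models come from.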

jrm5100 commented 3 years ago

For now, a DataFrame will be passed as a parameter, with one row per variant holding the fields used below: variant ID, alpha value, ref allele, alt allele, and minor allele frequency.

Alpha values will be used for encoding, matched by variant ID.
The ref and alt alleles must match. If needed, they may be swapped, with a warning printed. Minor allele frequency will be used in the future, potentially with some required similarity.
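The matching rules above could look roughly like the sketch below. The column names (`variant_id`, `alpha`, `ref`, `alt`), the function `match_alphas`, and the choices to drop variants missing from the alpha table and to raise on a true allele mismatch are all assumptions for illustration, not the pandas-genomics API.

```python
import warnings

import pandas as pd


def match_alphas(variants, alpha_table):
    """Attach alpha values to variants by ID and check alleles (hypothetical).

    variants:    DataFrame with columns variant_id, ref, alt
    alpha_table: DataFrame with columns variant_id, alpha, ref, alt
    """
    # Inner join on the variant ID; alpha-table alleles get an "_alpha" suffix.
    merged = variants.merge(alpha_table, on="variant_id", suffixes=("", "_alpha"))
    same = (merged["ref"] == merged["ref_alpha"]) & (merged["alt"] == merged["alt_alpha"])
    swapped = (merged["ref"] == merged["alt_alpha"]) & (merged["alt"] == merged["ref_alpha"])

    # Alleles that are neither identical nor swapped cannot be reconciled.
    mismatched = merged.loc[~(same | swapped), "variant_id"].tolist()
    if mismatched:
        raise ValueError(f"Alleles do not match for: {mismatched}")

    # Swapped alleles are tolerated, but flagged and reported with a warning.
    for variant_id in merged.loc[swapped, "variant_id"]:
        warnings.warn(f"{variant_id}: ref/alt alleles are swapped relative to the alpha table")
    return merged.assign(swapped=swapped)[["variant_id", "alpha", "swapped"]]


variants = pd.DataFrame(
    {"variant_id": ["rs1", "rs2"], "ref": ["A", "C"], "alt": ["G", "T"]}
)
alpha_table = pd.DataFrame(
    {"variant_id": ["rs1", "rs2"], "alpha": [0.25, 0.6], "ref": ["A", "T"], "alt": ["G", "C"]}
)
matched = match_alphas(variants, alpha_table)  # warns about rs2
```

The `swapped` flag is returned rather than acted on, so the caller decides how to flip the genotypes before applying the alpha value.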

jrm5100 commented 3 years ago