Closed jrm5100 closed 3 years ago
For now, a DataFrame will be passed as a parameter which includes for each variant:
Alpha values will be used for encoding, matching by variant ID.
Ref Allele and Alt allele must match. If needed, they may be swapped with a warning printed.
Minor Allele Frequency will be used in the future, potentially having some required similarity.
From PLATO docs:
Elastic Data-driven Genetic Encoding (EDGE)
This weighted encoding is a hybrid between the traditional encodings and the codominant encoding. For each marker, the result from a univariate model (with appropriate covariates) is used to determine a heterozygous value from marker state to the set {0, x, 1}, where x is chosen such that the model with the encoded allele is identical to the codominant model. Then, this encoded allele is used in the multivariate models. Note that in the univariate and non-interaction case, this encoding is identical to the codominant encoding, but in the case of interactions, incurs fewer degrees of freedom.