blab / pathogen-embed

Create reduced dimension embeddings for pathogen sequences
https://pypi.org/project/pathogen-embed/
MIT License

Test alternate encodings of sequence data for PCA inputs #22

Closed huddlej closed 2 months ago

huddlej commented 3 months ago

Instead of encoding each nucleotide character as its own arbitrary integer, try either alternative binary encodings of these data or a method designed for categorical data, like Multiple Correspondence Analysis (MCA).

Revisiting McVean 2008, I confirmed that he used a binary encoding of biallelic genotypes as input for PCA with 0 for the reference allele and 1 for the alternate allele. As a first pass, we could test the biallelic approach with viruses by specifying a single reference sequence in an alignment, identifying alleles as 0 or 1 for reference or alternate, and reporting how often there are multiple alternate alleles. It is possible that we could encode any alternate allele as a 1, but that would throw away relevant distance information for the embedding.
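As a rough sketch of that first pass, the snippet below encodes an alignment against a single reference and counts sites with more than one alternate allele. The function name `biallelic_encode` and the toy sequences are hypothetical, and it assumes equal-length aligned strings where gaps and ambiguous characters simply count as alternate alleles, which we would probably want to handle differently in practice.

```python
import numpy as np


def biallelic_encode(sequences, reference):
    """Encode an alignment as 0 (matches reference) or 1 (any alternate allele).

    Also returns the number of sites with more than one distinct alternate
    allele, where the 0/1 encoding discards information.
    """
    alignment = np.array([list(seq) for seq in sequences])
    ref = np.array(list(reference))

    # Broadcast the reference across rows: 1 wherever a sequence differs.
    encoded = (alignment != ref).astype(int)

    # Count sites where the binary encoding merges distinct alternate alleles.
    multiallelic_sites = 0
    for column, ref_base in zip(alignment.T, ref):
        alternates = set(column) - {ref_base}
        if len(alternates) > 1:
            multiallelic_sites += 1

    return encoded, multiallelic_sites


# Toy alignment (made up for illustration).
sequences = ["ACGT", "ACGA", "TCGC", "ACGT"]
reference = "ACGT"
encoded, n_multi = biallelic_encode(sequences, reference)
print(encoded)
print(f"{n_multi} of {len(reference)} sites have multiple alternate alleles")
```

The resulting matrix could be passed to PCA as is, and the multiallelic count gives a quick sense of how much information the binary encoding throws away for a given virus.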

Alternatively, Stormo 2011 describes a simplex encoding for "maximally efficient modeling of DNA sequence motifs" that improves model fitting compared to one-hot encoding.
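To illustrate the idea, a simplex encoding places the four nucleotides at the vertices of a regular tetrahedron, so each site needs three numeric columns instead of four and all pairs of bases are equidistant. The sketch below is an assumption about how we might implement it: the specific base-to-vertex assignment, the `simplex_encode` name, and the choice to map gaps and ambiguous characters to the origin are not taken from the paper.

```python
import numpy as np

# Vertices of a regular tetrahedron; which base gets which vertex is an
# assumption made for illustration.
SIMPLEX = {
    "A": (1, -1, -1),
    "C": (-1, 1, -1),
    "G": (-1, -1, 1),
    "T": (1, 1, 1),
}

# Assumption: gaps and ambiguous characters sit at the centroid.
UNKNOWN = (0, 0, 0)


def simplex_encode(sequences):
    """Return an (n_sequences, 3 * n_sites) matrix of simplex coordinates."""
    return np.array(
        [
            [value for base in seq.upper() for value in SIMPLEX.get(base, UNKNOWN)]
            for seq in sequences
        ],
        dtype=float,
    )


X = simplex_encode(["ACGT", "ACGN"])
print(X.shape)
```

Because every pair of vertices is the same Euclidean distance apart, no substitution is treated as closer than another, unlike an integer encoding.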

A third option is to use MCA, which works by one-hot encoding the categorical values for us and then running a correspondence analysis. The Prince package includes an implementation of MCA.
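To make those two steps concrete, here is a minimal hand-rolled sketch using only NumPy: it one-hot encodes each alignment column and then runs a correspondence analysis via an SVD of the standardized residuals. This is not Prince's implementation, and Prince may scale or correct the results differently; the function names and toy sequences are hypothetical.

```python
import numpy as np


def one_hot(sequences):
    """One indicator column per (site, observed allele) pair."""
    alignment = np.array([list(seq) for seq in sequences])
    columns = []
    for site in alignment.T:
        for allele in np.unique(site):
            columns.append((site == allele).astype(float))
    return np.column_stack(columns)


def mca_row_coordinates(sequences, n_components=2):
    """Correspondence analysis of the indicator matrix (a basic form of MCA)."""
    Z = one_hot(sequences)
    P = Z / Z.sum()        # correspondence matrix
    r = P.sum(axis=1)      # row masses
    c = P.sum(axis=0)      # column masses

    # Standardized residuals from the independence model r * c^T.
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)

    U, s, _ = np.linalg.svd(S, full_matrices=False)

    # Principal row coordinates: D_r^{-1/2} U diag(s).
    return (U * s)[:, :n_components] / np.sqrt(r)[:, None]


# Toy alignment (made up for illustration).
coords = mca_row_coordinates(["ACGT", "ACGA", "TCGC", "ACGT"])
print(coords)
```

Identical sequences land on identical coordinates, and invariant sites contribute nothing to the residuals, so they do not affect the embedding.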