OpenBioML / protein-lm-scaling

Create evaluation utility to compute residue conservation from MSA #61

Open jeffreyruffolo opened 8 months ago

jeffreyruffolo commented 8 months ago

Conservation of amino acids in a multiple sequence alignment (MSA) is an indicator of functional importance. In lieu of experimental functional assays, one way to evaluate the design capabilities of our model is to measure how likely it is to generate sequences with the correct functional residues.

Toward this goal, we need a utility that identifies and quantifies the conservation of particular amino acids in a sequence given an MSA. Given a query sequence and an MSA, the utility should compute some measure of conservation (e.g., entropy of the amino acid distribution) for each alignment position aligned to the query.

Accounting for alignment depth at each position would be a nice-to-have feature, for example by reporting NaN/None for positions whose depth falls below some threshold.
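
As a starting point for discussion, here is a minimal sketch of what such a utility might look like. Everything in it is an assumption rather than an agreed design: the function name `conservation_per_position`, the `min_depth` parameter, the choice to pass the query as an already-aligned row of the MSA, and the score itself (one minus Shannon entropy normalized by log 20, so 1.0 means fully conserved). Depth is taken to be the number of rows with a standard amino acid in the column; gaps and non-standard characters are ignored.

```python
import math
from collections import Counter

# Assumption: the 20 standard amino acids; anything else (gaps, X, B, Z) is ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP_CHARS = "-."


def conservation_per_position(query, msa, min_depth=1):
    """Hypothetical helper: one conservation score per non-gap query position.

    query:     the query as an aligned row (same length as every MSA row).
    msa:       list of aligned sequences (strings of equal length).
    min_depth: columns with fewer non-gap residues than this yield None.
    """
    if any(len(row) != len(query) for row in msa):
        raise ValueError("all MSA rows must have the same length as the query")

    scores = []
    for col, query_residue in enumerate(query):
        if query_residue in GAP_CHARS:
            continue  # column is not aligned to any query residue

        # Count standard amino acids observed in this column.
        counts = Counter(
            row[col].upper() for row in msa if row[col].upper() in AMINO_ACIDS
        )
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            scores.append(None)  # too shallow to estimate a distribution
            continue

        # Shannon entropy (natural log) of the column's residue frequencies.
        entropy = -sum((n / depth) * math.log(n / depth) for n in counts.values())
        # Normalize so that 1.0 = fully conserved, 0.0 = uniform over 20 residues.
        scores.append(1.0 - entropy / math.log(len(AMINO_ACIDS)))
    return scores


# Usage on a toy alignment (made-up sequences, for illustration only).
toy_msa = ["ACD", "ACE", "A-F", "AC-"]
print(conservation_per_position("ACD", toy_msa, min_depth=2))
```

Returning `None` keeps the output a plain list; if the scores are going into NumPy arrays, `float("nan")` may be the more convenient sentinel. Sequence weighting and pseudocounts are deliberately left out of this sketch.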