OpenBioML / protein-lm-scaling

Metrics for CL #68

Open Muedi opened 3 months ago

Muedi commented 3 months ago

Hi,

I had this lying around for some time, but wanted to finally open a draft request now.

I added a script with functions that read in a FASTA file and compute the Shannon entropy and KL divergence per sequence, based on the sequences in that file. It always builds a dict containing the frequencies of the AAs to work with. The frequencies in question are OVERALL (pooled over the whole file) and not based on an alignment. This was by choice, as I think it's much faster and I don't think aligning multiple millions of sequences is practicable :D
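For readers skimming the thread, here is a minimal sketch of the idea described above. It is not the script from this PR. All function names are hypothetical, and two details are assumptions on my side: entropy and KL are in bits (log base 2), and the KL divergence is taken from each sequence's own AA distribution to the overall distribution of the file.

```python
import math
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def frequencies(seq):
    """Return a dict mapping each residue in seq to its relative frequency."""
    counts = Counter(seq)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in counts.items()}


def shannon_entropy(freqs):
    """Shannon entropy in bits of a frequency dict."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes every residue in p also occurs in q."""
    return sum(pi * math.log2(pi / q[aa]) for aa, pi in p.items() if pi > 0)


def per_sequence_metrics(records):
    """Map each header to (entropy, KL vs. overall frequencies of all records)."""
    records = list(records)
    # Overall (non-aligned) background: pool every residue in the file.
    background = frequencies("".join(seq for _, seq in records))
    metrics = {}
    for header, seq in records:
        p = frequencies(seq)
        metrics[header] = (shannon_entropy(p), kl_divergence(p, background))
    return metrics
```

Usage would look roughly like `per_sequence_metrics(read_fasta(open("some.fasta")))`, where the file name is a placeholder. Because the background is pooled from the same file, every residue in a sequence has a nonzero background frequency, so the KL term never divides by zero.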

Some old commits are shown as not integrated, I think because they were squashed into one the last time they were merged. I kept everything as is, because there are some changes in the scripts folders (unifying scripts and script.py).

I also planned to write a function that takes the UniProt accessions from the FASTA files and gets the PPL metrics of AF2 from Google Cloud, but I did not have an example FASTA file.

Best, Max