facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

metadata dataframe content, add sequence crc64 #350

Open tomsercu opened 2 years ago

tomsercu commented 2 years ago

Discussed in https://github.com/facebookresearch/esm/discussions/340

Originally posted by **igortru** November 4, 2022

Please add a protein sequence crc64 column to the metadata file (or any other hash, md5 for example). It would make it easy to map sequences between different databases: EBI-EMBL, AlphaFold, GenBank, MGnify, etc. As a template, you can take the AlphaFold meta-information table in GCP: https://github.com/deepmind/alphafold/blob/main/afdb/README.md

A mapping file between GenBank and AlphaFold can be found at https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AlphaFold2NR.map.gz

- id is the MGnify ID
- ptm is the predicted TM score
- plddt is the predicted average lddt
- num_conf is the number of residues with plddt > 0.7
- len is the total residues in the protein
- crc64 is the proposed sequence checksum, e.g. computed with `from crc64iso.crc64iso import crc64`

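
To illustrate how a hash column would let sequences be matched across databases, here is a minimal sketch using only the Python standard library. It uses md5 (the alternative hash mentioned above) rather than crc64. The record IDs, sequences, and the helper names `sequence_md5`, `add_hash_column`, and `join_on_hash` are hypothetical and only serve the example; the metadata column names are the ones listed above.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """Hex MD5 digest of a sequence after removing whitespace and upper-casing."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def add_hash_column(rows, sequences, column="md5"):
    """Return copies of the metadata rows with a sequence hash column added.

    `sequences` maps a row's "id" to its amino-acid sequence.
    """
    return [{**row, column: sequence_md5(sequences[row["id"]])} for row in rows]


def join_on_hash(left_rows, right_rows, column="md5"):
    """Pair rows from two tables that carry the same sequence hash."""
    index = {}
    for row in right_rows:
        index.setdefault(row[column], []).append(row)
    return [(left, right) for left in left_rows for right in index.get(left[column], [])]


# Made-up records, for illustration only.
esm_sequences = {"MGYP_example_1": "MKTAYIAKQR", "MGYP_example_2": "MSLNEQWERT"}
esm_rows = [
    {"id": "MGYP_example_1", "ptm": 0.91, "plddt": 0.88, "num_conf": 9, "len": 10},
    {"id": "MGYP_example_2", "ptm": 0.47, "plddt": 0.55, "num_conf": 2, "len": 10},
]
other_sequences = {"OTHER_DB_example": "mktayiakqr"}  # same protein, different casing
other_rows = [{"id": "OTHER_DB_example"}]

esm_with_hash = add_hash_column(esm_rows, esm_sequences)
other_with_hash = add_hash_column(other_rows, other_sequences)

for left, right in join_on_hash(esm_with_hash, other_with_hash):
    print(left["id"], "<->", right["id"], left["md5"])
```

Normalizing case and whitespace before hashing is a design choice of this sketch; any two databases being joined would need to agree on the same normalization.
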
igortru commented 2 years ago

Known problem with the AlphaFold/UniProt crc64 (see http://www0.cs.ucl.ac.uk/staff/d.jones/crcnote.pdf):

I have found ~500 sequence pairs in the AF universe that have the same crc64 but differ only in two positions, with a step of 8 between them. It looks like a very rare event.
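
A possible explanation for the "two positions, step 8" pattern can be sketched in a few lines. The sketch assumes the checksum is a reflected CRC-64 with the ISO polynomial x^64 + x^4 + x^3 + x + 1 and an initial value of 0; this is an assumption about the variant in use, and `crc64_iso` below is a hand-written bitwise version, not the `crc64iso` package. Because this polynomial has only five terms and spans 65 bits, a change of one bit in one byte can be cancelled by a change of four bits in the byte 8 positions later. In ASCII, A/Q differ by 0x10 and H/S differ by 0x1B, which is exactly such a pair. The example sequences are made up.

```python
# Reflected (bit-reversed) form of x^64 + x^4 + x^3 + x + 1.
# Assumption: this is the polynomial behind the crc64 discussed here.
POLY_REFLECTED = 0xD800000000000000


def crc64_iso(data: bytes, crc: int = 0) -> int:
    """Bitwise reflected CRC-64 with the ISO polynomial and initial value `crc`."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ POLY_REFLECTED
            else:
                crc >>= 1
    return crc


# Made-up sequences: seq_b differs from seq_a at indices 1 (A -> Q) and 9 (H -> S).
seq_a = "MAVLKDEFGHRT"
seq_b = "MQVLKDEFGSRT"
# seq_c differs from seq_a at index 1 only.
seq_c = "MQVLKDEFGHRT"

for name, seq in (("seq_a", seq_a), ("seq_b", seq_b), ("seq_c", seq_c)):
    print(name, seq, f"{crc64_iso(seq.encode('ascii')):016X}")

print("a/b collide:", crc64_iso(seq_a.encode("ascii")) == crc64_iso(seq_b.encode("ascii")))
print("a/c collide:", crc64_iso(seq_a.encode("ascii")) == crc64_iso(seq_c.encode("ascii")))
```

The collision depends only on the XOR difference between the two sequences, so under the same assumption it would occur for any pair with this substitution pattern, regardless of the surrounding residues.
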

The best decision would be to switch from crc64 to md5 in all 3D-S databases simultaneously, but I don't think we have that level of cooperation/synchronization between the interested parties. crc64 is the best choice for now.