facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

stats.parquet file for ESM Metagenomic Atlas has duplicate entries #376

Closed tomgoddard closed 1 year ago

tomgoddard commented 2 years ago

Here is an example of duplicate entries in the stats.parquet file that share the same MGnify id. They are identical except for their ptm values. I don't think there should be duplicates, since I believe each MGnify id has only one structure prediction in the atlas.

                         id    ptm  plddt  num_conf  len                                                                                 
521608462  MGYP000000011531  0.679  0.842        28   33                                                                                 
577478634  MGYP000000011531  0.490  0.842        28   33                                                                                 
577528622  MGYP000000011531  0.622  0.842        28   33  
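
For anyone wanting to reproduce this check, here is a minimal sketch using pandas. It assumes the column names shown in the output above (`id`, `ptm`, and so on); the helper name `find_duplicate_ids` is made up for illustration, and the small DataFrame just re-creates the three rows above rather than loading the real file.

```python
import pandas as pd


def find_duplicate_ids(stats: pd.DataFrame, id_col: str = "id") -> pd.DataFrame:
    """Return every row whose id occurs more than once, sorted by id."""
    # keep=False marks all members of a duplicate group, not just the repeats.
    dup_mask = stats.duplicated(subset=id_col, keep=False)
    return stats[dup_mask].sort_values(id_col)


# Rows copied from the example above. On the real file one would instead load
# it with pd.read_parquet("stats.parquet"), which needs pyarrow or fastparquet.
stats = pd.DataFrame(
    {
        "id": ["MGYP000000011531"] * 3,
        "ptm": [0.679, 0.490, 0.622],
        "plddt": [0.842] * 3,
        "num_conf": [28] * 3,
        "len": [33] * 3,
    },
    index=[521608462, 577478634, 577528622],
)

dups = find_duplicate_ids(stats)

# Count how many distinct ptm values each duplicated id carries.
n_ptm_values = dups.groupby("id")["ptm"].nunique()

print(dups)
print(n_ptm_values)
```

On the full file the same two steps would show how many ids are repeated and whether the repeats differ only in ptm.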

The stats.parquet file I used is

https://dl.fbaipublicfiles.com/esmatlas/v0/stats.parquet

The link to the stats.parquet file

https://dl.fbaipublicfiles.com/esmatlas/v0.0/stats.parquet

on the Atlas API web page

https://esmatlas.com/about#api

is broken; it gives an Access Denied error.

tomsercu commented 1 year ago

Thanks for flagging that broken link; it's fixed now. cc @ebetica for fixing the stats.parquet file: this one is based on an early version of the database, from before the missing entries were filled in and the duplicates were removed.

tomgoddard commented 1 year ago

Do you have an idea when the stats.parquet file that corresponds to the current online ESM database will be available?

tomsercu commented 1 year ago

Since Zeming is out of office, I took this over. The stats file is now updated with the complete, non-redundant set of keys:

Just in case anyone would want to reference it, the old file is copied to stats.old_bk.parquet under the same basepath.

Let me know if you encounter any further issues, closing this in the meantime. Will resolve #366 as well based on this file. Thank you again for flagging those issues!!

tomgoddard commented 1 year ago

Thanks! My main interest in stats.parquet was to get the list of MGnify identifiers for the database so that I could create a file of all the sequences for searching. You provided the sequence file in #366, which has solved that problem. But I may still use stats.parquet to filter by model scores.
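
As a rough sketch of that kind of filtering, assuming the `id`, `plddt`, and `ptm` columns seen earlier in this thread: the function name, the cutoff values, and the placeholder ids below are all hypothetical and only illustrate the idea.

```python
import pandas as pd


def ids_passing_scores(
    stats: pd.DataFrame, min_plddt: float = 0.7, min_ptm: float = 0.5
) -> list[str]:
    """Return the unique ids whose plddt and ptm both meet the given cutoffs."""
    keep = (stats["plddt"] >= min_plddt) & (stats["ptm"] >= min_ptm)
    # drop_duplicates guards against repeated ids like those in the old file.
    return stats.loc[keep, "id"].drop_duplicates().tolist()


# Toy table with made-up placeholder ids. On the real file one would load it
# with pd.read_parquet("stats.parquet"), which needs pyarrow or fastparquet.
toy_stats = pd.DataFrame(
    {
        "id": ["EXAMPLE_A", "EXAMPLE_B", "EXAMPLE_C"],
        "plddt": [0.842, 0.650, 0.910],
        "ptm": [0.679, 0.800, 0.300],
    }
)

selected = ids_passing_scores(toy_stats)
print(selected)
```

The cutoffs are parameters rather than constants so that different confidence levels can be tried without changing the function.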

tomsercu commented 1 year ago

Ah yes happy we have the right data in place now! :)