facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

v2023_02 tarballs have fewer structures than listed in metadata-rc2.parquet #552

Open Khalimat opened 1 year ago

Khalimat commented 1 year ago

Dear all,

I downloaded the structures from the release v2023_02 with TM and pLDDT > 70 (71 .tar.gz files). The tarballs contain 30 173 418 structures.
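For anyone who wants to reproduce the tarball count, here is a minimal sketch of how it could be done. It assumes the downloaded `.tar.gz` files sit in one directory and that each structure is stored as a single `.pdb` member. The function name `count_structures` and the `.pdb` suffix are illustrative assumptions, not something taken from the ESM repository.

```python
import tarfile
from pathlib import Path


def count_structures(tarball_dir, suffix=".pdb"):
    """Count regular files ending in `suffix` across all .tar.gz files in a directory.

    Assumption: one structure per archive member, named with `suffix`.
    """
    total = 0
    for tarball in sorted(Path(tarball_dir).glob("*.tar.gz")):
        with tarfile.open(tarball, mode="r:gz") as archive:
            # Iterating the archive streams member headers without extracting files.
            for member in archive:
                if member.isfile() and member.name.endswith(suffix):
                    total += 1
    return total
```

Calling `count_structures("path/to/tarballs")` on the download directory should give a number comparable to the one quoted above, provided the naming assumption holds.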

I wanted to know the specific scores for these structures, so I opened the metadata-rc2.parquet file, filtered it on the scores, and then selected the structures that are in the v2023_02 release of the ESM Atlas (as far as I am aware, it should contain all MGnify90_2023_02 proteins).

import pandas as pd

# Load the Atlas metadata and keep entries with pTM and pLDDT >= 0.7
esm_metadata = pd.read_parquet('metadata-rc2.parquet')
esm_metadata_f = esm_metadata[esm_metadata['ptm'] >= 0.7]
esm_metadata_f = esm_metadata_f[esm_metadata_f['plddt'] >= 0.7]
# Keep only entries whose sequence_dbs value includes MGnify90_2023_02
esm_metadata_f_v2 = esm_metadata_f[esm_metadata_f['sequence_dbs'].isin(['MGnify90_2022_05, MGnify90_2023_02', 'MGnify90_2023_02'])]
esm_metadata_f_v2.shape
(203052430, 10)

So, while the metadata lists 203052430 high-quality structures in the second release, it seems I can download only a small fraction of them (~15%). I also tried filtering with > 0.7 instead of >= 0.7, but that barely changed the count.
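To see which entries are actually missing rather than only comparing totals, the filtered metadata IDs could be matched against the tarball member names. The sketch below is a rough illustration under two assumptions that I have not verified: that the metadata has an ID column (for example `esm_metadata_f_v2['id']`) and that each tarball member is named `<ID>.pdb`. The helpers `member_to_id` and `coverage` are hypothetical names.

```python
from pathlib import PurePosixPath


def member_to_id(member_name):
    """Map a member name such as 'dir/SOME_ID.pdb' to 'SOME_ID' (assumed naming scheme)."""
    return PurePosixPath(member_name).stem


def coverage(metadata_ids, tarball_member_names):
    """Summarise how many of the expected IDs are present among the tarball members."""
    expected = set(metadata_ids)
    found = {member_to_id(name) for name in tarball_member_names}
    present = expected & found
    return {
        "expected": len(expected),
        "downloaded": len(present),
        "missing": len(expected - found),      # listed in metadata, not in tarballs
        "unexpected": len(found - expected),   # in tarballs, not in filtered metadata
        "fraction": len(present) / len(expected) if expected else 0.0,
    }


# Fraction implied by the two totals quoted in this post
print(f"{30_173_418 / 203_052_430:.1%}")  # → 14.9%
```

Holding a couple of hundred million IDs in Python sets needs a lot of memory, so for the full dataset it may be more practical to run the comparison per tarball or on a sample.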

Could you please let me know if it would be possible to download all high quality structures listed in the metadata?

Thank you!

Melarok commented 8 months ago

Did you ever find a solution? I'm currently facing the same issue.

Khalimat commented 4 months ago

Nope, but I did not try to download the files again...