a-r-j / ProteinWorkshop

Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
https://proteins.sh/
MIT License

4v8m not found in raw directory. When processing go-bp dataset #96

Open yangzhang33 opened 2 months ago

yangzhang33 commented 2 months ago

Hello, when running the classification task on the go-bp dataset, I get this error:
FileNotFoundError: 4v8m not found in raw directory. Are you sure it's downloaded and has the format pdb?

This is with format=pdb (because mmtf doesn't work).

I checked the PDB site: it says that for large structures the PDB-format file is not available.

[Screenshot 2024-07-23 14 42 09]

Is there any way to work around this?

a-r-j commented 2 months ago

Hi @yangzhang33, thanks for flagging. This is a little tricky. I'd suggest removing that example from the dataset for now. If you're keen to include it, I think you can download the mmCIF file, extract the relevant chains and write them to a PDB file (e.g. using BioPandas). I don't think it's possible to get the structure in PDB format otherwise.
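
The "extract the relevant chains and write them to a PDB file" step could look roughly like the sketch below. It is not the BioPandas route suggested above: it swaps in a small standard-library parser so it runs without extra dependencies. The function name `mmcif_chains_to_pdb` and the `chain_map` argument are hypothetical, and the sketch assumes each `_atom_site` row sits on a single line (as in wwPDB-distributed files) and that the chains of interest are identified by `auth_asym_id`.

```python
import shlex


def mmcif_chains_to_pdb(mmcif_text, chain_map):
    """Return PDB-format text containing only the chains named in chain_map.

    chain_map (hypothetical helper argument) maps mmCIF author chain IDs,
    which may be multi-character (e.g. "AA"), to the single-character
    chain IDs that the PDB format requires.
    """
    columns, rows, in_atom_site = [], [], False
    for line in mmcif_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("_atom_site."):
            # Column names of the atom_site loop, in file order.
            columns.append(line.strip().split(".", 1)[1])
            in_atom_site = True
        elif in_atom_site:
            if line.startswith(("#", "_", "loop_")):
                break  # end of the atom_site loop
            # Assumption: one atom per line; shlex handles quoted names.
            rows.append(dict(zip(columns, shlex.split(line))))

    def clean(value):
        # mmCIF uses "." and "?" as placeholders for missing values.
        return "" if value in (".", "?") else value

    out, serial = [], 0
    for row in rows:
        chain = chain_map.get(row["auth_asym_id"])
        if chain is None:
            continue  # not one of the requested chains
        serial += 1  # renumber atoms so serials fit the 5-column field
        if serial > 99999:
            raise ValueError("selection exceeds 99999 atoms; too large for PDB")
        record = row["group_PDB"]
        element = clean(row["type_symbol"]).upper()
        name = row["label_atom_id"]
        if len(name) < 4 and len(element) < 2:
            name = " " + name  # PDB atom-name alignment convention
        alt = clean(row["label_alt_id"])
        resname = row["label_comp_id"]
        resseq = row["auth_seq_id"]
        icode = clean(row["pdbx_PDB_ins_code"])
        x, y, z = (float(row[k]) for k in ("Cartn_x", "Cartn_y", "Cartn_z"))
        occ = float(row["occupancy"])
        bfac = float(row["B_iso_or_equiv"])
        out.append(
            f"{record:<6}{serial:>5} {name:<4}{alt:1}{resname:>3} {chain:1}"
            f"{resseq:>4}{icode:1}   {x:8.3f}{y:8.3f}{z:8.3f}"
            f"{occ:6.2f}{bfac:6.2f}          {element:>2}"
        )
    out.append("END")
    return "\n".join(out) + "\n"
```

The mmCIF text itself can be fetched from the RCSB file server (for example https://files.rcsb.org/download/4V8M.cif) and the returned string written to `4v8m.pdb` in the raw directory. Which chain IDs to pass in `chain_map` depends on how the go-bp example refers to the chain, which this thread does not specify.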