DeepGraphLearning / ProtST

[ICML-23 ORAL] ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
Apache License 2.0

Extract dataset takes too long #4

Closed CryoSky closed 1 year ago

CryoSky commented 1 year ago

Hello, it's very nice to see this work. I'm trying to run the function annotation task. However, after I ran run_downstream.py, dataset building started, but the progress bar for "Constructing proteins from pdbs" got stuck at around 92%, and the estimate says it will take more than 10 days to finish. Could you share how to construct the proteins more quickly? Thank you very much!

KatarinaYuan commented 1 year ago

Hi, thank you for your interest in our work. In our experience, downloading the downstream datasets should be fast (less than 1 hour) if you are downloading to a remote cluster. If you are downloading to a personal laptop, it may be worth checking your network speed. Could you specify which downstream dataset you are trying to download, so that we can reproduce the process and help more effectively?

KatarinaYuan commented 1 year ago

There is also a more direct way to control the downloading process with the wget command in a shell. Just manually download the dataset from its URL into the directory given by dataset.path, then try running the script again.

The URL of each dataset can be found either in our repo at ./ProtST-dev/protst/dataset.py or in torchdrug's repo at https://github.com/DeepGraphLearning/torchdrug/tree/master/torchdrug/datasets.

The dataset.path value is specified in the downstream configs (e.g., "./ProtST-dev/config/downstream_task/PretrainESM/annotation_tune.yaml").
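The manual download can be sketched as the shell snippet below. DATASET_URL and DATASET_DIR are placeholders: substitute the real URL from protst/dataset.py and the dataset.path value from your config.

```shell
# Sketch of the manual-download workaround. DATASET_URL and DATASET_DIR are
# placeholders, not values from the repo -- fill them in from protst/dataset.py
# and from the downstream config's dataset.path, respectively.
DATASET_DIR="${DATASET_DIR:-$HOME/protein-datasets}"
mkdir -p "$DATASET_DIR"

if [ -n "${DATASET_URL:-}" ]; then
    # -c resumes a partial download; -P saves the file under dataset.path
    wget -c -P "$DATASET_DIR" "$DATASET_URL"
fi
```

After the file lands in dataset.path, re-running the downstream script should skip the download step and proceed directly to dataset construction.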

CryoSky commented 1 year ago

Hello, thank you for the reply. I don't mean that downloading or unzipping the database file takes too long. Rather, there is a step that constructs the pdb files into a pkl file, and that step takes a very long time. It finishes quickly on my server, but on my personal computer (with an NVIDIA 2080 Ti) the estimate is more than 10 days. If you have seen a similar issue, please let me know.
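Since the slow step is the one-time construction of a pickle cache, one practical workaround is to build that cache on the fast server and copy the resulting .pkl file into dataset.path on the slower machine. A minimal sketch of this load-or-construct pattern (the function and argument names here are illustrative, not the actual ProtST/torchdrug API):

```python
import os
import pickle

def load_or_construct(pdb_paths, cache_file, construct_fn):
    """Return parsed protein records, reusing a pickle cache when present.

    If cache_file was built once on a fast machine and copied here, the slow
    per-PDB construction step is skipped entirely.
    """
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    # The slow step: parse every PDB file into an in-memory record.
    records = [construct_fn(path) for path in pdb_paths]
    with open(cache_file, "wb") as f:
        pickle.dump(records, f)
    return records
```

Under this pattern, running the construction once on the server and then transferring the cache (e.g., with scp) to the laptop's dataset.path directory would let the downstream script start immediately.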

KatarinaYuan commented 1 year ago

Hi, it's good news that it runs quickly on servers. Given that, the issue may be specific to your local environment.

I'm sorry that I don't have a local environment with an NVIDIA 2080 Ti, so it may be hard for me to assist further. But please do let me know if there is anything else I can help with.

CryoSky commented 1 year ago

Hi @KatarinaYuan, thank you for the reply. I tested ProtST on the server and this issue is no longer a roadblock, so I will close this ticket. Thanks again!