Heisenburger2020 / Vabs-Net

Vabs-Net: Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains

GET pretrain dataset #1

Closed AlfredTheBest closed 2 months ago

AlfredTheBest commented 2 months ago

Hi, how can we get the 130K pre-training structure dataset?

Heisenburger2020 commented 2 months ago

Thank you for your attention. The pre-training dataset is too large (over 1 TB, as I remember), so we did not upload it. You can reconstruct the pre-training dataset as follows:

  1. Download all mmCIF files from RCSB and AFDBv4. We read mmCIF files with scripts from Uni-Fold (https://github.com/dptech-corp/Uni-Fold/blob/main/scripts/chain_label_from_mmcif.py); the same repo also includes scripts for downloading all mmCIFs (https://github.com/dptech-corp/Uni-Fold/tree/main/scripts/download).
  2. Filter out structures with resolution greater than 9 Å or pLDDT lower than 70.
  3. Use pdbfixer to fix missing atoms.
  4. Be careful about atoms with coordinates equal to (0, 0, 0); delete these atoms.
  5. Delete chains missing more than 20% of their residues. Delete protein chains shorter than 10 residues.
  6. Filter according to the ESM data processing procedure.
  7. Remove atoms that are less than 1 Å from any other heavy atom.
  8. Run ESM inference.
  9. Cluster with MMseqs2.

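Steps 4 and 7 can be sketched as a simple coordinate-based mask. This is a minimal illustration, not the authors' actual script: the function name, the mutual removal of both atoms in a clashing pair, and the exact thresholds are assumptions.

```python
# Hypothetical sketch of steps 4 and 7: drop atoms at the (0, 0, 0)
# placeholder coordinate, then drop atoms closer than 1 Angstrom to
# any other heavy atom. Names and details are illustrative only.
import numpy as np

def filter_atoms(coords: np.ndarray, min_dist: float = 1.0) -> np.ndarray:
    """Return a boolean mask over atoms that survive both filters."""
    coords = np.asarray(coords, dtype=float)
    # Step 4: coordinates exactly at the origin are usually
    # unresolved placeholders, not real positions.
    keep = ~np.all(coords == 0.0, axis=1)

    # Step 7: pairwise distances; atoms closer than min_dist to any
    # other kept atom are treated as clashes and removed too.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    d[~keep, :] = np.inf  # ignore atoms already dropped in step 4
    d[:, ~keep] = np.inf
    clash = d.min(axis=1) < min_dist
    return keep & ~clash

coords = np.array([
    [0.0, 0.0, 0.0],  # origin placeholder -> dropped by step 4
    [1.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],  # 0.5 A from the previous atom -> clash
    [5.0, 5.0, 5.0],
])
print(filter_atoms(coords).tolist())  # [False, False, False, True]
```

For real-size structures you would replace the dense pairwise-distance matrix with a KD-tree query, since the dense version is O(n²) in memory.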
Because the search for better structure data took a long time (several months) and produced the filtering steps above, our scripts ended up scattered across different places and different authors. It was hard to consolidate everything into one clean repo while I was an intern at DP Technology. Sorry about that.

Be careful with chain names that are a number or a number plus a letter. :)
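The caveat about numeric chain names matters when splitting a combined "PDB ID + chain" string: chain IDs like "1" or "1A" break any parser that assumes chains are a single letter. A minimal sketch, assuming the PDB ID occupies exactly the first four characters (the concatenated format itself is an assumption, not the authors' file layout):

```python
# Hypothetical parser for "<pdbid><chain>" strings. PDB IDs are exactly
# 4 characters, so splitting by position is safer than splitting by
# character class, which would misread numeric chains like "1" or "1A".
def split_pdb_chain(entry: str) -> tuple[str, str]:
    """Split e.g. '7abc1A' into ('7abc', '1A')."""
    pdb_id, chain = entry[:4], entry[4:]
    if len(pdb_id) < 4 or not chain:
        raise ValueError(f"cannot split {entry!r} into PDB ID and chain")
    return pdb_id, chain

print(split_pdb_chain("7abc1A"))  # ('7abc', '1A')
print(split_pdb_chain("1xyz2"))   # ('1xyz', '2')
```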

AlfredTheBest commented 2 months ago

Thanks for your kind response. We want to retrain the network, but the data processing takes as long as you said. So could you please publish the UniProt ID list for the data, or something similar?

Heisenburger2020 commented 2 months ago

I regret to say that I am busy with some other projects this week. However, I will try to gather the IDs and upload the statistics next week. Thank you for your understanding and patience.

Heisenburger2020 commented 2 months ago

This is the link to the folder storing the PDB IDs and AFDB IDs of our dataset: https://drive.google.com/drive/folders/1qb2f-i0E4hk_5KKTR6-0PY60XToSWr1m?usp=sharing

As I hinted before, I am no longer an intern at DP. This version of the dataset is the closest I can find to our final version; a few additional filters were applied after this. After I left, our final dataset was further modified for some other projects.

Also, our clustering command is:

```
mmseqs easy-cluster train.fasta clusterRes ./tmp --min-seq-id 0.4 -c 0.8 --cov-mode 1
```

Samples in the validation set are all experimentally determined structures, so their IDs are just pdbid + chain name. Samples in the training set come from AFDB and PDB, so their IDs look like cif or af2 + id + chain name. Validation samples do not have high similarity with the training set.
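The two ID conventions above can be disambiguated with a small parser. This is purely illustrative: the exact prefixes ("cif", "af2"), the absence of separators, and the field names are assumptions; check them against the files in the shared folder before relying on this.

```python
# Hypothetical parser for the dataset IDs described above:
# training IDs carry a source prefix ("cif" for PDB-derived entries,
# "af2" for AFDB entries), validation IDs are bare pdbid + chain.
# Prefixes and format are assumptions, not confirmed by the authors.
def parse_dataset_id(entry: str) -> dict:
    for prefix, source in (("cif", "pdb-training"), ("af2", "afdb-training")):
        if entry.startswith(prefix):
            return {"source": source, "id_and_chain": entry[len(prefix):]}
    # No prefix: validation set, experimental structures only.
    return {"source": "pdb-validation", "id_and_chain": entry}

print(parse_dataset_id("cif7abcA"))
print(parse_dataset_id("1xyzB"))
```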