Closed AlfredTheBest closed 2 months ago
Thank you for your attention. The pre-training dataset is too large (larger than 1 TB, as I remember), so we did not upload it. You could do the following to reproduce the pre-training dataset:
Because the process of searching for better structure data took a long time (several months) and involved the filtering methods above, our scripts are scattered across different places and different authors' hands. It was hard to put everything into a single clean repo script while I was interning at DP Technology. Sorry about that.
Be careful with chain names that are a number or number+letter :)
Thanks for your kind response. We want to retrain the network, but the data processing is long, as you said. So could you please publish the UniProt ID list for the data, or something similar?
I regret to inform you that I am busy with some other projects this week. However, I will try to gather the IDs and upload the statistics next week. Thank you for your understanding and patience.
This is the link to the pdbids and afdb ids of our dataset: https://drive.google.com/drive/folders/1qb2f-i0E4hk_5KKTR6-0PY60XToSWr1m?usp=sharing
As I hinted before, I am no longer an intern at DP. This version of the dataset is as close to our last version as I could find; a few more filters were applied after this. After I left, the last version of the dataset was further modified for some other projects.
Also, our clustering command was: `mmseqs easy-cluster train.fasta clusterRes ./tmp --min-seq-id 0.4 -c 0.8 --cov-mode 1`
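For anyone rebuilding the splits: with the `clusterRes` prefix above, `mmseqs easy-cluster` writes its cluster assignments to `clusterRes_cluster.tsv`, one `representative<TAB>member` pair per line. A minimal sketch for grouping that file into clusters (the file name and format follow MMseqs2's output convention; the sequence names are whatever was in your `train.fasta`):

```python
from collections import defaultdict

def read_clusters(tsv_path):
    """Group an mmseqs easy-cluster TSV (rep<TAB>member per line) by representative."""
    clusters = defaultdict(list)
    with open(tsv_path) as fh:
        for line in fh:
            rep, member = line.rstrip("\n").split("\t")
            clusters[rep].append(member)
    return dict(clusters)

# For a clusterRes_cluster.tsv containing:
#   seqA	seqA
#   seqA	seqB
#   seqC	seqC
# read_clusters(...) returns {"seqA": ["seqA", "seqB"], "seqC": ["seqC"]}
```

Sampling at most one member per cluster is a common way to de-duplicate the training set after clustering at 40% identity.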
Samples in the validation set are all experimentally determined structures, so their ids are just pdbid + chain name. Samples in the training set come from AFDB and the PDB; their ids look like cif or af2 + id + chain name. Validation samples do not have high similarity to the training set.
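Since the thread warns that chain names can be a number or number+letter, naive parsing of the concatenated validation ids can go wrong. PDB ids are always exactly 4 characters, so splitting at that fixed boundary stays unambiguous. A sketch for the validation-set ids only (the exact separator in the training ids isn't specified in this thread, so I don't guess at it):

```python
def split_validation_id(vid: str):
    """Split a validation id (pdbid + chain name, concatenated) at the
    fixed 4-character PDB id boundary.

    This is safe even for tricky chain names like '1' or 'A1', since the
    PDB id length never varies."""
    if len(vid) < 5:
        raise ValueError(f"id too short to contain a pdbid and a chain: {vid!r}")
    return vid[:4], vid[4:]

# split_validation_id("7xyzA1") -> ("7xyz", "A1")
# split_validation_id("1abc1")  -> ("1abc", "1")
```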
Hi, how can we get the 130K pre-training structure dataset?