akulikova64 / CNN_protein_landscape


dataset #2

Open liuzhelz opened 2 years ago

liuzhelz commented 2 years ago

Can you provide the training and validation sets (PDB IDs) used in the paper? @akulikova64 @danny305 @clauswilke

clauswilke commented 2 years ago

https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.18738/T8/8HJEF9

liuzhelz commented 2 years ago

> https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.18738/T8/8HJEF9

Thank you for your reply. I downloaded the dataset from your link; it contains 17,263 PDBs in total. However, your previous work (MutCompute) described 19,427 structures in its paper, and after processing, this paper reports only 16,569 structures. Neither number matches 17,263. What is the reason? @clauswilke

clauswilke commented 2 years ago

The filtering differs from our previous works. From the paper:

This set provided us with 19,427 distinct protein data bank (PDB) identifiers corresponding to structures with at least a 2.5 Å resolution. Next, we filtered down our dataset by using a 50% sequence similarity threshold at the protein chain level and removing structures where we could not add hydrogen atoms or partial charges in an automated fashion with PDB2PQR (v3.1.0) [5]. Finally, we removed any protein chains that had more than 50% sequence similarity to any structure in the PSICOV dataset [18]. The PSICOV dataset contains 150 extensively studied protein structures and we used it here as a hold-out test dataset to evaluate the CNN models (see below). The final dataset used for training the CNN models consisted of 16,569 chains.
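The three filtering steps quoted above (resolution cutoff, 50% redundancy reduction at the chain level, and removal of chains similar to the PSICOV hold-out set) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' actual pipeline: the record fields, the crude identity function (a stand-in for a real clustering tool such as CD-HIT or MMseqs2), and the toy data are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical chain record; field names are illustrative, not from the paper's code.
@dataclass
class Chain:
    pdb_id: str
    sequence: str
    resolution: float  # in Angstroms

def identity(a: str, b: str) -> float:
    """Crude position-wise sequence identity over the shorter length.
    A real pipeline would use proper alignment/clustering instead."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def filter_chains(chains, psicov_seqs, max_res=2.5, sim_cut=0.5):
    # 1) keep only structures within the resolution cutoff
    kept = [c for c in chains if c.resolution <= max_res]
    # 2) greedy 50% sequence-similarity redundancy reduction at the chain level
    nonredundant = []
    for c in kept:
        if all(identity(c.sequence, k.sequence) <= sim_cut for k in nonredundant):
            nonredundant.append(c)
    # 3) remove chains similar to any PSICOV hold-out sequence
    return [c for c in nonredundant
            if all(identity(c.sequence, p) <= sim_cut for p in psicov_seqs)]

chains = [
    Chain("1ABC", "MKTAYIAKQR", 1.8),
    Chain("2DEF", "MKTAYIAKQR", 2.0),   # redundant with 1ABC -> dropped in step 2
    Chain("3GHI", "GSHMLEDPVA", 3.1),   # fails the resolution cutoff -> dropped in step 1
    Chain("4JKL", "QWERTYKLMN", 2.4),   # matches a PSICOV sequence -> dropped in step 3
]
final = filter_chains(chains, psicov_seqs=["QWERTYKLMN"])
print([c.pdb_id for c in final])  # ['1ABC']
```

The point of the sketch is that each step removes a different slice of the data, so counts taken at different stages (19,427 vs 17,263 vs 16,569) can all be consistent with one pipeline.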

liuzhelz commented 2 years ago

OK, I understand that part. However, the link provides 17,263 PDBs, while the paper describes training the model on 16,569. What is the reason for this difference? @clauswilke

clauswilke commented 2 years ago

Ah, that's a question for @danny305. I would suspect the 17,263 structures are from before filtering for similarity to the PSICOV dataset.

liuzhelz commented 2 years ago

Thank you for your reply. I think you may be right.

liuzhelz commented 2 years ago

I have reviewed the dataset-construction method in your previous work (MutCompute) and found it similar to the one in this paper: both filter by 2.5 Å resolution and 50% sequence similarity, yet the resulting dataset sizes differ. Are there subtle differences I didn't notice? The following is the description from the MutCompute paper: @clauswilke

Additionally, deposited crystallographic structures are refined by algorithms of their time which are not necessarily the current state of the art. To improve data set composition and uniformity, we gathered all PDB structures with less than 2.5 Å resolution and at most 50% sequence similarity and drew from structures in the PDB-REDO database, where existing protein structures are refined in a uniform manner [13]. These two changes in data consistency resulted in 19,436 structures for training with 300 of these structures held for out-of-sample testing and increased wild-type prediction accuracy to 63%.