Fenglei104 / DeepPROTACs

GNU General Public License v3.0

Providing preprocessing scripts #10

Closed diliadis closed 1 year ago

diliadis commented 1 year ago

Hello,

I was wondering if you could provide the scripts you used to start from the PROTAC-DB dataset and arrive at the version that can be used to train the model. I am not asking for the proprietary part of the data mentioned in the paper, just for the script used for the public one.

Thanks, Dimitris

Fenglei104 commented 1 year ago

Hello! I'm sorry, but I'm afraid I cannot provide the scripts. We collected the data manually, one entry at a time, from each individual page in PROTAC-DB. In detail, we first found structures by UniProt ID in PROTAC-DB, and then aligned the remaining entries that do not have structures to them. The separation of E3 ligands, linkers, and warheads and the classification of degradation were also done manually. All of this took us several months. To be honest, the pre-processing work was mostly done by my lab mates, and I just "use" the data.

diliadis commented 1 year ago

Thanks for the update! Is there a plan to publish the dataset then (even without including the proprietary part)?

Fenglei104 commented 1 year ago

Yes. The data is used in another project now. After finishing that, we will publish it.

diliadis commented 1 year ago

Great, good to know. Good luck with the project.

IgorHorta commented 1 year ago

Hey @Fenglei104, congrats on the project! Any updates regarding the dataset? :P Thanks in advance.

diliadis commented 1 year ago

Hey @IgorHorta, this is not my project :) As far as I know, the authors have not provided the full dataset since I raised this issue. It is actually quite odd that they haven't, since the accompanying paper has a "Data availability" subsection claiming that all PROTAC data used in their study is available in PROTAC-DB. This is not entirely true: in the same paper, they state that they augmented their dataset with more PROTACs from other sources (without giving any further details). Hopefully, they will make the entire dataset publicly available in the near future.

IgorHorta commented 1 year ago

@diliadis oh gosh, I tagged the wrong person lol. Thanks for answering.

Any news on this, @Fenglei104?

Fenglei104 commented 11 months ago

We are preparing a larger dataset because of the release of PROTAC-DB 2.0. The data will be released along with them. Since we process the data manually, one by one, it will take a while. Sorry for keeping you waiting!