How do I rebuild Papyrus?

cthoyt commented 1 year ago

Is it possible to use these scripts to rebuild the dataset? I can't seem to find anything actually related to the acquisition and processing of the datasets described in the manuscript.

OlivierBeq commented 1 year ago

Short answer: No, not yet.

The dataset was created with Pipeline Pilot. All steps are described under the sub-heading "Construction of Papyrus" in the preprint's Material and Methods section.

I am currently moving this to Python scripts, but this is no small task. Nevertheless, once they are equivalent to the Pipeline Pilot workflow, the scripts to reproduce the dataset will be made public!

cthoyt commented 1 year ago

Oh, I see. I've read this section of the paper, but as you're probably aware, it is no substitute for reproducible code.

Please take a look at https://github.com/cthoyt/chembl-downloader to see if you can use that to improve the reproducibility of the parts related to ChEMBL. If there are any improvements that I can make to the package, let me know.
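For illustration, here is a minimal sketch of what a version-pinned ChEMBL extraction might look like. It is not the Papyrus workflow: the SQL selection and the helper names `run_on_chembl` and `make_toy_db` are assumptions made for this example. The table and column names follow the public ChEMBL schema, and `chembl_downloader.query` runs SQL against a given ChEMBL release. The toy in-memory database, with made-up identifiers, only stands in for the real SQLite dump so the query can be checked quickly.

```python
import sqlite3

# Hypothetical query: activities that carry a pChEMBL value, joined to
# compound structures and targets. Only an illustration, not the Papyrus
# selection criteria.
ACTIVITY_SQL = """
SELECT
    md.chembl_id AS molecule_chembl_id,
    cs.canonical_smiles,
    td.chembl_id AS target_chembl_id,
    act.standard_type,
    act.pchembl_value
FROM activities AS act
JOIN molecule_dictionary AS md ON act.molregno = md.molregno
JOIN compound_structures AS cs ON act.molregno = cs.molregno
JOIN assays AS a ON act.assay_id = a.assay_id
JOIN target_dictionary AS td ON a.tid = td.tid
WHERE act.pchembl_value IS NOT NULL
ORDER BY act.activity_id
"""


def run_on_chembl(version: str):
    """Run the query against a pinned ChEMBL release.

    The dump is downloaded on first use, so this needs network access
    and several gigabytes of disk space.
    """
    import chembl_downloader  # third-party: pip install chembl-downloader

    return chembl_downloader.query(ACTIVITY_SQL, version=version)


def make_toy_db() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for ChEMBL (all values are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE molecule_dictionary (molregno INTEGER, chembl_id TEXT);
        CREATE TABLE compound_structures (molregno INTEGER, canonical_smiles TEXT);
        CREATE TABLE target_dictionary (tid INTEGER, chembl_id TEXT);
        CREATE TABLE assays (assay_id INTEGER, tid INTEGER);
        CREATE TABLE activities (
            activity_id INTEGER, assay_id INTEGER, molregno INTEGER,
            standard_type TEXT, pchembl_value REAL
        );
        INSERT INTO molecule_dictionary VALUES (1, 'TOY_MOL_A'), (2, 'TOY_MOL_B');
        INSERT INTO compound_structures VALUES (1, 'CCO'), (2, 'CCN');
        INSERT INTO target_dictionary VALUES (100, 'TOY_TGT_X');
        INSERT INTO assays VALUES (10, 100);
        INSERT INTO activities VALUES (1, 10, 1, 'IC50', 6.5), (2, 10, 2, 'Ki', NULL);
        """
    )
    return conn


if __name__ == "__main__":
    # Check the query on the toy data; the row without a pChEMBL value is dropped.
    for row in make_toy_db().execute(ACTIVITY_SQL):
        print(row)
```

Passing an explicit release to `run_on_chembl`, for example `run_on_chembl("31")`, rather than relying on the latest one, is what would make a rebuild repeatable across machines and over time.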

OlivierBeq commented 1 year ago

Our main objective was to provide a library that eases handling of the dataset and, in turn, reproducible data mining and the development of QSAR and PCM models.

Given the limited academic use of Pipeline Pilot and the number of custom tools used to obtain the Papyrus dataset, we (all authors) decided to migrate the dataset-building scripts to Python in a second step. I am currently doing this migration and have already started using chembl-downloader (awesome, by the way!); so far I have not needed anything beyond what it provides. However, I am far from having a solution for the complete workflow.

The difficulty in translating the workflow to Python lies in finding open-source alternatives to the Pipeline Pilot cheminformatics components.

OlivierBeq commented 1 year ago

In summary:

The scripts are on their way, not only for reproducibility but also to allow for local customisation and/or inclusion of in-house data.
Be assured this is in active development.
The only obstacle is finding the right combination of cheminformatics libraries and steps to reproduce what is currently obtained with Pipeline Pilot.

adlvdl commented 1 year ago

Has there been progress on this issue? Or are there plans to update the Papyrus datasets to keep track of ChEMBL versions? I think the latest available Papyrus dataset version is 05.6, for ChEMBL 31, but ChEMBL 33 is already available.
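As a rough sketch of such a version check, something like the following could flag when a Papyrus build lags behind ChEMBL. The only mapping entry comes from the versions mentioned above; the names `PAPYRUS_TO_CHEMBL`, `releases_behind`, and `latest_chembl_release` are hypothetical, and the code assumes `chembl_downloader.latest()` returns the release as a string such as "33".

```python
# Mapping taken from the comment above: Papyrus 05.6 corresponds to ChEMBL 31.
PAPYRUS_TO_CHEMBL = {"05.6": 31}


def releases_behind(papyrus_version: str, latest_chembl: int) -> int:
    """Number of ChEMBL releases published since the given Papyrus build."""
    return max(0, latest_chembl - PAPYRUS_TO_CHEMBL[papyrus_version])


def latest_chembl_release() -> int:
    """Ask chembl-downloader for the newest release number (needs network access)."""
    import chembl_downloader  # third-party: pip install chembl-downloader

    # Assumption: the version is a string like "33"; drop any "_1" style suffix.
    return int(str(chembl_downloader.latest()).split("_")[0])


if __name__ == "__main__":
    print(releases_behind("05.6", latest_chembl=33))
```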

OlivierBeq commented 1 year ago

We are working on the next release, which will include ChEMBL version 33 data as well as IUPHAR data. There has been tremendous progress on this issue (though it is not reflected in the branch yet):

We have identified most (if not all) of the libraries needed to reproduce the Pipeline Pilot results.
The data extraction and curation workflow for the ChEMBL data is complete.

adlvdl commented 1 year ago

That is great to hear. Looking forward to the changes.

cthoyt commented 1 year ago

@OlivierBeq are you using https://github.com/cthoyt/chembl-downloader for this? I would love to create additional packages for getting this data in nicely versioned ways, and maybe write a short manuscript describing how this can help improve cheminformatics more generally.

OlivierBeq commented 1 year ago

@cthoyt, yes, chembl_downloader is one of our dependencies, and it helps a lot in speeding up the implementation.

OlivierBeq / Papyrus-scripts

How do I rebuild Papyrus? #3