Open cthoyt opened 1 year ago
Short answer: No, not yet.
The creation of the dataset was done with Pipeline Pilot.
All steps are described under the sub-heading "Construction of Papyrus" of the preprint's Material and Methods section.
I am currently moving this to Python scripts, but this is no small task. Nevertheless, once the scripts are equivalent to the Pipeline Pilot workflow, they will be made public!
Oh, I see. I've read this section of the paper, but as you're probably aware, it is no substitute for reproducible code.
Please take a look at https://github.com/cthoyt/chembl-downloader to see if you can use that to improve the reproducibility of the parts related to ChEMBL. If there are any improvements that I can make to the package, let me know.
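For illustration, here is a minimal sketch of how pinning a ChEMBL release with chembl-downloader might look. The query, the column selection, and the names `CHEMBL_VERSION`, `ACTIVITY_SQL`, `fetch_rows`, and `fetch_from_chembl` are assumptions made up for this example; this is not the Papyrus workflow, only a hint at how the ChEMBL part could be made repeatable.

```python
import sqlite3

# Pin the release explicitly so that every rerun reads the same data.
# "31" is only an example value here.
CHEMBL_VERSION = "31"

# Illustrative query against the ChEMBL SQLite schema (an assumption for
# this sketch, not the query used to build Papyrus).
ACTIVITY_SQL = """
SELECT
    md.chembl_id AS molecule_chembl_id,
    cs.canonical_smiles,
    td.chembl_id AS target_chembl_id,
    act.standard_type,
    act.pchembl_value
FROM activities AS act
JOIN assays AS ass ON act.assay_id = ass.assay_id
JOIN target_dictionary AS td ON ass.tid = td.tid
JOIN molecule_dictionary AS md ON act.molregno = md.molregno
JOIN compound_structures AS cs ON md.molregno = cs.molregno
WHERE act.pchembl_value IS NOT NULL
LIMIT ?
"""


def fetch_rows(conn: sqlite3.Connection, limit: int = 5) -> list:
    """Run the illustrative query on any SQLite connection with the ChEMBL schema."""
    return conn.execute(ACTIVITY_SQL, (limit,)).fetchall()


def fetch_from_chembl(version: str = CHEMBL_VERSION, limit: int = 5) -> list:
    """Download (or reuse a cached copy of) the pinned release, then query it.

    The first call downloads several gigabytes, so the third-party import
    is kept inside the function.
    """
    import chembl_downloader

    with chembl_downloader.connect(version=version) as conn:
        return fetch_rows(conn, limit=limit)
```

Calling `fetch_from_chembl()` should then return the same rows on any machine, as long as the version string stays fixed.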
Our main objective was to provide a library that eases handling of the dataset, and with it reproducible data mining and the development of QSAR and PCM models.
Given the limited academic use of Pipeline Pilot and the number of custom tools used to obtain the Papyrus dataset, we (all authors) decided to migrate the scripts that build the dataset to Python in a second step. I am currently doing this migration and have already started using ChEMBL-downloader (awesome, by the way!), and I have not needed anything more than it provides. However, I am far from having a solution for the complete workflow.
The main difficulty in translating the workflow to Python lies in finding open-source alternatives to the Pipeline Pilot cheminformatics components.
In summary
Has there been progress on this issue? Or are there plans to update the Papyrus datasets to keep track of ChEMBL versions? I think the last Papyrus dataset version available is 05.6 for ChEMBL 31, but ChEMBL 33 is already available.
We are working on the next release, including ChEMBL version 33 data. At the same time, IUPHAR data will also be included. There has been tremendous progress on this issue (though not reflected in the branch yet):
That is great to hear. Looking forward to the changes.
@OlivierBeq are you using https://github.com/cthoyt/chembl-downloader for this? I would love to create additional packages for getting this data in nicely versioned ways, and maybe write a short manuscript describing how this can help improve cheminformatics more generally.
@cthoyt, yes, chembl_downloader is one of our dependencies, and it helps a lot in speeding up the implementation.
Is it possible to use these scripts to rebuild the dataset? I can't seem to find anything actually related to the acquisition and processing of the datasets described in the manuscript.