OlivierBeq / Papyrus-scripts

MIT License

Possible change in Dataset size? #4

Closed: LongHung-Pham closed this issue 1 year ago

LongHung-Pham commented 1 year ago

Dear Papyrus team,

Thank you very much for your dedication in building this database. I have run into a reproducibility issue. About one month ago (March 20th), when I downloaded the high-quality Papyrus++ dataset, the file size was 546 MB. Now, downloading with the same command, the size is 473 MB. This changes the number of activity points I can query considerably, even though the dataset version is still 05.6.

The Python code I used to download the data is: download_papyrus(version='latest', structures=True, descriptors=None)

Could you let me know whether this is due to data cleaning (possibly the removal of duplicate points)? If not, is it possible to download the old data for reproducibility? Thank you very much, and I hope to hear from you.
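A file size alone is a weak signal that a download changed. One possible way to detect this kind of silent change is to record a checksum of the downloaded archive next to the analysis results and compare it on the next download. The sketch below uses only the Python standard library; the helper names `file_fingerprint` and `has_changed` are hypothetical and not part of Papyrus-scripts.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`.

    Hypothetical helper for illustration; not part of Papyrus-scripts.
    """
    digest = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as handle:
        # Read in chunks so a large archive does not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def has_changed(path, recorded_sha256):
    """Return True if the file's current SHA-256 differs from a recorded one."""
    return file_fingerprint(path)[1] != recorded_sha256
```

Storing the returned size and hash together with the dataset version makes it possible to tell later whether two runs labeled "05.6" actually used the same file.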

OlivierBeq commented 1 year ago

Hi @LongHung-Pham,

The Papyrus++ dataset for versions 05.4, 05.5 and 05.6 contained duplicated values of pchembl_value and inconsistent annotations of the corresponding assay types.

We updated and annotated the Zenodo submissions with this fix on Wednesday, April 5th (version 05.5 and version 05.6). The previous (though incorrect) submissions are still available on Zenodo.

The download utility's links were updated immediately so that users benefit from the fixes. The dataset is smaller as a result, because the incorrectly duplicated data, which amounted to unintended data augmentation, has been corrected.
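To check whether a local copy still contains repeated activity rows, something along these lines could be used. This is only a sketch: the helper `count_repeated_rows` and the column names in the usage example are assumptions for illustration, not the actual Papyrus schema or the check the maintainers ran.

```python
from collections import Counter


def count_repeated_rows(rows, key_columns):
    """Return {key_tuple: count} for keys that occur more than once.

    `rows` is any iterable of dict-like records (for example from
    csv.DictReader); `key_columns` names the fields that should be
    unique together. Hypothetical helper, not part of Papyrus-scripts.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative records; the column names are assumed, not the real schema.
example_rows = [
    {"compound": "C1", "target": "T1", "pchembl_value": "6.5"},
    {"compound": "C1", "target": "T1", "pchembl_value": "6.5"},
    {"compound": "C2", "target": "T1", "pchembl_value": "7.1"},
]
repeated = count_repeated_rows(example_rows, ["compound", "target", "pchembl_value"])
```

An empty result means no key combination repeats; a large one suggests the file predates the fix.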

Should you experience differences with the newer release, we recommend the use of data augmentation to obtain close performance.

Our apologies for the inconvenience!

LongHung-Pham commented 1 year ago

Hi @OlivierBeq, thank you so much for the clarification. Now I can be confident in using the new data. I hope the team continues to update and curate the database. Thank you once again!