Summary

This PR is the first step towards overhauling the dataset loading process to make it easier to bring your own dataset.

[x] Code passes all tests
[x] Unit tests provided for these changes
[x] Documentation and docstrings added for these changes

Changes

Refactor the DrugBank and TWOSIDES importers so they share common code (much of it was duplicated)
Better standardize contexts from DrugBank
Enforce sort order on drugs and contexts in output
Provide a way of setting the negative sampling ratio
Use a fixed random seed (42 by default, but configurable)
Automate downloads to a reliable location with pystow
Enable easily choosing a different number of top side effects for TWOSIDES
Apply code QA checks to ALL folders (so a new folder can no longer be created that escapes the checks)
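
To make the seeding, sort-order, and negative-sampling-ratio points concrete, here is a minimal sketch of how they could fit together. It is not the code in this PR: `sample_negatives`, its parameters, and the `(drug, drug, context)` triple layout are hypothetical and only illustrate the idea, assuming positives are stored with the two drugs in sorted order.

```python
import random


def sample_negatives(positives, drugs, contexts, ratio=1.0, seed=42):
    """Hypothetical helper: draw about ratio * len(positives) unseen triples."""
    rng = random.Random(seed)  # local generator, so the global RNG state is untouched
    drugs = sorted(drugs)  # enforce a stable order before sampling
    contexts = sorted(contexts)
    seen = set(positives)
    # Every unordered drug pair in every context is a candidate triple.
    capacity = len(contexts) * len(drugs) * (len(drugs) - 1) // 2
    # Never ask for more negatives than there are unseen candidates.
    target = min(int(ratio * len(positives)), capacity - len(seen))
    negatives = set()
    while len(negatives) < target:
        left, right = sorted(rng.sample(drugs, 2))
        triple = (left, right, rng.choice(contexts))
        if triple not in seen:
            negatives.add(triple)
    return sorted(negatives)  # sorted output, so reruns are byte-for-byte comparable


# Toy data, made up for illustration only.
positives = [("A", "B", "c1"), ("A", "C", "c2")]
negatives = sample_negatives(positives, ["C", "A", "B"], ["c2", "c1"], ratio=1.0)
```

Using a dedicated `random.Random(seed)` instance plus sorted inputs means the same seed yields the same negatives regardless of the order in which drugs and contexts were read from the source files.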

Next Steps

I'm not really sure whether we should use the datasets with generated negative samples in practice, since that might lead to overfitting. I guess for ML people, having the datasets readily available is good because then they don't have to think about quality or concerns like these. These two goals always conflict in my mind.

Ultimately, I'd like an interface that does all of the data preprocessing on the fly and uses locally cached results instead of looking for a web-based version of the dataset. I will look into this after #50 is done and I can check out the code for DrugComb and DrugCombDB.
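
As a rough illustration of that idea (not a proposal for the final interface), the sketch below runs the preprocessing only when no cached copy exists. The name `load_or_build`, the JSON cache format, and the toy record layout are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_dir, name, build):
    """Hypothetical helper: return cached records, or build and cache them."""
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        # Cache hit: skip preprocessing entirely.
        return json.loads(path.read_text())
    records = build()  # on-the-fly preprocessing happens only on a cache miss
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records))
    return records


calls = []


def build_toy():
    """Stand-in for a real importer; records are made up for illustration."""
    calls.append(1)
    return [["A", "B", "c1", 1.0]]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, "toy", build_toy)
    second = load_or_build(tmp, "toy", build_toy)  # served from the cache
```

In a real implementation the cache directory would presumably come from pystow rather than a temporary directory, so the results persist between sessions.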

AstraZeneca / chemicalx

Clean up DrugBank and TWOSIDES importers #57
