EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License

Clean up commit history #102

Closed JDRomano2 closed 4 years ago

JDRomano2 commented 4 years ago

#30 addressed the lack of Git LFS for the large dataset files. It makes sense to remove these files from the commit history as well. The main effect is a smaller repository when cloned, but there are beneficial side effects too, such as a commit history that is easier to browse and navigate.

Aside from removing large dataset files from the history, is there anything else we can/should clean up?

JDRomano2 commented 4 years ago

I used bfg-repo-cleaner to remove all .gz and .html blobs from the history (the most recent commit is untouched).
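For anyone repeating this kind of cleanup, here is a rough sketch of the command sequence, built as argument lists in Python so it can be reviewed before anything runs. The exact invocation used for this repository is not recorded in this thread, so the jar name `bfg.jar`, the mirror directory `pmlb.git`, and the glob `*.{gz,html}` are all assumptions. The `--delete-files` option and the follow-up `reflog expire` / `gc` steps are taken from the BFG documentation; BFG leaves the most recent commit alone by default.

```python
import shlex


def bfg_cleanup_commands(mirror_dir, bfg_jar="bfg.jar", glob="*.{gz,html}"):
    """Return the cleanup steps as argv lists; nothing is executed here.

    mirror_dir, bfg_jar and glob are illustrative defaults, not the values
    actually used for this repository.
    """
    return [
        # Rewrite history, dropping every file whose name matches the glob.
        ["java", "-jar", bfg_jar, "--delete-files", glob, mirror_dir],
        # Expire reflog entries that still reference the old objects.
        ["git", "-C", mirror_dir, "reflog", "expire", "--expire=now", "--all"],
        # Physically remove the now-unreachable blobs.
        ["git", "-C", mirror_dir, "gc", "--prune=now", "--aggressive"],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in bfg_cleanup_commands("pmlb.git"):
        print(shlex.join(cmd))
```

The commands are meant to run against a fresh `git clone --mirror` of the repository, with the rewritten history pushed back afterwards.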

For example, no .tsv.gz source file is present in the following directory: https://github.com/EpistasisLab/penn-ml-benchmarks/tree/51207e96ce3ccb047908fd0d2532344d77573fc6/datasets/1027_ESL

All users should re-clone the repository to avoid adding 'dirty' files back in when new features are merged into master. For the near future, new pull requests should be inspected to make sure old database or profiling reports haven't been reintroduced (however, this should be fairly obvious).
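The pull request inspection could be partly automated. The sketch below is only a first-pass heuristic and rests on two assumptions: that any added .html file is a profiling report, and that a legitimate .gz file is committed as a Git LFS pointer (a small text file beginning with `version https://git-lfs.github.com/spec/v1`) rather than as raw data. The function names and sample paths are hypothetical.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(blob: bytes) -> bool:
    """True if the committed blob is a Git LFS pointer, not the real file."""
    return blob.startswith(LFS_POINTER_PREFIX)


def suspicious_additions(added: dict) -> list:
    """Flag added files that look like reintroduced 'dirty' content.

    `added` maps a path to the raw bytes of the blob committed at that path.
    """
    flagged = []
    for path, blob in sorted(added.items()):
        if path.endswith(".html"):
            # Assumption: every .html file is a profiling report.
            flagged.append(path)
        elif path.endswith(".gz") and not is_lfs_pointer(blob):
            # A raw gzip blob means the file bypassed Git LFS.
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    sample = {
        "datasets/foo/foo.tsv.gz": LFS_POINTER_PREFIX + b"\noid sha256:...\n",
        "datasets/1027_ESL/1027_ESL.tsv.gz": b"\x1f\x8b\x08raw gzip bytes",
        "datasets/foo/profile.html": b"<html></html>",
    }
    print(suspicious_additions(sample))
```

In practice the added paths could come from `git diff --name-only --diff-filter=A master...<branch>` and the blob contents from `git show <commit>:<path>`.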