Backwards compatibility

EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.

https://epistasislab.github.io/pmlb/

MIT License


Backwards compatibility #119

Open josepablocam opened 4 years ago

josepablocam commented 4 years ago

Thank you all for the great package! It is really lovely to be able to include pmlb in a project and fetch datasets only as needed (without having to personally host the data and provide access to others directly). Recently I ran into a situation where older versions of pmlb now return 404 errors, and after upgrading pmlb some datasets no longer exist or have been renamed. In general this is not a huge deal for ad-hoc projects, but pmlb provides this great functionality for some longer-term artifacts.

With that in mind, is there any interest in receiving a patch for some amount of backwards compatibility? It would be great if datasets that were removed because they are duplicates redirected to their canonical names, and similarly if failed requests made a best-effort attempt to suggest an alternative request that succeeds.

I'd be happy to take a look at this, but wanted to gauge interest first (in the meantime I've found myself having to include old datasets directly in repos).

lacava commented 4 years ago

hi @josepablocam, we'd be happy to receive such a patch and would really appreciate it! I think it's a bit tricky to handle, so I'm not sure of how to do it, but let us know what you're thinking. Maybe @weixuanfu or @trang1618 or @JDRomano2 can weigh in as well.

josepablocam commented 4 years ago

@lacava great, let me take a look and plan something out and I'll share here and see what everyone thinks. Thanks again for the library, really appreciate all the work you and the other maintainers/developers have put into this!

trangdata commented 4 years ago

Thank you for the suggestion and for offering to help @josepablocam! I'm wondering if a mapping from old dataset names to current dataset names would help?

I do want to emphasize that using the current benchmark collection is the recommended approach, since it avoids errors present in past versions of the data.

JDRomano2 commented 4 years ago

The easiest way to do this is, as @trang1618 said, to create a file that lives someplace in the source tree, mapping obsolete database names to their current names, and have fetch_data() check against the contents of this file every time it runs.
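A minimal sketch of the mapping idea described above, not the actual pmlb implementation. The names `RENAMED_DATASETS`, `resolve_dataset_name`, and `fetch_with_renames` are hypothetical, the mapping entries are placeholders rather than real PMLB renames, and the mapping is shown as an in-module dictionary instead of a file for brevity. The real fetch function is passed in as a callable so the sketch does not assume anything about the signature of fetch_data().

```python
from typing import Callable, Dict

# Hypothetical mapping of obsolete dataset names to current ones.
# These entries are placeholders, not real PMLB renames.
RENAMED_DATASETS: Dict[str, str] = {
    "old_name_a": "current_name_a",
    "duplicate_of_b": "current_name_b",
}


def resolve_dataset_name(name: str, renames: Dict[str, str] = RENAMED_DATASETS) -> str:
    """Follow rename entries until reaching a name that is not itself obsolete."""
    seen = set()
    while name in renames:
        if name in seen:
            # Guard against a badly edited mapping such as a -> b -> a.
            raise ValueError(f"Cyclic rename entry for {name!r}")
        seen.add(name)
        name = renames[name]
    return name


def fetch_with_renames(name: str, fetch: Callable[[str], object]) -> object:
    """Resolve the name first, then delegate to the real fetch function."""
    return fetch(resolve_dataset_name(name))
```

Following chains of renames means a dataset that was renamed more than once still resolves from its oldest name, as long as each rename was recorded.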

Obviously, the tough part will be retroactively identifying all of these changes up until the current version. Has there been any convention (formal or informal) for mentioning when a database name changes, like in commit messages?
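If no convention exists, one possible starting point is to compare the list of dataset names from an older release with the current list and flag likely renames for manual review. The sketch below assumes both lists are already available as plain strings; `propose_renames` is a hypothetical helper, and name similarity is only a heuristic, so every candidate would still need to be checked by hand.

```python
import difflib
from typing import Dict, Iterable, List


def propose_renames(
    old_names: Iterable[str],
    new_names: Iterable[str],
    cutoff: float = 0.8,
) -> Dict[str, List[str]]:
    """Map each name missing from the new list to similar newly added names."""
    old, new = set(old_names), set(new_names)
    removed = sorted(old - new)  # names that no longer exist
    added = sorted(new - old)    # names that appeared since the old release
    return {
        name: difflib.get_close_matches(name, added, n=3, cutoff=cutoff)
        for name in removed
    }
```

A removed name with an empty candidate list was probably deleted outright (for example as a duplicate) rather than renamed, and would need a manually chosen canonical replacement.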

As part of this, there is probably a more graceful way for PMLB to fail when it can't find the database than returning a 404. Even just error text with a link to the currently valid databases and a link to a GitHub issue template for reporting when legacy database access has 'broken' would help, so we could add the name to this hypothetical mapping file.
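A rough sketch of what such a failure could look like, under stated assumptions: `DatasetNotFoundError` and `check_dataset_name` are hypothetical names, the dataset list URL is simply the project site, and the issue URL is the conventional GitHub path for the repository rather than an existing issue template. Close matches come from the standard library's `difflib`.

```python
import difflib
from typing import Iterable

# Both URLs are assumptions for illustration: the first is the project site,
# the second is the conventional GitHub "new issue" path for the repository.
DATASET_LIST_URL = "https://epistasislab.github.io/pmlb/"
ISSUE_URL = "https://github.com/EpistasisLab/pmlb/issues/new"


class DatasetNotFoundError(LookupError):
    """Raised when a requested dataset name is not in the current collection."""


def check_dataset_name(name: str, valid_names: Iterable[str]) -> str:
    """Return the name if valid, otherwise raise an error with guidance."""
    valid = list(valid_names)
    if name in valid:
        return name
    # Best-effort suggestions based on string similarity only.
    suggestions = difflib.get_close_matches(name, valid, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise DatasetNotFoundError(
        f"Dataset {name!r} was not found.{hint} "
        f"Currently valid datasets are listed at {DATASET_LIST_URL}. "
        f"If this name worked in an older release, please report it at {ISSUE_URL}."
    )
```

Running this check before any network request would replace the bare 404 with a message that tells the user where to look and how to report a broken legacy name.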

lacava commented 4 years ago

> create a file that lives someplace in the source tree, mapping obsolete database names to their current names, and have fetch_data() check against the contents of this file every time you run it.

That seems like a good solution for handling revisions to dataset names within the current version. I realize now that I was thinking of "forwards" compatibility when I mentioned trickiness, i.e. coming up with some kind of link-redirect strategy or using symbolic links so that older released versions of fetch_data() keep working (e.g. the one in https://pypi.org/project/pmlb/0.3/).

> Obviously, the tough part will be retroactively identifying all of these changes up until the current version. Has there been any convention (formal or informal) for mentioning when a database name changes, like in commit messages?

I don't think so.

> As part of this, there is probably a more graceful way for PMLB to fail when it can't find the database than returning a 404. Even just error text with a link to the currently valid databases and a link to a GitHub issue template for reporting when legacy database access has 'broken' would help, so we could add the name to this hypothetical mapping file.

I like this idea as well.