
EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.

https://epistasislab.github.io/pmlb/

MIT License

804 stars 134 forks

Credit/Origin? #13

Open ddofer opened 6 years ago

ddofer commented 6 years ago

Nice resource! I may add some to it in the future (although the ones I use for benchmarking are considerably "rarer" than the ones here: time-series + raw text + locations, entities, etc.).
The various datasets don't seem to credit their origin (e.g. "winered", which I assume is the wine dataset from UCI, but there's nothing about that in the data folder or the csv.gz file). Adding the origin (even at the "site" level, e.g. "UCI", "open-ML", "kaggle datasets", "KDD") would make it much easier to analyze the original datasets, their context, domain and interpretation, and to search them (e.g. "Looking for datasets on time-series + predictive maintenance").

ddofer commented 6 years ago

This could be a separate readme file, no need to go overboard. e.g. "analcatdata" = "http://people.stern.nyu.edu/jsimonof/AnalCatData/" ?

rhiever commented 6 years ago

Good idea. Not sure if we have the bandwidth to get around to doing that anytime soon, but we'll keep it filed here in case anyone wants to take this issue on.

darwinbandoy commented 5 years ago

Thanks for this wonderful resource. I am also interested in tracing the origin and background of each dataset, as the readme file just contains "breast tumors". A line about the original source or the accompanying publication would be helpful. Thanks

csinva commented 5 years ago

Also interested in this!

codrin-kruijne commented 3 years ago

Compliments from my side for gathering these datasets too! I agree it would be helpful to have information about the dataset source. Ideally a link to where the original is published, so you can find the description of the dataset at its origin. Maybe add it as a column to summary_stats?

trangdata commented 3 years ago

Thanks @codrin-kruijne for this input! We actually tried to streamline this effort of adding sources last year. We now have a metadata.yaml file for each dataset. Not all of them have a non-empty source field yet, and we're looking for contributors to add this information. See for example here.

Alternatively, you can get to the metadata by clicking on the octocat in the last column of the summary table on our main website: https://epistasislab.github.io/pmlb/ Hope that helps!
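
For anyone who wants to find the datasets that still need a source, here is a minimal sketch that scans a local checkout for metadata.yaml files whose source field is empty. It is not part of the pmlb package. The directory layout (datasets/<name>/metadata.yaml) and the helper names has_source and datasets_missing_source are assumptions made for illustration, and the check is a naive line-based one rather than a full YAML parse.

```python
from pathlib import Path

# Values treated as "no source given" (assumption: the field is a top-level
# `source:` key, as mentioned in the comment above).
EMPTY_VALUES = {"", "''", '""', "~", "null"}


def has_source(yaml_text: str) -> bool:
    """Return True if a top-level `source:` key has a non-empty value.

    Naive line-based check, not a YAML parser.
    """
    lines = yaml_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("source:"):
            continue
        value = line[len("source:"):].strip()
        if value.startswith("#"):
            value = ""  # only a comment follows the key
        if value not in EMPTY_VALUES:
            return True
        # The value may instead start on an indented continuation line.
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        return nxt.startswith((" ", "\t")) and bool(nxt.strip())
    return False


def datasets_missing_source(root: str) -> list[str]:
    """List dataset folder names under `root` whose metadata lacks a source."""
    missing = []
    for path in sorted(Path(root).glob("*/metadata.yaml")):
        if not has_source(path.read_text(encoding="utf-8")):
            missing.append(path.parent.name)
    return missing


if __name__ == "__main__":
    # Hypothetical path to the datasets folder of a local clone.
    for name in datasets_missing_source("datasets"):
        print(name)
```

Run from the root of a clone, this would print one dataset name per line. A real YAML parser would be more robust if the files use quoting or anchors that this check does not handle.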

codrin-kruijne commented 3 years ago

Thanks @trang1618! I added links to all the metadata.yaml files to our summary_stats table, in a metadata column for easy access, and I will encourage my colleagues to contribute when they find an incomplete one. I will explore a bit and then see how I might contribute.
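
A rough sketch of how such a metadata column could be built is below. This is not the code actually used for the table. The URL pattern, the dataset column name, and the function name add_metadata_column are all assumptions for illustration.

```python
import csv
import io

# Assumption: each dataset's metadata file lives at this path on the
# master branch of the repository.
METADATA_URL = (
    "https://github.com/EpistasisLab/pmlb/blob/master/"
    "datasets/{name}/metadata.yaml"
)


def add_metadata_column(tsv_text: str, name_column: str = "dataset") -> str:
    """Return the TSV text with an extra `metadata` column of links.

    `name_column` is the (assumed) header of the column holding dataset names.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + ["metadata"]

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # One link per row, derived only from the dataset name.
        row["metadata"] = METADATA_URL.format(name=row[name_column])
        writer.writerow(row)
    return out.getvalue()
```

The function works on text, not on files, so it can be applied to a summary table read from disk or downloaded. It does not check whether the linked file exists or whether a metadata column is already present.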

trangdata commented 3 years ago

Amazing @codrin-kruijne ! Thank you!!! 🙏🏽

jpgard commented 1 year ago

+1 here -- this is a great resource, thank you for your work on it.

But also, the missing metadata is a major pain point. There really isn't any way for another contributor to even find where many of these datasets are from, let alone understand more about the dataset itself (what do the labels mean? what are the features? etc.).

We (as users of the package) have no idea where the individual datasets were drawn from, and there isn't any information even in the .yaml files or the published paper. This is really something the developers will need to lead the charge on, or at least provide more information so that others can help :) . Can you at minimum provide a link to each dataset's original source in the metadata table (https://github.com/EpistasisLab/pmlb/blob/master/pmlb/all_summary_stats.tsv), or any information at all about where it is from to narrow the search (Kaggle, UCI, etc.)?

lacava commented 1 year ago

hi @jpgard, unfortunately the dev team for this project has turned over a few times since 2017 and we don't have perfect verifiable source info for many datasets, so it takes time. Still, we've annotated many datasets with source info; out of 420 current datasets, we still need metadata on about 246 of them. We have a contribution guide for verifying source: https://epistasislab.github.io/pmlb/contributing.html and some example PRs using colab (e.g. https://github.com/EpistasisLab/pmlb/pull/86). Every little bit helps!