Open ddofer opened 6 years ago
This could be a separate readme file, no need to go overboard. E.g. "analcatdata" = "http://people.stern.nyu.edu/jsimonof/AnalCatData/"?
Good idea. Not sure if we have the bandwidth to get around to doing that anytime soon, but we'll keep it filed here in case anyone wants to take this issue on.
Thanks for this wonderful resource. I am also interested in tracing the origin and background of each dataset, as the readme file just contains "breast tumors". A line about the original source or the accompanying publication would be helpful. Thanks
Also interested in this!
Compliments from my side for gathering these datasets too! I agree it would be helpful to have information about the dataset source. Ideally a link to where the original is published, so you can find the description of the dataset at its origin. Maybe add it as a column to summary_stats?
Thanks @codrin-kruijne for this input! We actually tried to streamline this effort of adding sources last year. We now have a metadata.yaml file for each dataset. Not all of them have a non-empty source field yet, so we're looking for contributors to add this information. See for example here.
Alternatively, you can get to the metadata by clicking on the octocat in the last column of the summary table on our main website: https://epistasislab.github.io/pmlb/ Hope that helps!
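For anyone who wants to find which datasets still need a source, here is a minimal sketch of how the per-dataset metadata.yaml files could be scanned for an empty source field. It assumes a local clone where each dataset lives in its own folder containing a metadata.yaml (the folder layout, the function names, and the use of PyYAML are my assumptions, not something confirmed in this thread).

```python
from pathlib import Path


def has_source(metadata):
    """Return True if parsed metadata carries a non-empty `source` value."""
    if not isinstance(metadata, dict):
        # Empty or malformed YAML files parse to None or a non-mapping.
        return False
    source = metadata.get("source")
    if isinstance(source, str):
        return bool(source.strip())
    return bool(source)


def datasets_missing_source(metadata_by_name):
    """Return the sorted names of datasets whose `source` is missing or empty."""
    return sorted(
        name for name, meta in metadata_by_name.items() if not has_source(meta)
    )


def load_metadata(root):
    """Parse every <root>/<dataset>/metadata.yaml (assumed layout)."""
    # PyYAML is a third-party package; it is imported lazily so the
    # helpers above can be used without it.
    import yaml

    result = {}
    for path in sorted(Path(root).glob("*/metadata.yaml")):
        with path.open(encoding="utf-8") as handle:
            result[path.parent.name] = yaml.safe_load(handle)
    return result
```

Something like `datasets_missing_source(load_metadata("datasets"))` would then list the folders that still need attention, assuming the clone has a `datasets` directory.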
Thanks @trang1618! I added links to all the metadata.yaml files to our summary_stats table, in a metadata column for easy access, and I will encourage my colleagues to contribute when they find an incomplete one. I will explore a bit and then see how I might contribute.
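In case it helps others do the same, this is roughly how a metadata column of links could be appended to a summary_stats-style TSV. It is only a sketch: the `dataset` column name, the sample row, and the URL pattern for each metadata.yaml are assumptions on my part and may not match the repository.

```python
import csv
import io

# Assumed URL pattern; the real location of each metadata.yaml may differ.
METADATA_URL = (
    "https://github.com/EpistasisLab/pmlb/blob/master/datasets/{name}/metadata.yaml"
)


def add_metadata_column(tsv_text, name_column="dataset"):
    """Return the TSV text with a `metadata` column linking to each metadata.yaml."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames or [])
    if "metadata" not in fieldnames:
        fieldnames.append("metadata")

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Build the link from the dataset name in the assumed name column.
        row["metadata"] = METADATA_URL.format(name=row[name_column])
        writer.writerow(row)
    return out.getvalue()


# Made-up sample row, for illustration only.
sample = "dataset\tn_instances\nexample_dataset\t100\n"
print(add_metadata_column(sample))
```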
Amazing @codrin-kruijne ! Thank you!!! 🙏🏽
+1 here -- this is a great resource, thank you for your work on it.
But also, the missing metadata is a major pain point. There really isn't any way for another contributor to even find where many of these datasets are from, let alone understand more about the dataset itself (what do the labels mean? what are the features? etc.).
We (as users of the package) have no idea where the individual datasets were drawn from, and there isn't any information even in the .yaml files or the published paper. This is really something the developers will need to lead the charge on, or at least provide more information so that others can help :). Can you at a minimum provide a link to each dataset's original source in the metadata table (https://github.com/EpistasisLab/pmlb/blob/master/pmlb/all_summary_stats.tsv), or any information at all about where it is from to narrow the search (Kaggle, UCI, etc.)?
Hi @jpgard, unfortunately the dev team for this project has turned over a few times since 2017 and we don't have perfect verifiable source info for many datasets, so it takes time. Still, we've annotated many datasets with source info; out of 420 current datasets, we still need metadata on about 246 of them. We have a contribution guide for verifying source: https://epistasislab.github.io/pmlb/contributing.html and some example PRs using Colab (e.g. https://github.com/EpistasisLab/pmlb/pull/86). Every little bit helps!