I got to chat with Zack about this today and have two requests for HuggingFace that I think are worthwhile conditions on the dataset and should be published alongside it:
Excellent points @sarayourfriend!
Fully in agreement that all three of these factors absolutely have to be considered for any downstream application of the dataset, including training models (and they could also inform the creation of sub-sets that are suitable for training). This also speaks to the points from @Skylion007 about creating and maintaining image mirrors and sub-sets for benchmarking and training models.
As for how best to communicate these points to dataset users on the Hugging Face Hub, this is usually done in the Dataset Card. I'd be more than happy to work with you on drafting a dataset card to go online with the dataset. Here's more information on dataset cards and an example.
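If it helps, here's a minimal sketch of drafting such a card programmatically with the `huggingface_hub` library's `DatasetCard` class; the front matter and prose below are placeholders, not a proposed final card:

```python
# Minimal sketch of a dataset card draft; the card text is placeholder
# content, not the final Openverse card.
from huggingface_hub import DatasetCard

card = DatasetCard("""\
---
license: other
pretty_name: Openverse Image Metadata
---
# Openverse Image Metadata

A raw dump of metadata for openly licensed works indexed by Openverse.
Each record carries its own Creative Commons license; read the usage
notes below before training models on this data.
""")
card.save("README.md")  # a dataset card lives in the repo's README.md
```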
@sarayourfriend thanks for capturing the chat here!
Here's a sample dataset on huggingface (@Skylion007's dataset, no less!), which should be a decent representation of what ours would look like:
https://huggingface.co/datasets/openwebtext
I imagine much of what we chatted about could be included there as documentation on the dataset, so that anyone who trains a model using it can adhere to our recommendations.
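For context, consuming the published dataset would be a one-liner with the `datasets` library; the repo id below is hypothetical, just to show where the documentation and recommendations would surface:

```python
# Sketch of how a downstream user would load the published dump.
# "openverse/openverse-image-metadata" is a hypothetical repo id.
from datasets import load_dataset

ds = load_dataset("openverse/openverse-image-metadata", split="train")
print(ds.features)  # inspect the metadata schema before using the data
```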
It does make me wonder if we should exclude ND works from the dataset entirely, or just caveat that they shouldn't be used to train models. I'll need to think about that more.
@apolinario, any suggestions for datasets with records which shouldn't be used for model training? Should they be excluded from the dataset entirely?
> It does make me wonder if we should exclude ND works from the dataset entirely, or just caveat that they shouldn't be used to train models. I'll need to think about that more.
Are HuggingFace datasets all meant for model training? If so, then we can proactively exclude ND works under that principle. If it is the case that HuggingFace is not meant to host datasets that cannot be used for model training, then I think we should also seek an additional home for the full dataset. Our aggregated catalogue of openly licensed works is useful for things other than model training that wouldn't fall under "derivative works" even under a sceptical reading (like research into how certain license conditions are applied, to name one basic thing). The work to produce even a subset of our dataset in a consumable format would undoubtedly carry over to a whole-dataset version published elsewhere with usage restrictions, so aside from needing to find that additional home for the complete dataset, I don't think this causes problems with the plan generally.
> any suggestions for datasets with records which shouldn't be used for model training? Should they be excluded from the dataset entirely?
In my opinion, they should not be excluded from the dataset entirely for a few reasons:
I believe one way the dataset card could be framed is that this dataset is as close as possible to a raw data dump of Openverse metadata, and that it should not be used as-is to train models without taking the licenses into consideration, along with all the points made above (and any other points we may find useful). With time, curated, filtered, and deduplicated datasets more tailored to training models will emerge, and if that is of interest to Openverse, more aligned ones could even be listed as references.
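As a rough sketch of what that downstream curation could look like (the `license` field name mirrors the Openverse API and, like the repo id, is an assumption about the published schema):

```python
# Sketch of deriving a training-friendlier subset from the raw dump.
from datasets import load_dataset

raw = load_dataset("openverse/openverse-image-metadata", split="train")

# Drop NoDerivatives works (e.g. "by-nd", "by-nc-nd") and any record
# without a machine-readable license.
trainable = raw.filter(
    lambda row: row["license"] is not None and "nd" not in row["license"].lower()
)
```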
> Are HuggingFace datasets all meant for model training?
Not really - the HF Datasets library and Hub can be used for broader use cases than just model training, and I think it could be a home for the entire dump, as long as safeguards are put in place to make sure people don't see it as simply a full dataset that can be used as-is to train a model without taking into consideration the particularities of each license (and I believe we have tools to make those points clear).
That sounds great @apolinario, thanks for the insight! That approach seems ideal for everyone while still doing as much as we can to communicate the intentions of licensors.
Totally agree with all the points made by @apolinario. We are planning to put together a paper describing how the Openverse infrastructure works, how the data is hosted, and how the content is created. HF Datasets are useful for a variety of data science applications, particularly when they properly support streaming, etc.
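As one example of a non-training use, streaming makes it possible to analyze the dump without downloading it in full; `streaming=True` is a standard `datasets` option, while the repo id and `license` field remain assumptions:

```python
# Sketch of streaming the dump to tally license types, e.g. for research
# into how license conditions are applied, without a full download.
from collections import Counter
from itertools import islice

from datasets import load_dataset

stream = load_dataset(
    "openverse/openverse-image-metadata",  # hypothetical repo id
    split="train",
    streaming=True,
)
counts = Counter(row["license"] for row in islice(stream, 10_000))
print(counts.most_common())
```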
I would also like to add, as an additional objective, that I want to reduce scraping load on the data hosts as much as possible by including all the images that we can safely and legally redistribute.
> Are HuggingFace datasets all meant for model training?
Also want to mention that many datasets (or subsets of datasets) hosted on HF are meant for evaluation, not training - notably, all the `eval` or `test` splits.
And for instance we apply `<meta name="robots" content="noindex,noai,noimageai" />` on those (in addition to human-readable disclaimers) so that automated scraping tools hopefully do not lead to model training on them.
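A crawler that wanted to honor those directives could check for them before ingesting a page; this is only an illustrative sketch using Python's standard-library HTML parser, not a description of any particular scraper:

```python
# Illustrative check for noai/noimageai robots meta directives.
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}


def allowed_for_ai_training(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives.isdisjoint({"noai", "noimageai"})


page = '<head><meta name="robots" content="noindex,noai,noimageai" /></head>'
assert not allowed_for_ai_training(page)
```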
Update: This week I am wrapping up the project proposal and implementation plan. The implementation plan PR will get into the technical aspects of how exactly we create the dataset, which fields to include, the preferred output format, and preferred transfer mechanism. I will solicit comments on that PR while it is a draft so I can incorporate advice from community contributors.
Hi @zackkrida, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Last week, many of the core maintainers of Openverse had the opportunity to meet together synchronously and discuss the efforts around publishing the Openverse dataset. After much discussion, the maintainer team has decided to pause work on this effort for the time being for a few reasons:
Engagement with the community - As we progressed with the project, we realized the paramount importance of deeper collaboration with the broader community, especially experts in the Creative Commons, licensing, and open data domains. We are committed to making decisions that uphold the rights and interests of creators. By pausing, we are taking the necessary time to consult, collaborate, and ensure that the release of this dataset is both appropriate and beneficial, and that it doesn't inadvertently negatively impact creators. This collaborative approach ensures that we are not only aligned with technical standards but also with the ethos and values that the broader community cherishes.
Focus on the core Openverse users - At its heart, Openverse is primarily a search engine. Our central mission has always been to offer an excellent search experience for individuals seeking to discover and utilize creators' works. We believe it is crucial to preserve and enrich the simplicity, effectiveness, and user-centric nature of our search experience. Hence, our commitment remains steadfast in directing our resources and efforts towards projects and features that elevate and refine the core experience of Openverse users: finding openly licensed media for a myriad of use-cases.
Concerns over dataset governance, access, management, and upkeep - The complexity and intricacies related to ensuring proper governance, access control, robust management, and regular upkeep of such an expansive dataset have come to light. We are committed to upholding the highest standards in data stewardship, and at present we do not have the capacity to address any issues that may come up. For instance: What if a creator wants to be removed from the dataset? How do we establish access restrictions that prevent erroneous use of certain licenses? How do we handle reports of sensitive content within the dataset? How do we address reports about incorrect licenses for certain creations? The effort necessary to be appropriate stewards of this data is non-trivial, and we do not want to discount the level of work this might require.
Project prioritization - In assessing our currently ongoing projects and the plans we had scoped for the rest of the year, we believe there are other initiatives that currently demand our attention and resources. In pursuing the scoping for the dataset creation, we put on hold projects that would improve user safety and enable new ways to browse works within Openverse. We also recently revised our expected roadmap for 2023 to reflect the reduced availability of some of our maintainers through the end of the year. Reprioritizing in this way ensures that Openverse continues to evolve and serve its community effectively, and we are redirecting our focus to areas which we believe will bring more immediate value to our users.
We are deeply appreciative of the enthusiasm and support from the community around the Openverse dataset project, particularly @apolinario, @Skylion007, and others. Our decision to pause is by no means an end, but rather a strategic recalibration to ensure we deliver only the best for our community. We look forward to continuing to work with the community members we're currently in collaboration with, along with other institutions operating in a similar space and under related principles.
Description
This project aims to publish and regularly update datasets for each of Openverse's media types, currently image and audio. We aim to provide access to the data currently served by our API, which is otherwise difficult and costly to access in full.
This project aims to:
Through conversations with, and offers from, @Skylion007 and @apolinario, we've identified HuggingFace as a home for the initial, raw metadata dump. They've graciously agreed to help and/or publish complementary datasets.
This project will need a proposal and at least one implementation plan for producing the initial dump and/or establishing a mechanism for creating yearly or twice-yearly dumps of the Openverse dataset.
Documents
Issues
Prior Art