I got to chat with Zack about this today and have two requests for HuggingFace that I think are worthwhile conditions on the dataset and should be published alongside it:
Excellent points @sarayourfriend!
Fully in agreement that all three of these factors absolutely have to be considered for any downstream application of the dataset, including training models (and they could also inform the creation of sub-sets that are suitable for training). This also speaks to the points from @Skylion007 about creating and maintaining image mirrors and sub-sets for benchmarking and training models.
As for how best to communicate these points to dataset users on the Hugging Face Hub, this is usually done in the Dataset Card. I'd be more than happy to work with you on drafting a dataset card to go online with the dataset. Here's more information on dataset cards and an example.
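If it helps, here's a minimal sketch of drafting such a card programmatically with the `huggingface_hub` library's `DatasetCard` class; the front matter and prose below are placeholders, not a proposed final card:

```python
# Minimal sketch of a dataset card draft; the card text is placeholder
# content, not the final Openverse card.
from huggingface_hub import DatasetCard

card = DatasetCard("""\
---
license: other
pretty_name: Openverse Image Metadata
---
# Openverse Image Metadata

A raw dump of metadata for openly licensed works indexed by Openverse.
Each record carries its own Creative Commons license; read the usage
notes below before training models on this data.
""")
card.save("README.md")  # a dataset card lives in the repo's README.md
```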
@sarayourfriend thanks for capturing the chat here!
Here's a sample dataset on huggingface (@Skylion007's dataset, no less!), which should be a decent representation of what ours would look like:
https://huggingface.co/datasets/openwebtext
I imagine much of what we chatted about could be included there as documentation on the dataset, so that anyone who trains a model using it can adhere to our recommendations.
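For context, consuming the published dataset would be a one-liner with the `datasets` library; the repo id below is hypothetical, just to show where the documentation and recommendations would surface:

```python
# Sketch of how a downstream user would load the published dump.
# "openverse/openverse-image-metadata" is a hypothetical repo id.
from datasets import load_dataset

ds = load_dataset("openverse/openverse-image-metadata", split="train")
print(ds.features)  # inspect the metadata schema before using the data
```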
It does make me wonder if we should exclude ND works from the dataset entirely, or just caveat that they shouldn't be used to train models. I'll need to think about that more.
@apolinario, any suggestions for datasets with records which shouldn't be used for model training? Should they be excluded from the dataset entirely?
> It does make me wonder if we should exclude ND works from the dataset entirely, or just caveat that they shouldn't be used to train models. I'll need to think about that more.
Are HuggingFace datasets all meant for model training? If so, then we can proactively exclude ND works under that principle. If it is the case that HuggingFace is not meant to host datasets that cannot be used for model training, then I think we should also seek an additional home for the full dataset. Our aggregated catalogue of openly licensed works is useful for things other than model training that wouldn't fall under "derivative works" even under a sceptical reading (like research into how certain license conditions are applied, to name one basic thing). The work to produce even a subset of our dataset in a consumable format would undoubtedly carry over to a whole-dataset version published elsewhere with usage restrictions, so aside from needing to find that additional home for the complete dataset, I don't think this causes problems with the plan generally.
> any suggestions for datasets with records which shouldn't be used for model training? Should they be excluded from the dataset entirely?
In my opinion, they should not be excluded from the dataset entirely for a few reasons:
I believe one way the dataset card could be framed is that this dataset is as close as possible to a raw data dump of Openverse metadata, and that it should not be used as-is to train models without taking the licenses into consideration, along with all the points made above (and any other points we may find useful). With time, curated, filtered, and deduplicated datasets more tailored to training models will emerge, and if that is of interest to Openverse, more aligned ones could even be listed as references.
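As a rough sketch of what that downstream curation could look like (the `license` field name mirrors the Openverse API and, like the repo id, is an assumption about the published schema):

```python
# Sketch of deriving a training-friendlier subset from the raw dump.
from datasets import load_dataset

raw = load_dataset("openverse/openverse-image-metadata", split="train")

# Drop NoDerivatives works (e.g. "by-nd", "by-nc-nd") and any record
# without a machine-readable license.
trainable = raw.filter(
    lambda row: row["license"] is not None and "nd" not in row["license"].lower()
)
```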
> Are HuggingFace datasets all meant for model training?
Not really - the HF Datasets library and Hub can be used for broader use cases than just model training, and I think it could be a home for the entire dump, as long as safeguards are put in place to make sure people don't see it as simply a full dataset that can be used as-is to train a model without taking into consideration the particularities of each license (and I believe we have tools to make those points clear).
That sounds great @apolinario, thanks for the insight! That approach seems ideal for everyone while still doing as much as we can to communicate the intentions of licensors.
Totally agree with all the points made by @apolinario. We are planning to put together a paper describing how the Openverse infrastructure works, how the data is hosted, and how the content is created. HF Datasets are useful for a variety of data science applications, particularly when they properly support streaming, etc.
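As one example of a non-training use, streaming makes it possible to analyze the dump without downloading it in full; `streaming=True` is a standard `datasets` option, while the repo id and `license` field remain assumptions:

```python
# Sketch of streaming the dump to tally license types, e.g. for research
# into how license conditions are applied, without a full download.
from collections import Counter
from itertools import islice

from datasets import load_dataset

stream = load_dataset(
    "openverse/openverse-image-metadata",  # hypothetical repo id
    split="train",
    streaming=True,
)
counts = Counter(row["license"] for row in islice(stream, 10_000))
print(counts.most_common())
```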
I would also like to add, as an additional objective, that I want to reduce scraping load on the data hosts as much as possible by including all the images that we can safely and legally redistribute.
> Are HuggingFace datasets all meant for model training?
Also want to mention that many datasets (or subsets of datasets) hosted on HF are meant for evaluation, not training - notably, all the `eval` or `test` splits.
And for instance we apply `<meta name="robots" content="noindex,noai,noimageai" />` on those (in addition to human-readable disclaimers) so that automated scraping tools hopefully do not lead to model training on them.
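A crawler that wanted to honor those directives could check for them before ingesting a page; this is only an illustrative sketch using Python's standard-library HTML parser, not a description of any particular scraper:

```python
# Illustrative check for noai/noimageai robots meta directives.
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}


def allowed_for_ai_training(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives.isdisjoint({"noai", "noimageai"})


page = '<head><meta name="robots" content="noindex,noai,noimageai" /></head>'
assert not allowed_for_ai_training(page)
```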
Update: This week I am wrapping up the project proposal and implementation plan. The implementation plan PR will get into the technical aspects of how exactly we create the dataset, which fields to include, the preferred output format, and preferred transfer mechanism. I will solicit comments on that PR while it is a draft so I can incorporate advice from community contributors.
Hi @zackkrida, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Last week, many of the core maintainers of Openverse had the opportunity to meet together synchronously and discuss the efforts around publishing the Openverse dataset. After much discussion, the maintainer team has decided to pause work on this effort for the time being for a few reasons:
Engagement with the community - As we progressed with the project, we realized the paramount importance of deeper collaboration with the broader community, especially experts in the Creative Commons, licensing, and open data domains. We are committed to making decisions that uphold the rights and interests of creators. By pausing, we are taking the necessary time to consult, collaborate, and ensure that the release of this dataset is both appropriate and beneficial, and that it doesn't inadvertently negatively impact creators. This collaborative approach ensures that we are not only aligned with technical standards but also with the ethos and values that the broader community cherishes.
Focus on the core Openverse users - At its heart, Openverse is primarily a search engine. Our central mission has always been to offer an excellent search experience for individuals seeking to discover and utilize creators' works. We believe it is crucial to preserve and enrich the simplicity, effectiveness, and user-centric nature of our search experience. Hence, our commitment remains steadfast in directing our resources and efforts towards projects and features that elevate and refine the core experience of Openverse users: finding openly licensed media for a myriad of use-cases.
Concerns over dataset governance, access, management, and upkeep - The complexity and intricacies related to ensuring proper governance, access control, robust management, and regular upkeep of such an expansive dataset have come to light. We are committed to upholding the highest standards in data stewardship, and at present we do not have the capacity to address any issues that may come up. For instance: What if a creator wants to be removed from the dataset? How do we establish access restrictions that prevent erroneous use of certain licenses? How do we handle reports of sensitive content within the dataset? How do we address reports about incorrect licenses for certain creations? The effort necessary to be appropriate stewards of this data is non-trivial, and we do not want to discount the level of work this might require.
Project prioritization - In assessing our currently ongoing projects and the plans we had scoped for the rest of the year, we believe there are other initiatives that currently demand our attention and resources. In pursuing the scoping for the dataset creation, we put on hold projects that would improve user safety and enable new ways to browse works within Openverse. We also recently revised our expected roadmap for 2023 to reflect the reduced availability of some of our maintainers through the end of the year. Reprioritizing in this way ensures that Openverse continues to evolve and serve its community effectively, and we are redirecting our focus to areas which we believe will bring more immediate value to our users.
We are deeply appreciative of the enthusiasm and support from the community around the Openverse dataset project, particularly @apolinario, @Skylion007, and others. Our decision to pause is by no means an end, but rather a strategic recalibration to ensure we deliver only the best for our community. We look forward to continuing to work with the community members we're currently in collaboration with, along with other institutions operating in a similar space and under related principles.
Description
This project aims to publish and regularly update datasets for each of Openverse's media types, currently image and audio. We aim to provide access to the data currently served by our API, which is otherwise difficult and costly to access in full.
This project aims to:
Through conversations with, and offers from, @Skylion007 and @apolinario, we've identified HuggingFace as a home for the initial, raw metadata dump. They've graciously agreed to help and/or publish complementary datasets.
This project will need a proposal and at least one implementation plan for producing the initial dump and/or establishing a mechanism for creating yearly or twice-yearly dumps of the Openverse dataset.
Documents
Issues
Prior Art