"Croissant is a high-level format for machine learning datasets that brings together four rich layers."
This issue tracks activities in our collaboration with the Kaggle team on Croissant.
Mission
Make datasets easier to find and work with for machine learning, at scale and by diverse stakeholders (e.g. AI engineers, AI ethicists, compliance managers, the interested public).
Vision
Croissant is the most convenient and widely used machine-readable format for ML-ready datasets.
Issues we will probably work on
[ ] Include a facet in the Search API for datasets whose files are all truly open (no custom terms, no guestbooks).
[ ] Let Kaggle know how many datasets and bytes to expect when copying CC0 datasets from Harvard Dataverse (see notes from the 2024-07-18 meeting and Slack)
[ ] Let Kaggle know the best way to see when datasets have changed
[ ] Commit data from Dataverse to Kaggle via CroissantML at the click of a button, as an explicit action by the user. Is this part of a larger story around pushing data to other systems, such as data lakes?
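For the "how many datasets and bytes" item above, here is a minimal sketch of how the tally might be computed from Search API results that have already been fetched and parsed. It assumes each page is the JSON body of a file-level search, with results under `data.items` and each file item carrying `type`, `dataset_persistent_id` and `size_in_bytes`. Those field names are assumptions to check against the Search API documentation, and `tally_files` is a hypothetical helper, not part of Dataverse.

```python
from typing import Any


def tally_files(search_pages: list[dict[str, Any]]) -> tuple[int, int]:
    """Return (distinct dataset count, total bytes) over file-level search results.

    Assumed (unverified) response shape: {"data": {"items": [...]}}, where each
    file item has "type", "dataset_persistent_id" and "size_in_bytes".
    """
    dataset_ids: set[str] = set()
    total_bytes = 0
    for page in search_pages:
        for item in page.get("data", {}).get("items", []):
            # Skip anything that is not a file (e.g. dataset or collection hits).
            if item.get("type") != "file":
                continue
            pid = item.get("dataset_persistent_id")
            if pid:
                dataset_ids.add(pid)
            total_bytes += int(item.get("size_in_bytes", 0))
    return len(dataset_ids), total_bytes


# Made-up sample pages for illustration only; not real Harvard Dataverse data.
sample_pages = [
    {"data": {"items": [
        {"type": "file", "dataset_persistent_id": "doi:10.0000/A", "size_in_bytes": 100},
        {"type": "dataset", "global_id": "doi:10.0000/A"},
    ]}},
    {"data": {"items": [
        {"type": "file", "dataset_persistent_id": "doi:10.0000/A", "size_in_bytes": 50},
        {"type": "file", "dataset_persistent_id": "doi:10.0000/B", "size_in_bytes": 25},
    ]}},
]
datasets, total = tally_files(sample_pages)
print(f"{datasets} datasets, {total} bytes")
```

Fetching the pages is left out on purpose: the query (for example, how to restrict to CC0 and truly open files) depends on the facet work tracked above.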
Issues we've opened or are keeping an eye on
Depending on the outcome of these issues, we may enhance our Croissant implementation to cover additional use cases.
Related
Now that the Croissant exporter is in place and being indexed by Google, Harvard Dataverse is showing Croissant data in Google Dataset Search, as described in a mailing list post with a screenshot.
Emails exchanged with Croissant Task Force about file formats and DDI-CDI.