huggingface/datasets (datasets)
### [`v3.0.0`](https://redirect.github.com/huggingface/datasets/releases/tag/3.0.0)
[Compare Source](https://redirect.github.com/huggingface/datasets/compare/2.21.0...3.0.0)
#### Dataset Features
- Use Polars functions in `.map()`
  - Allow Polars as valid output type by [@psmyth94](https://redirect.github.com/psmyth94) in [https://github.com/huggingface/datasets/pull/6762](https://redirect.github.com/huggingface/datasets/pull/6762)
  - Example:
```python
>>> import polars as pl
>>> from datasets import load_dataset
>>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
>>> cols = [pl.col("content").str.len_bytes().alias("length")]
>>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
>>> ds_with_length[:5]
shape: (5, 5)
┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
│ idx ┆ title ┆ content ┆ labels ┆ length │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ u32 │
╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
│ 0 ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure ┆ 180 │
│ 1 ┆ Pikachu's Quest for Peace ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative ┆ 138 │
│ 2 ┆ The Tender Tale of Squirtle ┆ Squirtle took everyone on a memo… ┆ gentle_adventure ┆ 135 │
│ 3 ┆ Charizard's Heartwarming Tale ┆ Charizard found joy in helping o… ┆ heartwarming_story ┆ 112 │
│ 4 ┆ Jolteon's Sparkling Journey ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111 │
└─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
```
- Support NumPy 2
  - Allow numpy-2.1 and test it without audio extra by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7118](https://redirect.github.com/huggingface/datasets/pull/7118)
#### Cache Changes
- Use `huggingface_hub` cache by [@lhoestq](https://redirect.github.com/lhoestq) in [https://github.com/huggingface/datasets/pull/7105](https://redirect.github.com/huggingface/datasets/pull/7105)
  - use the `huggingface_hub` cache for files downloaded from HF, by default at `~/.cache/huggingface/hub`
  - cached datasets (Arrow files) will still be reloaded from the `datasets` cache, by default at `~/.cache/huggingface/datasets`
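
The two default locations above can be resolved without importing either library. The following is a minimal sketch, not the libraries' own resolution code: the helper name `default_hf_cache_dirs` is hypothetical, and it assumes the commonly documented `HF_HOME`, `HF_HUB_CACHE`, `HF_DATASETS_CACHE` and `XDG_CACHE_HOME` environment variables are the only overrides in play.

```python
import os
from pathlib import Path


def default_hf_cache_dirs(env=None):
    """Return (hub_cache, datasets_cache) as Paths.

    Hypothetical helper: it mirrors the documented defaults and
    environment-variable overrides, it does not call the libraries.
    """
    env = os.environ if env is None else env
    # HF_HOME defaults to ~/.cache/huggingface (XDG_CACHE_HOME is honored if set).
    xdg_cache = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    hf_home = Path(env.get("HF_HOME") or Path(xdg_cache) / "huggingface").expanduser()
    # Files downloaded from the Hub now land in the huggingface_hub cache.
    hub_cache = Path(env.get("HF_HUB_CACHE") or hf_home / "hub").expanduser()
    # Prepared Arrow files still live in the datasets cache.
    datasets_cache = Path(env.get("HF_DATASETS_CACHE") or hf_home / "datasets").expanduser()
    return hub_cache, datasets_cache


if __name__ == "__main__":
    hub, arrow = default_hf_cache_dirs()
    print("downloaded files:", hub)
    print("Arrow cache:     ", arrow)
```

After this upgrade, disk usage from downloads may shift to the first directory, so any cleanup or volume-mount configuration that only targets the `datasets` cache is worth reviewing.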
#### Breaking changes
- Remove deprecated code by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/6996](https://redirect.github.com/huggingface/datasets/pull/6996)
  - removed deprecated arguments like `use_auth_token`, `fs` or `ignore_verifications`
- Remove beam by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/6987](https://redirect.github.com/huggingface/datasets/pull/6987)
  - removed deprecated Apache Beam datasets support
- Remove metrics by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/6983](https://redirect.github.com/huggingface/datasets/pull/6983)
  - removed the deprecated `load_metric`; please use the `evaluate` library instead
- Remove tasks by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/6999](https://redirect.github.com/huggingface/datasets/pull/6999)
  - removed the deprecated `task` argument in `load_dataset()`, the `.prepare_for_task()` method, and the `datasets.tasks` module
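
Callers that still pass the removed arguments will fail after this upgrade. The sketch below shows one way to rewrite a `load_dataset()` keyword dictionary before merging. The helper `migrate_load_dataset_kwargs` is hypothetical, and the replacements it applies (`token` for `use_auth_token`, `verification_mode="no_checks"` for `ignore_verifications=True`) are assumptions based on the 2.x deprecation messages, not something stated in these release notes.

```python
def migrate_load_dataset_kwargs(kwargs):
    """Return (new_kwargs, notes) with arguments removed in 3.0.0 rewritten.

    Hypothetical helper; the replacement names are assumptions taken from
    the 2.x deprecation warnings.
    """
    new_kwargs = dict(kwargs)  # never mutate the caller's dict
    notes = []

    # Assumed replacement: `use_auth_token` -> `token`.
    if "use_auth_token" in new_kwargs:
        value = new_kwargs.pop("use_auth_token")
        new_kwargs.setdefault("token", value)

    # Assumed replacement: `ignore_verifications=True` -> `verification_mode="no_checks"`.
    if "ignore_verifications" in new_kwargs:
        if new_kwargs.pop("ignore_verifications"):
            new_kwargs.setdefault("verification_mode", "no_checks")

    # `task` has no drop-in replacement, so it is dropped and reported.
    if "task" in new_kwargs:
        new_kwargs.pop("task")
        notes.append("`task` was removed; rename or cast columns manually instead")

    return new_kwargs, notes


if __name__ == "__main__":
    old = {"path": "squad", "use_auth_token": True, "task": "question-answering"}
    print(migrate_load_dataset_kwargs(old))
```

A grep for `use_auth_token`, `ignore_verifications`, `load_metric`, `prepare_for_task` and `datasets.tasks` across the codebase is a quick way to find call sites that need the same treatment.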
#### General improvements and bug fixes
- Improved the tutorial by adding a link for loading datasets by [@AmboThom](https://redirect.github.com/AmboThom) in [https://github.com/huggingface/datasets/pull/7042](https://redirect.github.com/huggingface/datasets/pull/7042)
- Automatically create `cache_dir` from `cache_file_name` by [@ringohoffman](https://redirect.github.com/ringohoffman) in [https://github.com/huggingface/datasets/pull/7096](https://redirect.github.com/huggingface/datasets/pull/7096)
- remove more script docs by [@lhoestq](https://redirect.github.com/lhoestq) in [https://github.com/huggingface/datasets/pull/7104](https://redirect.github.com/huggingface/datasets/pull/7104)
- Fix args of feature docstrings by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7103](https://redirect.github.com/huggingface/datasets/pull/7103)
- Temporarily pin numpy<2.1 to fix CI by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7114](https://redirect.github.com/huggingface/datasets/pull/7114)
- Fix ConnectionError for gated datasets and unauthenticated users by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7110](https://redirect.github.com/huggingface/datasets/pull/7110)
- Install transformers with numpy-2 CI by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7119](https://redirect.github.com/huggingface/datasets/pull/7119)
- don't mention the script if trust_remote_code=False by [@severo](https://redirect.github.com/severo) in [https://github.com/huggingface/datasets/pull/7120](https://redirect.github.com/huggingface/datasets/pull/7120)
- Fix typed examples iterable state dict by [@lhoestq](https://redirect.github.com/lhoestq) in [https://github.com/huggingface/datasets/pull/7121](https://redirect.github.com/huggingface/datasets/pull/7121)
- Rename LargeList.dtype to LargeList.feature by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7106](https://redirect.github.com/huggingface/datasets/pull/7106)
- Fix wrong SHA in CI tests of HubDatasetModuleFactoryWithParquetExport by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7125](https://redirect.github.com/huggingface/datasets/pull/7125)
- Disable implicit token in CI by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7126](https://redirect.github.com/huggingface/datasets/pull/7126)
- Test get_dataset_config_info with non-existing/gated/private dataset by [@albertvillanova](https://redirect.github.com/albertvillanova) in [https://github.com/huggingface/datasets/pull/7124](https://redirect.github.com/huggingface/datasets/pull/7124)
- fix streaming from arrow files by [@fschlatt](https://redirect.github.com/fschlatt) in [https://github.com/huggingface/datasets/pull/7083](https://redirect.github.com/huggingface/datasets/pull/7083)
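
Of the items above, "Automatically create `cache_dir` from `cache_file_name`" is the one most likely to change observable behavior for callers that pass a cache path in a directory that does not exist yet. The idea can be sketched with the standard library alone; `ensure_cache_dir` is a hypothetical name, and this is an illustration of the behavior described by the PR title, not the code merged in the library.

```python
import os


def ensure_cache_dir(cache_file_name):
    """Create the parent directory of `cache_file_name` if needed and return it.

    Hypothetical helper illustrating the idea, not the library's implementation.
    """
    cache_dir = os.path.dirname(os.path.abspath(cache_file_name))
    # exist_ok=True makes the call idempotent when the directory is already there.
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
```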
#### New Contributors
- [@AmboThom](https://redirect.github.com/AmboThom) made their first contribution in [https://github.com/huggingface/datasets/pull/7042](https://redirect.github.com/huggingface/datasets/pull/7042)
- [@fschlatt](https://redirect.github.com/fschlatt) made their first contribution in [https://github.com/huggingface/datasets/pull/7083](https://redirect.github.com/huggingface/datasets/pull/7083)
**Full Changelog**: https://github.com/huggingface/datasets/compare/2.21.0...3.0.0
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
- [ ] If you want to rebase/retry this PR, check this box
This PR contains the following updates:
`datasets`: `==2.21.0` -> `==3.0.0`
Release Notes
This PR was generated by Mend Renovate. View the repository job log.