-
**Is your feature request related to a problem? Please describe.**
NeMo Curator currently supports document datasets as dataframes and includes some helpers for reading JSON/Parquet files.
**Describe…
-
The current navbar has links to pages that I think mostly don't matter.
- distributed docs are way too technical for most users
- dask-ml is not actually all that focused on dask + ml users…
-
**Is your feature request related to a problem? Please describe.**
cudf columns are mutable and therefore do not (or should not) implement `__hash__` (in the same way that numpy arrays do not do so…
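
A minimal sketch of the hashability point, in plain Python. The `Column` class and the `is_hashable` helper are hypothetical stand-ins for illustration and are not the cudf implementation; the sketch only shows how a mutable container can opt out of hashing the way built-in lists do.

```python
class Column:
    """Hypothetical mutable column (illustration only, not the cudf class)."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        return isinstance(other, Column) and self.values == other.values

    # Mutable contents mean a hash could go stale, so hashing is disabled.
    # (Python also sets this implicitly when __eq__ is defined without __hash__.)
    __hash__ = None


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


col = Column([1, 2, 3])
print(is_hashable(col))        # False: the mutable column refuses to hash
print(is_hashable([1, 2, 3]))  # False: built-in lists behave the same way
print(is_hashable((1, 2, 3)))  # True: immutable tuples are hashable
```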
-
Hi,
Thank you for Dask! 🙏
**Describe the issue**:
Following this [discussion on Discourse](https://dask.discourse.group/t/read-parquet-filters-not-working-with-query-optimizer/2912), and a…
-
At present, `DiskDataset` is our workhorse class for large datasets. This class is pretty nicely optimized with a cache and everything, and I've been able to use it on 50GB datasets without too much t…
-
# Testing Plan
## Dummy Credit Card Application Dataset
### Test 1
- Read each dataset into a dataframe
- Time the creation of each dataframe
- Join the dataframes
- Filter out USA…
-
The `dask.datasets` module includes functions like `dask.datasets.timeseries` and `dask.datasets.make_people`, which create Dask dataframes and Dask bags, respectively, from random data.
It would be useful to hav…
-
I used dask (and xarray) to combine a set of H5py files into a dataframe.
This worked great until I updated dask from 2.28 to 2021.07.1.
If I run the same script now, I always run out of memory, …
-
From @rabernat on [Twitter](https://twitter.com/rabernat/status/1330707155742322689):
> "Xarray has some secret private classes for lazily indexing / wrapping arrays that are so useful I think they…
-
When dataframes are shuffled, dask builds a hash of the index for each partition and buckets the hashes modulo n_partitions. cuDF has an optimized hash partitioning scheme:
https://github.com/rapi…
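
The bucketing step described above (hash each index value, then take the hash modulo `n_partitions`) can be sketched in plain Python. This is an illustration under stated assumptions, not the dask or cuDF implementation: partitions are modeled as lists of `(index, value)` rows, and `stable_hash` is a hypothetical helper that uses `zlib.crc32` as a stand-in for whatever hash function the real shuffle uses.

```python
import zlib


def stable_hash(key):
    # Assumption: crc32 of the key's repr stands in for the real hash function.
    return zlib.crc32(repr(key).encode("utf-8"))


def hash_partition(partitions, n_partitions):
    """Route every (index, value) row to bucket hash(index) % n_partitions."""
    buckets = [[] for _ in range(n_partitions)]
    for part in partitions:
        for index, value in part:
            buckets[stable_hash(index) % n_partitions].append((index, value))
    return buckets


# Two input partitions; the index "a" appears in both of them.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]
shuffled = hash_partition(partitions, n_partitions=3)

for i, bucket in enumerate(shuffled):
    print(i, bucket)
```

Because the bucket depends only on the index value, rows that share an index always land in the same output partition, whichever input partition they came from.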