-
It would be nice to support loading data as dask.delayed dataframes (like `hax.minitrees.load`, or at least somewhat like it), so we can use Dask's parallel and out-of-core computation features.
-
The `dask.datasets` module includes functions like `dask.datasets.timeseries` or `dask.datasets.make_people`, which build Dask DataFrames and Dask Bags, respectively, from random data.
It would be useful to hav…
-
When dataframes are shuffled, Dask computes a hash of the index values in each partition and buckets rows by hash modulo `n_partitions`. cuDF has an optimized hash-partitioning scheme:
https://github.com/rapi…
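The CPU-side scheme described above can be illustrated with pandas' own hashing utility (a simplified sketch, not dask's actual shuffle code):

```python
import pandas as pd

s = pd.Series(range(8), index=list("abcdefgh"))
npartitions = 3

# Hash each index value, then bucket rows by hash modulo npartitions,
# mirroring the shuffle scheme described above.
buckets = pd.util.hash_pandas_object(s.index) % npartitions
parts = {b: s[(buckets == b).to_numpy()] for b in range(npartitions)}

# Every row lands in exactly one bucket.
print(sum(len(p) for p in parts.values()))  # 8
```

Because the bucket depends only on the hashed index value, rows with equal index values always land in the same output partition, which is what makes hash-based shuffles correct.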
-
I used dask (and xarray) to combine a set of HDF5 files (read with h5py) into a dataframe.
This worked great until I updated dask from 2.28 to 2021.07.1.
If I run the same script now, I always run out of memory, …
-
Similar to #1498. I think that, as the queries are currently written, it isn't a fair comparison between DataFrame APIs.
For SQL it is fair as the TPCH benchmark states that all engines should parse th…
-
It would be nice to be able to supply `kartothek.io.dask.delayed.merge_datasets_as_delayed` with a list of `dataset_uuids` to merge an arbitrary number of datasets.
This could be implemented by
…
-
Hi, thank you for your hard work.
I am evaluating 2D zeta functions (twisted with harmonic polynomials): the lemniscate.
Just sum over (x, y) of (x^4 - 6 * x^2 * y^2 + y^4) / (x^2 + y^2)^4.
I can do this in nu…
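For comparison, the lattice sum can be written in vectorized numpy (a sketch; the finite cutoff `N` and the function name are mine):

```python
import numpy as np

def lemniscate_sum(N):
    # Sum (x^4 - 6 x^2 y^2 + y^4) / (x^2 + y^2)^4 over nonzero
    # integer pairs (x, y) with |x|, |y| <= N.
    n = np.arange(-N, N + 1, dtype=float)
    x, y = np.meshgrid(n, n, indexing="ij")
    r2 = x**2 + y**2
    r2[N, N] = np.inf  # drop the (0, 0) term (numerator is 0 there anyway)
    return np.sum((x**4 - 6 * x**2 * y**2 + y**4) / r2**4)

print(lemniscate_sum(100))
```

The terms decay like 1/r^4 and the angular factor averages to zero, so the truncated sum settles quickly as `N` grows.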
-
## `r5.xlarge`: Running out of disk space despite having a 50 GB EBS volume & 36 GB RAM with `cnt = cnt.compute(num_workers=10)`
- the two dataframes being joined together are from a 20 GB & 10 GB avr…
-
**Exception**
```
ValueError: The columns in the computed data do not match the columns in the provided metadata
Order of columns does not match
```
**Repro code**
```
from dask.dataframe import …
```
-
In SQL, it's common to work with large data and aggregate or filter it down to few enough rows that the result could be merged into a single in-memory partition.
Today you can achieve this with something lik…