-
## `r5.xlarge`: Running out of disk space despite having a 50GB EBS volume & 36GB RAM with `cnt = cnt.compute(num_workers=10)`
- the two dataframes being joined together are from a 20 GB & 10 GB avr…
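A rough sketch of the shape of this workload (placeholder paths and join key, with `read_parquet` standing in for however the two frames are actually loaded; pointing dask's `temporary-directory` at the large EBS mount is one assumption about where shuffle spill files should land):

```python
import dask
import dask.dataframe as dd

# Send shuffle/spill intermediates to the large EBS mount (placeholder path)
# instead of the default, often smaller, temp location.
dask.config.set({"temporary-directory": "/mnt/ebs/dask-tmp"})

# Placeholder inputs standing in for the ~20GB and ~10GB sources.
left = dd.read_parquet("s3://bucket/left/")
right = dd.read_parquet("s3://bucket/right/")

# The shuffle behind a large-on-large merge is what writes the on-disk
# intermediates that can fill the volume.
cnt = left.merge(right, on="key")
cnt = cnt.compute(num_workers=10)
```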
-
At present, `DiskDataset` is our workhorse class for large datasets. This class is pretty well optimized, with caching and so on, and I've been able to use it on 50GB datasets without too much t…
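For reference, a minimal sketch of the usual flow, assuming this refers to `deepchem.data.DiskDataset` (synthetic arrays and a throwaway `data_dir`):

```python
import numpy as np
import deepchem as dc

# Synthetic features/labels standing in for a real featurized dataset.
X = np.random.rand(1000, 128)
y = np.random.rand(1000, 1)

# Materialize the data as shards on disk under data_dir; shards are cached
# as they are read back.
dataset = dc.data.DiskDataset.from_numpy(X, y, data_dir="/tmp/diskdataset_demo")

# Iterate in fixed-size batches without pulling everything into memory.
for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=64, deterministic=True):
    pass  # train on / process the batch here
```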
-
Hi all,
We're hoping to use dask/s3fs for the use case below:
1) We have many large binary data files stored on S3, which we hope to process
2) Our aim is to load parts of the data into dask dat…
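A rough sketch of the pattern we have in mind (bucket name, byte ranges, and the record-decoding step are placeholders), reading byte ranges lazily via `s3fs` and `dask.delayed` and stitching the pieces into a dask dataframe:

```python
import dask
import dask.dataframe as dd
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False)

@dask.delayed
def load_chunk(path, start, end):
    # Read only the byte range needed for this partition.
    with fs.open(path, "rb") as f:
        f.seek(start)
        raw = f.read(end - start)
    # Placeholder: decode the binary records into a DataFrame here.
    return pd.DataFrame({"n_bytes": [len(raw)]})

paths = fs.glob("my-bucket/binary-data/*.bin")  # placeholder bucket/prefix
chunks = [load_chunk(p, 0, 1_000_000) for p in paths]

# Stitch the lazy partitions together; meta describes the decoded schema.
meta = pd.DataFrame({"n_bytes": pd.Series(dtype="int64")})
ddf = dd.from_delayed(chunks, meta=meta)
```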
-
For many use cases (like Xenium), points can be handled completely in memory without issue. Given that, and all the reasons the first "best practice" in the dask dataframes documentation is ["use panda…
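For illustration, a small sketch of what following that advice looks like (synthetic points table; `dd.from_pandas` stands in for however the points were originally loaded):

```python
import dask.dataframe as dd
import pandas as pd

# Placeholder points table that comfortably fits in memory.
pdf = pd.DataFrame({"x": range(1_000_000), "y": range(1_000_000)})
ddf = dd.from_pandas(pdf, npartitions=8)

# Following the "use pandas" advice: when the data fits, drop down to a
# plain in-memory DataFrame and skip the task-graph overhead entirely.
points = ddf.compute()
assert isinstance(points, pd.DataFrame)
```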
-
It would be nice to be able to supply `kartothek.io.dask.delayed.merge_datasets_as_delayed` with a list of `dataset_uuids` to merge an arbitrary number of datasets.
This could be implemented by
…
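Purely as a hypothetical sketch of the call shape being requested (the list-accepting parameter does not exist today; the UUIDs and the store factory are made up):

```python
from kartothek.io.dask.delayed import merge_datasets_as_delayed

def store_factory():
    ...  # placeholder: a real store factory would be supplied here

# Hypothetical signature -- today the function merges exactly two datasets.
tasks = merge_datasets_as_delayed(
    dataset_uuids=["uuid_a", "uuid_b", "uuid_c"],  # made-up identifiers
    store=store_factory,
)
```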
-
As stated in the docs, "Blaze includes nascent support for out-of-core processing with Pandas DataFrames and NumPy NDArrays". http://blaze.readthedocs.org/en/latest/ooc.html#parallel-processing.
Sho…
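For concreteness, this is the kind of out-of-core pattern in question, shown here with a plain pandas chunked read rather than Blaze's own API (file and column names are placeholders):

```python
import pandas as pd

total = 0.0
# Stream the file in manageable pieces instead of loading it all at once.
for chunk in pd.read_csv("large_file.csv", chunksize=1_000_000):
    total += chunk["value"].sum()
print(total)
```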
-
We recently added a `dataframe.dtype_backend` config option for specifying whether classic `numpy`-backed dtypes (e.g. `int64`, `float64`, etc.) or `pyarrow`-backed dtypes (e.g. `int64[pyarrow]`, `flo…
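A short sketch of toggling the option (the `"pyarrow"` value string and where the option takes effect are my assumptions):

```python
import dask
import dask.dataframe as dd

# Opt in to pyarrow-backed dtypes; "pyarrow" is assumed to be the accepted
# value string for this option.
dask.config.set({"dataframe.dtype_backend": "pyarrow"})

ddf = dd.read_parquet("s3://bucket/table/")  # placeholder path
print(ddf.dtypes)  # expected to show e.g. int64[pyarrow] rather than int64
```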
-
Similar to #1498. I think that, as the queries are currently written, it isn't a fair comparison between DataFrame APIs.
For SQL it is fair as the TPCH benchmark states that all engines should parse th…
-
**Is your feature request related to a problem? Please describe.**
cudf columns are mutable and therefore do not (or should not) implement `__hash__` (in the same way that numpy arrays do not do so…
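For reference, the numpy behaviour being pointed to: mutable arrays opt out of hashing by setting `__hash__` to `None`, so `hash()` raises.

```python
import numpy as np

arr = np.array([1, 2, 3])
print(np.ndarray.__hash__ is None)  # True -- instances are explicitly unhashable

try:
    hash(arr)
except TypeError as exc:
    print(exc)  # unhashable type: 'numpy.ndarray'
```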
-
### 🚀 The feature, motivation and pitch
Startup time for `DataLoader` workers can be very slow when using a `Dataset` object of even moderate size. The reason is that each worker process is started…
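A small illustrative sketch of the setup being described (sizes and the dataset body are made up): a `Dataset` holding a large in-memory Python structure, so every one of the `num_workers` processes pays a noticeable cost before the first batch arrives.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class BigListDataset(Dataset):
    """Illustrative dataset holding a large in-memory Python structure."""

    def __init__(self, n=2_000_000):
        # Millions of small Python objects -- costly to reproduce in each worker.
        self.items = [(i, float(i)) for i in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        _, x = self.items[idx]
        return torch.tensor([x])

if __name__ == "__main__":
    loader = DataLoader(BigListDataset(), batch_size=64, num_workers=8)

    # Worker startup cost is paid here, before the first batch is yielded.
    for batch in loader:
        break
```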