-
A common pattern in dask is to shuffle distributed data around by some hash-based index. For example, this comes up in merging dataframes. Since the determination of index buckets is typically carried…
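As a generic illustration of the bucketing idea (a sketch, not Dask's actual shuffle implementation): rows from both frames are assigned to buckets by hashing the key column, so matching keys land in the same bucket and can be merged locally.
```python
import pandas as pd

def hash_bucket(df, on, nbuckets):
    """Split ``df`` into buckets keyed by a hash of column ``on``."""
    buckets = pd.util.hash_pandas_object(df[on], index=False).to_numpy() % nbuckets
    return {b: group for b, group in df.groupby(buckets)}

left = pd.DataFrame({"key": ["a", "b", "c", "a"], "x": range(4)})
right = pd.DataFrame({"key": ["a", "c"], "y": [10, 20]})

left_buckets = hash_bucket(left, "key", nbuckets=4)
right_buckets = hash_bucket(right, "key", nbuckets=4)

# Each bucket pair can now be merged independently (e.g. on different workers),
# because identical keys hash to the same bucket in both frames.
merged = pd.concat(
    [left_buckets[b].merge(right_buckets[b], on="key")
     for b in left_buckets if b in right_buckets]
)
print(merged)
```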
-
joblib currently defaults to md5-hashing its input. For the tasks at hand, a non-cryptographic hash can be significantly faster (see comparison table at http://cyan4973.github.io/xxHash/).
scikit-lea…
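As a rough illustration of the speed gap (a sketch, not a rigorous benchmark; it assumes the third-party `xxhash` package is installed, and the numbers vary by machine):
```python
# Compare cryptographic md5 against non-cryptographic xxHash on a large
# in-memory buffer. Requires the third-party ``xxhash`` package.
import hashlib
import time

import numpy as np
import xxhash

data = np.random.default_rng(0).bytes(50_000_000)  # ~50 MB of random bytes

t0 = time.perf_counter()
md5_digest = hashlib.md5(data).hexdigest()
t1 = time.perf_counter()
xxh_digest = xxhash.xxh64(data).hexdigest()
t2 = time.perf_counter()

print(f"md5:   {t1 - t0:.3f} s  (digest {md5_digest[:8]})")
print(f"xxh64: {t2 - t1:.3f} s  (digest {xxh_digest[:8]})")
```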
-
I'm trying to set up the merge of two dataframes with a `CategoricalIndex` but am getting a confusing traceback (see below). I was able to track the issue down to the [align_partitions](https://github.com/d…
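For reference, a minimal sketch of the kind of setup described (hypothetical data; whether it reproduces the exact traceback depends on the dask/pandas versions):
```python
# Two dask dataframes backed by a CategoricalIndex, then merged on the index.
import pandas as pd
import dask.dataframe as dd

left = pd.DataFrame(
    {"x": range(4)},
    index=pd.CategoricalIndex(["a", "b", "c", "d"], name="key"),
)
right = pd.DataFrame(
    {"y": range(4)},
    index=pd.CategoricalIndex(["b", "c", "d", "e"], name="key"),
)

dleft = dd.from_pandas(left, npartitions=2)
dright = dd.from_pandas(right, npartitions=2)

result = dd.merge(dleft, dright, left_index=True, right_index=True)
# Depending on the installed versions, this compute may raise in the
# partition-alignment step rather than return the merged frame.
print(result.compute())
```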
-
In GitLab by @Huite on Nov 18, 2023, 17:38
I got this report from Jacco.
If the resolution is small enough in these methods, it will result in duplicate entries in the output:
```python
imod.prepar…
-
### What is your issue?
_I think that this is a longstanding problem. Sorry if I missed an existing GitHub issue._
I was looking at a Dask-array-backed Xarray workload with @phofl and we were bo…
-
In many cases we read tabular data from some source, modify it, and write it out to another data destination. In this transfer we have an opportunity to tighten the data representation a bit, for exam…
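As an illustration of the kind of tightening meant here (a sketch with a synthetic frame and a crude cardinality heuristic):
```python
import pandas as pd

# In practice ``df`` would come from e.g. pd.read_csv(...); a small in-memory
# frame is used here so the sketch is self-contained.
df = pd.DataFrame(
    {
        "id": range(1000),                         # int64 by default
        "value": [float(i) for i in range(1000)],  # float64 by default
        "group": ["a", "b", "c", "d"] * 250,       # low-cardinality strings
    }
)

before = df.memory_usage(deep=True).sum()

for col in df.columns:
    s = df[col]
    if pd.api.types.is_integer_dtype(s):
        df[col] = pd.to_numeric(s, downcast="integer")
    elif pd.api.types.is_float_dtype(s):
        df[col] = pd.to_numeric(s, downcast="float")
    elif pd.api.types.is_object_dtype(s) and s.nunique() < 0.5 * len(s):
        df[col] = s.astype("category")  # crude low-cardinality heuristic

after = df.memory_usage(deep=True).sum()
print(f"{before} -> {after} bytes")
# df.to_parquet("output.parquet")  # hypothetical destination
```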
-
The list of links will be more digestible if you group the large block of links into smaller categories, so a reader can navigate it. Here is a rough grouping proposal:
### 1. Python for earth scientists
##…
-
I naively tried to do `dd.merge(a, b, on="column_with_ten_values")`, where `a` and `b` were both large DataFrames with thousands of partitions each.
Eventually the compute failed with:
```python-t…
-
I would like to be able to call `stack()` on the Dask-cuDF dataframe. Currently `stack` is a method on the cuDF dataframe, but not on the Dask-cuDF dataframe.
I would like the stacking to take place on the wor…
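One possible workaround sketch, using `map_partitions` so that each partition is stacked on the worker that holds it; shown with a pandas-backed dask dataframe, and the same pattern would presumably carry over to dask_cudf since each cuDF partition already implements `stack`:
```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Each partition is stacked where it lives; the result is a dask Series
# with a MultiIndex of (row label, column name).
stacked = ddf.map_partitions(lambda part: part.stack())
print(stacked.compute())
```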
-
I'm running into a MemoryError when I try to save to Parquet or to `repartition` (by size).
I didn't have this issue before, but after merging two dask dataframes it gives me this error.
The…
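For context, a common mitigation sketch (not necessarily the fix for this particular error): split the merged result into more, smaller partitions before writing, so each Parquet part stays small. The frames below are tiny stand-ins for the real ones.
```python
import pandas as pd
import dask.dataframe as dd

# Small stand-ins for the two frames being merged (the real ones are large).
a = dd.from_pandas(pd.DataFrame({"id": range(1000), "x": range(1000)}), npartitions=4)
b = dd.from_pandas(pd.DataFrame({"id": range(1000), "y": range(1000)}), npartitions=4)

merged = dd.merge(a, b, on="id")

# Split the result into more, smaller partitions before writing, so each
# Parquet part (and each worker's in-memory piece) stays small.
merged = merged.repartition(npartitions=merged.npartitions * 4)
merged.to_parquet("merged_output/", write_index=False)  # needs a parquet engine, e.g. pyarrow
```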