-
After spending some time working out why my groupby operation was not working, I came across https://examples.dask.org/dataframes/02-groupby.html#Many-groups. If it weren't for the great docs around das…
-
When dataframes are shuffled, dask builds a hash of the index for each partition and buckets the hashes modulo n_partitions. cuDF has an optimized hash partitioning scheme:
https://github.com/rapi…
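The bucketing step above can be sketched in plain Python. This is a simplified illustration of hash partitioning, not dask's or cuDF's actual implementation (dask hashes index values with pandas utilities rather than the builtin `hash`):

```python
# Simplified sketch of hash-based shuffling: each index value is hashed
# and assigned to a bucket modulo the target partition count, so equal
# keys always land in the same output partition.
from collections import defaultdict

def hash_partition(index_values, n_partitions):
    """Bucket values by hash(value) % n_partitions."""
    buckets = defaultdict(list)
    for value in index_values:
        buckets[hash(value) % n_partitions].append(value)
    return dict(buckets)

buckets = hash_partition(range(10), n_partitions=3)
# Every value lands in exactly one of the 3 buckets.
assert sum(len(v) for v in buckets.values()) == 10
```

Because the bucket assignment depends only on the key's hash, two dataframes shuffled with the same scheme become co-partitioned, which is what makes a subsequent merge cheap.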
-
The `dask.datasets` module includes functions like `dask.datasets.timeseries` or `dask.datasets.make_people` for generating Dask dataframes or Dask bags, respectively, from random data.
It would be useful to hav…
-
## `r5.xlarge`: Running out of disk space despite having a 50 GB EBS volume & 36 GB RAM with `cnt = cnt.compute(num_workers=10)`
- the two dataframes being joined together are from a 20 GB & 10 GB avr…
-
*edit by TomAugspurger*
Currently partitions within a dask DataFrame do not know their own length. Anything using the length of the DataFrame or the partitions will need to compute it at runtime. …
-
For many use cases (like Xenium) points can be handled completely in memory without issue. Given that, and all the reasons the first "best practice" in the dask dataframes documentation is ["use panda…
-
The latest release `2024.3.0` enabled query planning for `DataFrame`s by default. This issue can be used to report feedback and ask related questions.
If you encountered a bug or unexpected behavio…
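For anyone who needs to compare against the legacy implementation, query planning can be disabled through dask's config (this uses the `dataframe.query-planning` key documented for releases around 2024.3.0; it must be set before `dask.dataframe` is first imported in the process):

```python
import dask

# Must run before the first `import dask.dataframe`; afterwards
# `import dask.dataframe as dd` uses the legacy (non-expr) implementation.
dask.config.set({"dataframe.query-planning": False})
print(dask.config.get("dataframe.query-planning"))  # False
```

The same toggle is available via the `DASK_DATAFRAME__QUERY_PLANNING=False` environment variable, which is handy when you cannot control import order.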
-
There are a number of optimised libraries for many packages, with optimisation at different levels...
## Intel Optimisations
* [Intel Extensions for Scikit-learn](https://intel.github.io/scikit-l…
-
```python
import dask
import dask.dataframe as dd
import pandas as pd  # needed for pd.DataFrame below

def process_df(df):
    return df

def make_df():
    return pd.DataFrame([[1, 3], [2, 3], [3, 4]], columns=['A', 'B'])

a = dd.from_delay…
```
-
In SQL, it's common to work with large data and aggregate or filter it down to few enough rows that it could be merged into a single partition in memory.
Today you can achieve this with something lik…