-
This concerns the scripts for extracting summary stats, and possibly the conversion of tables to datasets for LLM training.
For the summary stats, we're currently using the Python engine to read in…
-
Hi,
I'm writing a notebook example to highlight some key differences between pandas and dask. Are you interested in such a PR?
If so, I currently have the following topics (are there any addition…
-
It would be nice to be able to supply `kartothek.io.dask.delayed.merge_datasets_as_delayed` with a list of `dataset_uuids` to merge an arbitrary number of datasets.
This could be implemented by
…
-
One level, the fallback, would be the prototype in #8. This should always work, but is expensive since it requires compact Xarray datasets to be unraveled.
The other level would be more like xql tod…
-
Dask supports various serialization methods for its DataFrames (see [here](https://distributed.dask.org/en/latest/serialization.html)), and for the `EnsembleFrame` hierarchy we should validate that we…
-
**Need Dask DataFrame support for `create_report` - need to materialize computes**
When the input dataframe is constructed from a Dask DataFrame, `create_report(df)` throws an error:
"Missing Cells": float…
-
Currently, when trying out [this notebook](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/distributed_data_classification/distributed_data_classification.ipynb) with a CPU Dask DataFrame, …
-
I suspect that we will often want to use tsqr with unknown chunk sizes, which occur whenever someone converts a dask dataframe into a dask array (dataframes don't maintain chunk sizes). Currently we…
-
I am working on a PR for #8331 (coming soon) and ran into a problem with `dd.concat`.
As mentioned elsewhere (#7500, #7473), the `ignore_index` keyword is not being used.
Here is a test I added that fails w…
-
I'm unable to write a particular dataframe to S3.
## Code overview
Read a parquet file from S3
```python
import dask.dataframe as dd
df = dd.read_parquet(path=[f's3://{bucket}/{path_to_file}'],…