Open guozhans opened 8 months ago
Can you post a reproducible example? E.g. have some code that creates the parquet files that you read. I have trouble reproducing the error that you are seeing
Hi @phofl
I attached a small dataset and my pip installation at the bottom; you can try this dataset with my other script. The dataset is quite small, so it shouldn't cause you OOM. At least it doesn't cause OOM here. :)
Also, you must enable the worker plugin backed by loky; if you use a nanny instead, the script runs successfully.
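For contrast, here is a minimal sketch of the nanny-based setup that completes without the error; only the cluster and executor wiring differ from the full reproducer below, and the details are my assumption about that working configuration.

    import logging
    from distributed import LocalCluster, Client

    # Minimal sketch of the nanny-based variant: processes=True gives one
    # nanny-supervised process per worker with the default thread-pool
    # executor, so no loky plugin and no dask.annotate(executor="process")
    # block are registered.
    cluster = LocalCluster(n_workers=4, processes=True, silence_logs=logging.DEBUG)
    with Client(cluster) as client:
        ...  # same read_parquet / merge / set_index code as in the script below

And here is the full reproducer with the loky worker plugin: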
import logging
import dask
import dask.dataframe as dd
import pandas as pd
import dask.config as dc
from dask.delayed import delayed
from distributed import WorkerPlugin, Worker, LocalCluster, Client
from loky import ProcessPoolExecutor


class TaskExecutorPool(WorkerPlugin):
    def __init__(self, logger, name):
        self.logger = logger
        self.worker = None
        self.name = name

    def setup(self, worker: Worker):
        executor = ProcessPoolExecutor(max_workers=worker.state.nthreads)
        worker.executors[self.name] = executor
        self.worker = worker

    def transition(self, key, start, finish, *args, **kwargs):
        if finish == 'error':
            ts = self.worker.tasks[key]
            exc_info = (type(ts.exception), ts.exception, ts.traceback)
            print(f"Task traceback: {ts.traceback}")
            print(f"Task exception: {exc_info}")
            self.logger.error(f"Error during computation of {key}, caused by {str(ts.exception)}.")


def main():
    cluster = LocalCluster(n_workers=4, processes=False, silence_logs=logging.DEBUG)
    with Client(cluster) as client:
        client.register_plugin(TaskExecutorPool(logging, "process"), name="process")
        with dask.annotate(executor="process", retries=10):
            dc.set({"dataframe.convert-string": False})
            ways = dd.read_parquet(
                "djibouti-latest.osm/way", columns=["id", "nodes"], blocksize="8MiB")
            node_coordinates = dd.read_parquet(
                "djibouti-latest.osm/node", columns=["latitude", "longitude"], index=["id"])
            way_dfs = ways.to_delayed()
            delays = []
            for way_df in way_dfs:
                delays.append(delayed(create_df)(way_df))
            dfs = dd.compute(*delays)
            df = dd.concat([*dfs])
            df = dd.merge(df, node_coordinates, left_on=["nodeId"], right_on=["id"], right_index=True).set_index("id", shuffle_method="p2p")
            print(f"df = {df.compute()}")


def create_df(way_df):
    new_df = way_df.set_index("id").nodes.apply(
        lambda ns: pd.Series([n["nodeId"] for n in ns], dtype="int64")
    ).convert_dtypes(convert_integer=True).stack().reset_index(0, name="nodeId")
    return dd.from_pandas(new_df, npartitions=10)


if __name__ == "__main__":
    main()
My pip install: pip install dask-kubernetes==2024.3.1 dask[complete]=="2024.2.1" dask-geopandas pandas==2.1.4 pandas[performance]==2.1.4 numpy==1.22.4 jupyter-server-proxy pyarrow==15.0.2 shapely==2.0.3 pyproj==3.6.1 geopandas==0.14.3 geoparquet==0.0.3 wheel loky==3.4.1 graphviz
Thanks! Is it possible to reproduce this without the tar file?
I don't understand. Can't you use this data? Are you checking something, or do you mean Dask only works on specific datasets? Perhaps you could give more context.
I didn't try to reproduce it with other data, but I have observed the same issue with similar datasets.
It's always better to have something that developers can just copy-paste. Here is some context about why downloading those files can be a little concerning:
https://github.com/dask/dask/issues/10995#issuecomment-2014736296
Hi @phofl, I see, and thanks for providing some context.
If I copy and paste the data, it somehow changes the data format. The attached file only has a few hundred lines, so I hope it will work for you.
Or you can download the complete Île de Clipperton PBF data from https://download.geofabrik.de/australia-oceania.html, transform it into parquet files with osm-parquetizer, and then separate the ways and nodes; a rough sketch of that split follows.
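A rough sketch of that last step, assuming osm-parquetizer has already produced per-entity parquet files; the input file names and the exact columns are assumptions based on what the reproducer above reads.

    import dask.dataframe as dd

    # Assumed names for the osm-parquetizer output files; adjust to whatever
    # the tool actually produces for the downloaded PBF.
    ways = dd.read_parquet("ile-de-clipperton-latest.osm.pbf.way.parquet",
                           columns=["id", "nodes"])
    nodes = dd.read_parquet("ile-de-clipperton-latest.osm.pbf.node.parquet",
                            columns=["id", "latitude", "longitude"])

    # Write the two entity types into the way/ and node/ directories that
    # the reproducer reads from.
    ways.to_parquet("ile-de-clipperton-latest.osm/way")
    nodes.set_index("id").to_parquet("ile-de-clipperton-latest.osm/node")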
Sam
Describe the issue: Hi, I encountered this error and don't know what happened under the hood, so I am opening this issue for better tracking.
I have some spatial datasets in parquet format, with a 64 MB row group size, that contain nodes and coordinates. I never intend to do any repartition or shuffle operations explicitly, but they can happen during a merge, a set_index, and similar operations. The hash-join issue appears once the dataset is larger than a single row group, i.e. large enough to create a few partitions in the dataframe for the hash-join operation; the error always shows up in a hash-join-transfer-xxxxxxxxx task.
To work around this issue, I changed the shuffle method back to "tasks"; a minimal sketch is below.
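A minimal sketch of that workaround on tiny stand-in frames: the per-operation shuffle_method keyword is the same one the reproducer already uses for "p2p", while the global config key is an assumption based on recent Dask releases.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Change the default shuffle method globally; the config key is an
    # assumption based on recent dask releases.
    dask.config.set({"dataframe.shuffle.method": "tasks"})

    # Tiny stand-ins for the real way/node frames from the reproducer.
    left = dd.from_pandas(pd.DataFrame({"nodeId": [1, 2, 3]}), npartitions=2)
    right = dd.from_pandas(
        pd.DataFrame({"latitude": [0.1, 0.2, 0.3], "longitude": [1.0, 2.0, 3.0]},
                     index=pd.Index([1, 2, 3], name="id")),
        npartitions=2)

    # Request the task-based shuffle per operation instead of p2p.
    merged = dd.merge(left, right, left_on="nodeId", right_index=True,
                      shuffle_method="tasks").set_index("nodeId", shuffle_method="tasks")
    print(merged.compute())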
Error messages:
Minimal Complete Verifiable Example:
Anything else we need to know?:
Environment: