geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

ddf._meta_nonempty doesn't instantiate correctly when calling `from_dask_dataframe` #286

Open taneugene opened 5 months ago

taneugene commented 5 months ago

When I load a CSV into dask first and then convert it to a dask-geopandas dataframe using .from_dask_dataframe, ._meta_nonempty does not exist, causing downstream problems in analysis (e.g. with spatial_shuffle). My hackish solution below takes the head, uses from_geopandas to get the meta, and then replaces the meta in the original. It would be nice to make this just work directly! I'm not sure if it replicates for other people.

import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import shapely.wkt

# Load a csv file (fname is the path to the csv)
df = dd.read_csv(
    fname,
    dtype={"longitude": float, "latitude": float, "geometry": "object"},
).repartition(npartitions=njobs)  # njobs is the number of workers I have
# Parse the WKT strings in the geometry column using shapely
df["geometry"] = df.geometry.map(shapely.wkt.loads, meta=("geometry", "object"))
# Create a tmp dataframe using a GeoDataFrame and from_geopandas
tmp = dg.from_geopandas(
    gpd.GeoDataFrame(df.head(), geometry="geometry", crs="EPSG:4326"),
    npartitions=1,
)

# Now create the dask_geopandas df
df = dg.from_dask_dataframe(df)

# Need to set metadata here, otherwise spatial_shuffle won't run.
df._meta = tmp.compute()
df = df.spatial_shuffle()
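
For reference, dask conventionally stores metadata as a zero-row frame with the right columns and dtypes, whereas tmp.compute() above still carries the sampled rows. Here is a minimal sketch of building such an empty GeoDataFrame from a small sample. The sample values are made up for illustration, and whether assigning the result to the private _meta attribute is enough for spatial_shuffle is an assumption I have not verified.

```python
import geopandas as gpd
import pandas as pd
import shapely.wkt

# Hypothetical stand-in for df.head(): same columns as the csv in the snippet above.
sample = pd.DataFrame(
    {
        "longitude": [0.0, 1.0],
        "latitude": [0.0, 1.0],
        "geometry": ["POINT (0 0)", "POINT (1 1)"],
    }
)

# Parse the WKT strings into shapely geometries, as in the snippet above.
sample["geometry"] = sample["geometry"].map(shapely.wkt.loads)

# Build a GeoDataFrame so the geometry column gets the geometry dtype and a CRS.
gdf = gpd.GeoDataFrame(sample, geometry="geometry", crs="EPSG:4326")

# Keep the columns, dtypes and CRS but drop every row.
meta = gdf.iloc[:0]

print(meta.dtypes)
```

This only shows what the metadata object could look like. Overwriting _meta by hand is still a workaround rather than a supported API.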

TomAugspurger commented 5 months ago

Thanks for the report. Can you share a fully reproducible example so that I can look into it?