geopandas / dask-geopandas

Parallel GeoPandas with Dask
https://dask-geopandas.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Why do the number and x columns of '201105.shp' become 0 in the output of this code? #261

Open 1jiangxd opened 7 months ago

1jiangxd commented 7 months ago

Can someone explain why the number and x columns of '201105.shp' become 0 in the output of this code?

(Two shp files have been uploaded to my GitHub repository) https://github.com/1jiangxd/daskgeopandasproblems

The code I used is below. When I check the processed '201105.shp', only the first 2 million rows were processed, and the original values in the remaining rows were changed to 0. Where is the problem in this code? I would greatly appreciate any help.

import geopandas as gpd
import time

import dask_geopandas

def process_row(row):
    outwen = r'201105.shp'
    bianjie = r'2023xian.shp'
    jiabianjie = r'E:\201105out'

    start_time3 = time.time()

    # Read input and clipped boundary shapefiles
    target_gdf = gpd.read_file(outwen)
    join_gdf = gpd.read_file(bianjie)

    # Switch to dask approach
    target_gdfnew = dask_geopandas.from_geopandas(target_gdf, npartitions=4)

    # Reproject the boundary participating in the join to match the CRS of the target geometry
    join_gdf = join_gdf.to_crs(target_gdf.crs)

    # Switch to dask approach
    join_gdfnew = dask_geopandas.from_geopandas(join_gdf, npartitions=4)

    # Use spatial join to find intersecting parts
    joined = gpd.sjoin(target_gdfnew, join_gdfnew, how='inner', predicate='intersects')

    # Add attributes from 'bianjie' to 'outwen'
    joined = joined.drop(columns='index_right')  # Remove redundant index column
    result = target_gdfnew.merge(joined, how='left', on=target_gdfnew.columns.to_list())

    # Save the result to the output boundary
    result.to_file(jiabianjie, encoding='utf-8-sig')  # Ensure the correct encoding is used

    end_time3 = time.time()
    execution_time3 = end_time3 - start_time3

    print(f"'{jiabianjie}' has added boundaries. Start time: {start_time3:.2f}, End time: {end_time3:.2f}, Execution time: {execution_time3:.2f} seconds")

process_row()

print('Finish')

jorisvandenbossche commented 4 months ago

@1jiangxd apologies for the slow reply, but looking at your code, the following lines

    # Add attributes from 'bianjie' to 'outwen'
    joined = joined.drop(columns='index_right')  # Remove redundant index column
    result = target_gdfnew.merge(joined, how='left', on=target_gdfnew.columns.to_list())

are typically not needed. The result of the spatial join, joined, already has the columns of the original target_gdf, so this additional merge does nothing except bring back the original rows of target_gdf that had no match in the spatial join. To achieve the same, you can do a left join (specifying how='left' in the sjoin call).
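As a minimal sketch of the left-join behaviour described here, the snippet below uses plain geopandas.sjoin on tiny in-memory data. The column names and geometries are hypothetical stand-ins for the two shapefiles, not the real data. Note that dask_geopandas.sjoin may only accept how='inner', depending on the installed version, so this illustrates the idea rather than a confirmed dask-geopandas call.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical stand-in for '201105.shp': three points, the last one
# lies outside every boundary polygon.
target_gdf = gpd.GeoDataFrame(
    {"number": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5.0, 5.0)],
    crs="EPSG:4326",
)

# Hypothetical stand-in for '2023xian.shp': two boundary polygons.
join_gdf = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# A left spatial join keeps every row of target_gdf; rows without a
# match get missing values in the columns coming from join_gdf.
joined = gpd.sjoin(target_gdf, join_gdf, how="left", predicate="intersects")
joined = joined.drop(columns="index_right")  # drop the helper index column

print(joined[["number", "name"]])
```

With a left join, the unmatched point is kept with a missing value in name, so no follow-up merge against the original frame is needed.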

Also, I assume that the gpd.sjoin in your code above should be dask_geopandas.sjoin ?