geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Recommended way to write to S3? #374

Open codeananda opened 3 months ago

codeananda commented 3 months ago

I can easily read from S3 out of the box (assuming the required env variables are set).

But I cannot write to S3 out of the box.

This works:

import geopandas
from dotenv import load_dotenv

load_dotenv(".env")

a = geopandas.read_file("s3://bucket-name/key.gpkg", engine='pyogrio')

But this doesn't:

a.to_file("s3://bucket-name/written_by_geopandas.gpkg", engine='pyogrio')

Any ideas?

2024-03-11 15:11:06.305 | INFO     | __main__:_write_updated_titiles_to_disk:149 - Writing updated titles GeoDataFrame to disk.
2024-03-11 15:11:06.746 | ERROR    | writers:write:46 - 
        Could not write to file: s3://landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg
        gdf.columns=#columns here...
      dtype='object')
        gdf.head()=    POLY_ID                                           geometry  ...  column_here another_column_here
0  56352124  POLYGON ((348033.380 169232.193, 348033.380 16...  ...                                         NaN                                NaN
1  54913918  POLYGON ((360220.150 169892.100, 360216.954 16...  ...                                         NaN                                NaN
2  56739811  POLYGON ((361916.946 179819.353, 361912.807 17...  ...                                         NaN                                NaN
3  54956921  POLYGON ((359997.850 167736.400, 359998.050 16...  ...                                         NaN                                NaN
4  19424703  POLYGON ((355649.200 176617.900, 355651.400 17...  ...                                         NaN                                NaN

[5 rows x 54 columns]
2024-03-11 15:11:06.747 | ERROR    | writers:write:51 - sqlite3_open(/vsis3/landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg) failed: unable to open database file
Traceback (most recent call last):

  File "pyogrio/_io.pyx", line 1603, in pyogrio._io.ogr_create
    ogr_dataset = exc_wrap_pointer(GDALCreate(ogr_driver, path_c, 0, 0, 0, GDT_Unknown, options))
  File "pyogrio/_err.pyx", line 179, in pyogrio._err.exc_wrap_pointer
    raise exc

pyogrio._err.CPLE_OpenFailedError: sqlite3_open(/vsis3/landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg) failed: unable to open database file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/home/ec2-user/big_bertha/compute_distances_and_intersections.py", line 699, in <module>
    output_gdf = intersector.compute_distances_and_intersections()
                 │           └ <function GridIntersector.compute_distances_and_intersections at 0x7f988cd51ab0>
                 └ <__main__.GridIntersector object at 0x7f988d7ec9a0>

  File "/home/ec2-user/big_bertha/compute_distances_and_intersections.py", line 116, in compute_distances_and_intersections
    output_file = self._write_updated_titiles_to_disk(titles_gdf)
                  │    │                              └         POLY_ID                                           geometry  ...  os_air_travel_interchange_pct_intersection  os_trans...
                  │    └ <function GridIntersector._write_updated_titiles_to_disk at 0x7f988cd51bd0>
                  └ <__main__.GridIntersector object at 0x7f988d7ec9a0>

  File "/home/ec2-user/big_bertha/compute_distances_and_intersections.py", line 154, in _write_updated_titiles_to_disk
    self._gdf_writer.write(titles_gdf, titles_output_file)
    │    │           │     │           └ S3Path('s3://landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg')
    │    │           │     └         POLY_ID                                           geometry  ...  os_air_travel_interchange_pct_intersection  os_trans...
    │    │           └ <function GDFWriter.write at 0x7f988cd516c0>
    │    └ <writers.GDFWriter object at 0x7f988d7ee5c0>
    └ <__main__.GridIntersector object at 0x7f988d7ec9a0>

> File "/home/ec2-user/big_bertha/writers.py", line 44, in write
    gdf.to_file(output_path, mode=mode)
    │   │       │                 └ 'w'
    │   │       └ S3Path('s3://landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg')
    │   └ <function GeoDataFrame.to_file at 0x7f988ea9d630>
    └         POLY_ID                                           geometry  ...  os_air_travel_interchange_pct_intersection  os_trans...

  File "/home/ec2-user/.cache/pypoetry/virtualenvs/big-bertha-zeGGBcQi-py3.10/lib/python3.10/site-packages/geopandas/geodataframe.py", line 1246, in to_file
    _to_file(self, filename, driver, schema, index, **kwargs)
    │        │     │         │       │       │        └ {'mode': 'w'}
    │        │     │         │       │       └ None
    │        │     │         │       └ None
    │        │     │         └ None
    │        │     └ S3Path('s3://landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg')
    │        └         POLY_ID                                           geometry  ...  os_air_travel_interchange_pct_intersection  os_trans...
    └ <function _to_file at 0x7f988ea9f130>
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/big-bertha-zeGGBcQi-py3.10/lib/python3.10/site-packages/geopandas/io/file.py", line 635, in _to_file
    _to_file_pyogrio(df, filename, driver, schema, crs, mode, **kwargs)
    │                │   │         │       │       │    │       └ {}
    │                │   │         │       │       │    └ 'w'
    │                │   │         │       │       └ None
    │                │   │         │       └ None
    │                │   │         └ 'GPKG'
    │                │   └ S3Path('s3://landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg')
    │                └         POLY_ID                                           geometry  ...  os_air_travel_interchange_pct_intersection  os_trans...
    └ <function _to_file_pyogrio at 0x7f988ea9f250>
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/big-bertha-zeGGBcQi-py3.10/lib/python3.10/site-packages/geopandas/io/file.py", line 685, in _to_file_pyogrio
    pyogrio.write_dataframe(df, filename, driver=driver, **kwargs)
    │       │               │   │                │         └ {}
    │       │               │   │                └ 'GPKG'
    │       │               │   └ S3Path('s3://landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg')
    │       │               └         POLY_ID                                           geometry  ...  os_air_travel_interchange_pct_intersection  os_trans...
    │       └ <function write_dataframe at 0x7f988cd03760>
    └ <module 'pyogrio' from '/home/ec2-user/.cache/pypoetry/virtualenvs/big-bertha-zeGGBcQi-py3.10/lib/python3.10/site-packages/py...
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/big-bertha-zeGGBcQi-py3.10/lib/python3.10/site-packages/pyogrio/geopandas.py", line 548, in write_dataframe
    write(
    └ <function write at 0x7f988cd035b0>
  File "/home/ec2-user/.cache/pypoetry/virtualenvs/big-bertha-zeGGBcQi-py3.10/lib/python3.10/site-packages/pyogrio/raw.py", line 530, in write
    ogr_write(
    └ <cyfunction ogr_write at 0x7f988cec69b0>
  File "pyogrio/_io.pyx", line 1799, in pyogrio._io.ogr_write
    ogr_dataset = ogr_create(path_c, driver_c, dataset_options)
  File "pyogrio/_io.pyx", line 1612, in pyogrio._io.ogr_create
    raise DataSourceError(str(exc))
          └ <class 'pyogrio.errors.DataSourceError'>

pyogrio.errors.DataSourceError: sqlite3_open(/vsis3/landstack-big-bertha/grid_test_1/titles_with_distances_and_intersections_20240311_150903.gpkg) failed: unable to open database file
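
For anyone reading the traceback: the `s3://...` path passed to `to_file` shows up in the error as a `/vsis3/...` path, and the call that fails is `sqlite3_open` on that path. Here is a minimal sketch of that mapping as it appears in the log above. The helper name `s3_url_to_vsi_path` is hypothetical and only illustrates the path rewrite; it is not a pyogrio or GDAL function.

```python
def s3_url_to_vsi_path(url: str) -> str:
    """Rewrite an s3:// URL into the /vsis3/ form seen in the error message.

    Hypothetical helper for illustration only; it mirrors the path
    difference visible between the logged output path and the error text.
    """
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError(f"not an S3 URL: {url!r}")
    # Keep "bucket/key" and put it behind the /vsis3/ prefix.
    return "/vsis3/" + url[len(prefix):]


print(s3_url_to_vsi_path("s3://bucket-name/written_by_geopandas.gpkg"))
```

So the path itself is translated as expected; what fails is opening a SQLite database file for writing at that location.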

codeananda commented 3 months ago

Update: this works (write the file locally first, then upload it to S3):

import boto3
from cloudpathlib import S3Path
from loguru import logger

# `a` is the GeoDataFrame read in the first snippet.
output_path = S3Path("s3://bucket-name/key.gpkg")

if isinstance(output_path, S3Path):
    # Write to a local file in the current directory, named after the key's basename.
    a.to_file(output_path.name, engine='pyogrio')
    logger.info(f"Written to disk locally {output_path.name}")

    # Upload the local file to the bucket under the original key.
    s3 = boto3.resource('s3')
    s3.Bucket(output_path.bucket).upload_file(output_path.name, output_path.key)
    logger.info("Uploaded to S3!")
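
A variation on the same workaround, in case it helps: writing into a temporary directory avoids leaving the local copy in the working directory. This is only a sketch under assumptions. `write_via_temp_file` and `gdf_to_s3` are hypothetical names, the write and upload steps are passed in as callables so the helper has no hard dependency on geopandas or boto3, and credentials are assumed to come from the environment as in the snippets above.

```python
import tempfile
from pathlib import Path
from typing import Callable


def write_via_temp_file(
    write: Callable[[str], None],
    upload: Callable[[str], None],
    filename: str,
) -> None:
    """Write to a temporary local file, upload it, then clean up.

    Hypothetical helper: `write` and `upload` each receive the local path.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = str(Path(tmp_dir) / filename)
        write(local_path)   # e.g. gdf.to_file(local_path, ...)
        upload(local_path)  # e.g. s3.upload_file(local_path, bucket, key)
    # The temporary directory and the local file are removed here.


def gdf_to_s3(gdf, bucket: str, key: str) -> None:
    """Sketch of wiring the helper to geopandas and boto3 (not called here)."""
    import boto3  # imported lazily so the helper above stays dependency-free

    s3 = boto3.client("s3")
    write_via_temp_file(
        write=lambda path: gdf.to_file(path, driver="GPKG", engine="pyogrio"),
        upload=lambda path: s3.upload_file(path, bucket, key),
        filename=Path(key).name,
    )
```

With the GeoDataFrame from the first snippet this would be called as `gdf_to_s3(a, "bucket-name", "written_by_geopandas.gpkg")`.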