geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
257 stars 21 forks source link

BUG: fix write of kml lon/lat transpose #421

Closed nicholas-ys-tan closed 2 weeks ago

nicholas-ys-tan commented 3 weeks ago

Closes #420

My cython is atrocious.

Followed what was done in fiona, qgis and gdal documentation to find this is probably what needs to be done. But I am probably doing it too broadly and may have unintended consequences in other filetypes.

~Still need to write tests.~

Need to investigate how this may impact other file types.

jorisvandenbossche commented 3 weeks ago

We should probably also test this for appending to an existing file (not entirely sure how that will work in that case, because then we don't create the layer object with constructed CRS object ?)

nicholas-ys-tan commented 3 weeks ago

Thanks @brendan-ward ,

Have added your changes and test, will just need to do this one tomorrow:

We should probably also test this for appending to an existing file (not entirely sure how that will work in that case, because then we don't create the layer object with constructed CRS object ?)

nicholas-ys-tan commented 2 weeks ago

I am currently investigating this:

We should probably also test this for appending to an existing file (not entirely sure how that will work in that case, because then we don't create the layer object with constructed CRS object ?)

There seems to be an issue unrelated to this ticket - I was testing the appending to a .kml file, and it introduces a Z-dimension. I am not sure if it is known behaviour but I couldn't find an existing ticket about it. Interestingly, it has nothing to do with append but happens every time a .kml file is over-written.

points = [Point(10, 20), Point(30, 40), Point(50, 60)]
gdf = gpd.GeoDataFrame(geometry=points, crs="EPSG:4326")
output_path = r'/home/nicholas/dev/data/ogr_test/temporary_kml_file.kml'
write_dataframe(
    gdf, output_path, layer="tmp_layer", driver="KML", use_arrow=True
)

gdf_read = read_dataframe(output_path, use_arrow=True)
print(gdf_read.geometry)
0    POINT (10 20)
1    POINT (30 40)
2    POINT (50 60)
Name: geometry, dtype: geometry
points_append = [Point(70, 80), Point(90, 100), Point(110, 120)]
gdf_append = gpd.GeoDataFrame(geometry=points_append, crs="EPSG:4326")
write_dataframe(
    gdf_append, output_path, layer='tmp_layer', driver="KML", use_arrow=True, append=True
)
gdf_read_appended = read_dataframe(output_path, use_arrow=True)

print(gdf_read_appended.geometry)
0      POINT Z (10 20 0)
1      POINT Z (30 40 0)
2      POINT Z (50 60 0)
3      POINT Z (70 80 0)
4     POINT Z (90 100 0)
5    POINT Z (110 120 0)
Name: geometry, dtype: geometry

This doesn't seem to be related to the append feature, but to over-writing a kml file.

points = [Point(10, 20), Point(30, 40), Point(50, 60)]
gdf = gpd.GeoDataFrame(geometry=points, crs="EPSG:4326")
output_path = r'/home/nicholas/dev/data/ogr_test/temporary_kml_file2.kml'
write_dataframe(
    gdf, output_path, layer="tmp_layer", driver="KML", use_arrow=True
)

write_dataframe(
    gdf, output_path, layer="tmp_layer", driver="KML", use_arrow=True
)

print(gdf_read.geometry)
0    POINT Z (10 20 0)
1    POINT Z (30 40 0)
2    POINT Z (50 60 0)
Name: geometry, dtype: geometry

On looking into the KML, can confirm it gets restructured

Before:

<?xml version="1.0" encoding="utf-8" ?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document id="root_doc">
<Folder><name>tmp_layer</name>
  <Placemark>
      <Point><coordinates>10,20</coordinates></Point>
  </Placemark>
  <Placemark>
      <Point><coordinates>30,40</coordinates></Point>
  </Placemark>
  <Placemark>
      <Point><coordinates>50,60</coordinates></Point>
  </Placemark>
</Folder>
</Document></kml>

After over-writing with same data:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document id="root_doc">
    <Document id="tmp_layer">
      <name>tmp_layer</name>
      <Placemark id="tmp_layer.1">
        <Point>
          <coordinates>
            10,20,0
          </coordinates>
        </Point>
      </Placemark>
      <Placemark id="tmp_layer.2">
        <Point>
          <coordinates>
            30,40,0
          </coordinates>
        </Point>
      </Placemark>
      <Placemark id="tmp_layer.3">
        <Point>
          <coordinates>
            50,60,0
          </coordinates>
        </Point>
      </Placemark>
    </Document>
  </Document>
</kml>

This happens in the main branch too, but with the reverse co-ordinates.

I assume this is not something to be addressed within this ticket. I haven't yet confirmed if this is an upstream bug but I am currently assuming it is. As a work-around, I've modified the test for the appended KML file to only compare the xy-coordinates by reading with force_2d=True.

If this isn't already a known issue, I'll do some digging as to why this is happening.

nicholas-ys-tan commented 2 weeks ago

the ubuntu-small CIs don't seem to like to open the produced KML files for appending - will need to look more into it this weekend.

theroggy commented 2 weeks ago

the ubuntu-small CIs don't seem to like to open the produced KML files for appending - will need to look more into it this weekend.

Not sure why this leads to appending not working... but FYI: apparently ubuntu-small doesn't contain the LIBKML driver, so according to de doc kml handling will fallback to the KML driver in that case.

nicholas-ys-tan commented 2 weeks ago

Not sure why this leads to appending nog working... but FYI: apparently ubuntu-small doesn't contain the LIBKML driver, so according to de doc kml handling will fallback to the KML driver in that case.

Oh that is interesting, one thing I did note but neglected to mention in my above text about the Z-dimension is that when I run write_dataframe with LIBKML, it always writes out the Z-dimension even on first write.

This seems to suggest when there is an existing KML file, LIBKML is used to over-write the KML file, hence why the Z-dimension is introduced, and why it is failing in ubuntu-small.

points = [Point(10, 20), Point(30, 40), Point(50, 60)]
gdf = gpd.GeoDataFrame(geometry=points, crs="EPSG:4326")
output_path = r'/home/nicholas/dev/data/ogr_test/temporary_kml_file7.kml'
write_dataframe(
    gdf, output_path, layer="tmp_layer", driver="LIBKML", use_arrow=True
)
gdf_read = read_dataframe(output_path, use_arrow=True)

print(gdf_read.geometry)
0    POINT Z (10 20 0)
1    POINT Z (30 40 0)
2    POINT Z (50 60 0)
Name: geometry, dtype: geometry

and outputs the same KML format as the over-written file from my earlier comment:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document id="root_doc">
    <Document id="tmp_layer">
      <name>tmp_layer</name>
      <Placemark id="tmp_layer.1">
        <Point>
          <coordinates>
            10,20,0
          </coordinates>
        </Point>
      </Placemark>
      <Placemark id="tmp_layer.2">
        <Point>
          <coordinates>
            30,40,0
          </coordinates>
        </Point>
      </Placemark>
      <Placemark id="tmp_layer.3">
        <Point>
          <coordinates>
            50,60,0
          </coordinates>
        </Point>
      </Placemark>
    </Document>
  </Document>
</kml>

I'll investigate further to confirm tomorrow morning.

brendan-ward commented 2 weeks ago

It looks like always adding the Z value when using LIBKML is an artifact of that particular driver and also possibly because Z value is expected by the KML spec? Whereas the KML driver does not automatically add Z values.

Can you log a separate issue that appending a KML file also adds Z value when using the KML driver? Then we can look into that separately from this PR. I'm wondering if in the case of append GDAL switches to using the LIBKML driver if available.

jorisvandenbossche commented 2 weeks ago

the ubuntu-small CIs don't seem to like to open the produced KML files for appending - will need to look more into it this weekend.

Not sure why this leads to appending not working... but FYI: apparently ubuntu-small doesn't contain the LIBKML driver, so according to de doc kml handling will fallback to the KML driver in that case.

The KML driver might just not support appending? In that case you will have to skip the test (or the second part of the test) when LIBKML driver is not available (you can use something like "LIBKML" not in pyogrio.list_drivers())

BTW, the error message we give is a bit confusing. We are adding the suggestion "It might help to specify the correct driver explicitly by prefixing the file path with ':', e.g. 'CSV:path'", which is strange because when writing, the user already explicitly mentions the driver (which in practice is not used when appending I think, but that's not really clear for the user)

nicholas-ys-tan commented 2 weeks ago

The KML driver might just not support appending? In that case you will have to skip the test (or the second part of the test) when LIBKML driver is not available (you can use something like "LIBKML" not in pyogrio.list_drivers())

Thanks @jorisvandenbossche , I think it's not just appending but also over-writing. It seems to use LIBKML if over-writing an existing file regardless of append=True which is quite interesting.

Thanks for the pointer, I was looking for how pyogrio is able to check what drivers are available. Will push the fixes shortly.

points = [Point(10, 20), Point(30, 40), Point(50, 60)]
gdf = gpd.GeoDataFrame(geometry=points, crs="EPSG:4326")
output_path = r'/home/nicholas/dev/data/ogr_test/temporary_kml_file2.kml'
write_dataframe(
    gdf, output_path, layer="tmp_layer", driver="KML", use_arrow=True
)

write_dataframe(
    gdf, output_path, layer="tmp_layer", driver="KML", use_arrow=True
)

print(gdf_read.geometry)
0    POINT Z (10 20 0)
1    POINT Z (30 40 0)
2    POINT Z (50 60 0)
Name: geometry, dtype: geometry
jorisvandenbossche commented 2 weeks ago

I think it's not just appending but also over-writing. It seems to use LIBKML if over-writing an existing file regardless of append=True which is quite interesting.

Yes, but this test it's the append case that matters (so that's the reason I mentioned it explicitly). It seems that the KML driver essentially doesn't allow to open a file in Update mode (although I don't directly see anything in the GDAL code to support that), and when you are appending or writing to an existing file, in both cases our code in ogr_open will call GDALOpenEx with the GDAL_OF_UPDATE flag (when reading, we will call that with GDAL_OF_READONLY flag, that seems to be the main difference). Only in case of overwriting, we will then delete the layer if we could open the file and the layer exists, while in case of append=True we don't do that.

nicholas-ys-tan commented 2 weeks ago

Yes, but this test it's the append case that matters (so that's the reason I mentioned it explicitly). It seems that the KML driver essentially doesn't allow to open a file in Update mode (although I don't directly see anything in the GDAL code to support that), and when you are appending or writing to an existing file, in both cases our code in ogr_open will call GDALOpenEx with the GDAL_OF_UPDATE flag (when reading, we will call that with GDAL_OF_READONLY flag, that seems to be the main difference). Only in case of overwriting, we will then delete the layer if we could open the file and the layer exists, while in case of append=True we don't do that.

Thanks Joris, that all makes sense - I'll include this information in a separate ticket to document.

jorisvandenbossche commented 2 weeks ago

With this fix in, shall we do a 0.8.1 release? (this seems like an important fix to get out for when geopandas switches to use pyogrio by default)