catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
468 stars 107 forks source link

Resolve PyGeos error in nightly builds #1954

Closed bendnorman closed 1 year ago

bendnorman commented 1 year ago

The 81d10c1dff2ae7da83593f70d694d694531a427e-dev nightly build passed but the subsequent build 5264d2664ca664b397793d369cf4a2dec1de2c0a-dev failed with a single error:

=================================== FAILURES ===================================
____________________ test_ferc714_outputs[summarize_demand] ____________________

ferc714_out = <pudl.output.ferc714.Respondents object at 0x7ff11eadceb0>
df_name = 'summarize_demand'

    @pytest.mark.parametrize(
        "df_name",
        [
            "annualize",
            "categorize",
            "summarize_demand",
            "fipsify",
        ],
    )
    def test_ferc714_outputs(ferc714_out, df_name):
        """Test FERC 714 derived output methods."""
        logger.info(f"Running ferc714_out.{df_name}()")
>       df = ferc714_out.__getattribute__(df_name)()

test/integration/output_test.py:207: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/pudl/output/ferc714.py:531: in summarize_demand
    self.georef_counties(update=update)
src/pudl/output/ferc714.py:614: in georef_counties
    self._counties_gdf = pudl.analysis.service_territory.add_geometries(
src/pudl/analysis/service_territory.py:135: in add_geometries
    census_gdf[["geoid10", "namelsad10", "dp0010001", "geometry"]]
env/lib/python3.10/site-packages/pandas/core/frame.py:4515: in assign
    data[k] = com.apply_if_callable(v, data)
env/lib/python3.10/site-packages/pandas/core/common.py:358: in apply_if_callable
    return maybe_callable(obj, **kwargs)
src/pudl/analysis/service_territory.py:144: in <lambda>
    .assign(area_km2=lambda x: x.geometry.to_crs(epsg=6933).area / 1e6)
env/lib/python3.10/site-packages/geopandas/geoseries.py:1117: in to_crs
    self.values.to_crs(crs=crs, epsg=epsg), index=self.index, name=self.name
env/lib/python3.10/site-packages/geopandas/array.py:767: in to_crs
    new_data = vectorized.transform(self.data, transformer.transform)
env/lib/python3.10/site-packages/geopandas/_vectorized.py:971: in transform
    result = pygeos.set_coordinates(data.copy(), np.array(new_coords).T)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

geometry = array([<pygeos.Geometry MULTIPOLYGON (((-15692305.544 5990802.138, -15692642.281 59...>,
       <pygeos.Geometry MULTI... 47....>,
       <pygeos.Geometry MULTIPOLYGON (((-92.079 43.955, -92.079 43.957, -92.079 43....>],
      dtype=object)
coordinates = array([[-15692305.54372587,   5990802.13803348],
       [-15692642.28084394,   5990371.46691768],
       [-15692919.00...  5084279.5278007 ],
       [ -8884360.29002698,   5084356.7100724 ],
       [ -8884364.05297662,   5084382.49888077]])

    def set_coordinates(geometry, coordinates):
        """Adapts the coordinates of a geometry array in-place.

        If the coordinates array has shape (N, 2), all returned geometries
        will be two-dimensional, and the third dimension will be discarded,
        if present. If the coordinates array has shape (N, 3), the returned
        geometries preserve the dimensionality of the input geometries.

        .. warning::

            The geometry array is modified in-place! If you do not want to
            modify the original array, you can do
            ``set_coordinates(arr.copy(), newcoords)``.

        Parameters
        ----------
        geometry : Geometry or array_like
        coordinates: array_like

        See Also
        --------
        apply : Returns a copy of a geometry array with a function applied to its
            coordinates.

        Examples
        --------
        >>> set_coordinates(Geometry("POINT (0 0)"), [[1, 1]])
        <pygeos.Geometry POINT (1 1)>
        >>> set_coordinates([Geometry("POINT (0 0)"), Geometry("LINESTRING (0 0, 0 0)")], [[1, 2], [3, 4], [5, 6]]).tolist()
        [<pygeos.Geometry POINT (1 2)>, <pygeos.Geometry LINESTRING (3 4, 5 6)>]
        >>> set_coordinates([None, Geometry("POINT (0 0)")], [[1, 2]]).tolist()
        [None, <pygeos.Geometry POINT (1 2)>]

        Third dimension of input geometry is discarded if coordinates array does
        not include one:

        >>> set_coordinates(Geometry("POINT Z (0 0 0)"), [[1, 1]])
        <pygeos.Geometry POINT (1 1)>
        >>> set_coordinates(Geometry("POINT Z (0 0 0)"), [[1, 1, 1]])
        <pygeos.Geometry POINT Z (1 1 1)>
        """
        geometry_arr = np.asarray(geometry, dtype=np.object_)
        coordinates = np.atleast_2d(np.asarray(coordinates)).astype(np.float64)
        if coordinates.ndim != 2:
            raise ValueError(
                "The coordinate array should have dimension of 2 "
                "(has {})".format(coordinates.ndim)
            )
        n_coords = lib.count_coordinates(geometry_arr)
        if (coordinates.shape[0] != n_coords) or (coordinates.shape[1] not in {2, 3}):
            raise ValueError(
                "The coordinate array has an invalid shape {}".format(coordinates.shape)
            )
>       lib.set_coordinates(geometry_arr, coordinates)
E       pygeos.GEOSException: IllegalArgumentException: Points of LinearRing do not form a closed linestring

These were the PRs that were merged between the two builds. At first glance, these PRs shouldn't have changed anything with this ferc714 test.

Debugging:

bendnorman commented 1 year ago

The build that finished on 9/28 passed despite running the code from 5264d2664ca664b397793d369cf4a2dec1de2c0a-dev, the previous built that failed :/ This makes me think the PyGeos failure could maybe be an API issue? I can't think of any flaky APIs that are a part of the census processing though. Any ideas @zaneselvans?

zaneselvans commented 1 year ago

I don't think there are any APIs involved. We've just got the one 2010 census DB, which should be the same as it's always been, and that's where the geometries are coming from. If something has changed with the geometries it's got to be in the Python GIS stack: GEOS, pygeos, shapely, fiona, geopandas. I don't think there's any random numbers in there either. Or you know, it could be cosmic rays! Not satisfying.

zaneselvans commented 1 year ago

We decided to punt this issue unless it becomes a recurring problem right? No idea why it's not consistent.

bendnorman commented 1 year ago

Yup, we've decided to punt.