e2nIEE / pandapower

Convenient Power System Modelling and Analysis based on PYPOWER and pandas
https://www.pandapower.org

[bug] Serialization of shapely objects in dataframes creates "intermediate" products #2289

Open dlohmeier opened 1 month ago

dlohmeier commented 1 month ago

Bug report checklist

Reproducible Example

import pandas as pd
import pandapower as pp
import shapely

# Case 1: plain DataFrame with shapely objects in an ordinary column
df = pd.DataFrame({"a": [1, 2], "b": [shapely.Point([1, 4]), shapely.LineString([[1, 2], [4, 6]])]})
json_str = pp.to_json(df)
df2 = pp.from_json_string(json_str)

print(df)
print(df2)  # column "b" comes back as dicts, not shapely objects

import geopandas as gpd

# Case 2: GeoDataFrame with shapely objects outside the geometry column ("b")
df2 = pd.DataFrame({"a": [1, 2], "b": [shapely.Point([1, 4]), shapely.LineString([[1, 2], [4, 6]])], "c": [shapely.Point([1, 9]), shapely.LineString([[1, 2], [4, 4]])]})
gdf = gpd.GeoDataFrame(df2, geometry="c")
json_str_gdf = pp.to_json(gdf)
gdf2 = pp.from_json_string(json_str_gdf)  # read back the GeoDataFrame string, not json_str

Issue Description and Traceback

When running the above code, the shapely data is converted into the internal pandapower serialization format. Upon deserialization, this format is not converted back; each object is kept as a dict with multiple "useless" entries, such as "_module" or "_class". I assume that the reason behind this is that we pass the pandapower to_serializable handler as default_handler to pandas upon serialization, but we can't hand over a registry or decode-hook upon de-serialization. Is that correct? Do you have any idea of how to overcome this problem?

I know that serializing a dataframe is not a good use case for the pandapower.to_json function, but in some cases I do store shapely data inside my net dataframes without making them geopandas dataframes. Additionally, I sometimes use more than one column with geodata, which is why I added the geopandas part of the code. It is completely impossible to encode GeoDataFrames that contain geodata outside the "geometry" column, as the following error occurs:

Traceback (most recent call last):
  File "/home/daniel/workspace/pandapower/pandapower/io_utils.py", line 448, in default
    s = to_serializable(o)
    ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniel/workspace/pandapower/pandapower/io_utils.py", line 985, in json_geodataframe
    d = with_signature(obj, obj.to_json())
                        ^^^^^^^^^^^^^
  File "/home/daniel/.virtualenvs/retoflow/lib/python3.11/site-packages/geopandas/geodataframe.py", line 782, in to_json
    return json.dumps(
       ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
       ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Point is not JSON serializable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/daniel/.virtualenvs/retoflow/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-0801906cf0dd>", line 1, in <module>
    gdf_str = pp.to_json({"geo": gdf})
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniel/workspace/pandapower/pandapower/file_io.py", line 132, in to_json
    json_string = json.dumps(net, cls=io_utils.PPJSONEncoder, indent=2)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
      ^^^^^^^^^^^
  File "/usr/lib/python3.11/json/encoder.py", line 202, in encode
    chunks = list(chunks)
         ^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.11/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.11/json/encoder.py", line 439, in _iterencode
    o = _default(o)
    ^^^^^^^^^^^
  File "/home/daniel/workspace/pandapower/pandapower/io_utils.py", line 451, in default
    return json.JSONEncoder.default(self, o)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type GeoDataFrame is not JSON serializable
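
The "intermediate" product from the first case can be illustrated with the standard library alone. The following is only a minimal sketch, not pandapower's actual code: Point2D and tag_object are hypothetical stand-ins for a shapely geometry and for the encode-side handler, and the "_object" key is assumed here purely for illustration. It shows that a handler passed on the encoding side produces tagged dicts, and that decoding without a matching hook leaves those dicts in place.

```python
import json
from dataclasses import dataclass


@dataclass
class Point2D:
    """Hypothetical stand-in for a geometry object such as shapely.Point."""
    x: float
    y: float


def tag_object(obj):
    """Encode-side handler: wrap an otherwise unserializable object in a tagged dict.

    The "_module" and "_class" keys mirror the entries named in this issue;
    the "_object" key and the payload layout are assumptions for this sketch.
    """
    if isinstance(obj, Point2D):
        return {
            "_module": type(obj).__module__,
            "_class": type(obj).__name__,
            "_object": {"x": obj.x, "y": obj.y},
        }
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


column = [Point2D(1, 4), Point2D(1, 9)]

# Encoding succeeds because json.dumps falls back to the handler.
json_str = json.dumps(column, default=tag_object)

# Decoding without any hook returns the tagged dicts unchanged.
restored = json.loads(json_str)
print(restored[0])
```

In this sketch, restored holds plain dicts rather than Point2D instances, which corresponds to the dict entries seen in df2 above.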

Expected Behavior

It would be great to retrieve shapely data even from within dataframes, or from geodataframes outside the geometry column. Any ideas on that?

Installed Versions

INSTALLED VERSIONS

commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.11.9.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-35-generic
Version : #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8
pandas : 1.5.3
numpy : 1.23.5
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.0.0
pip : 24.0
Cython : 3.0.9
pytest : 8.1.1
hypothesis : 6.82.7
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.23.0
pandas_datareader: None
bs4 : 4.12.3
bottleneck : None
brotli : 1.1.0
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : 0.59.1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.2
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None
tzdata : 2024.1


vogt31337 commented 2 weeks ago

I assume that the reason behind this is that we pass the pandapower to_serializable handler as default_handler to pandas upon serialization, but we can't hand over a registry or decode-hook upon de-serialization. Is that correct? Do you have any idea of how to overcome this problem?

I think it is possible to register a new handler, but you would have to write one. I don't know what the impact on loading times and the like would be. I think the same is true for the geometry column, since this is handled by a custom loading hook.
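
To make the idea of a registered handler a bit more concrete, here is a rough sketch using only the standard library. It is not pandapower's API: DECODERS, register_decoder, restore_tagged and the "_object" payload layout are all hypothetical names and assumptions, and the decoder returns a plain tuple where a real handler would rebuild a geometry. It also goes through json.loads directly, since the open question in this thread is precisely that no such hook can be handed to pandas.

```python
import json

# Hypothetical registry: maps a "_class" tag to a function that rebuilds the object.
DECODERS = {}


def register_decoder(class_name):
    """Decorator that registers a decode function for one "_class" tag."""
    def decorator(func):
        DECODERS[class_name] = func
        return func
    return decorator


@register_decoder("Point")
def decode_point(payload):
    # Stand-in result: a real handler would construct a geometry object here.
    return ("Point", tuple(payload["coordinates"]))


def restore_tagged(d):
    """object_hook: json.loads calls this for every decoded JSON object."""
    decoder = DECODERS.get(d.get("_class"))
    if decoder is None or "_object" not in d:
        return d  # unknown or untagged dicts pass through unchanged
    return decoder(d["_object"])


# Assumed example input; the tagged layout is for illustration only.
json_str = (
    '{"a": [1, 2], "b": ['
    '{"_module": "shapely.geometry.point", "_class": "Point",'
    ' "_object": {"coordinates": [1.0, 4.0]}},'
    '{"_module": "x", "_class": "Unknown", "_object": {}}]}'
)

data = json.loads(json_str, object_hook=restore_tagged)
print(data["b"])
```

Regarding loading times: a hook like this runs once for every JSON object in the input, so its cost grows with the number of objects, whether or not they are tagged. How noticeable that is for real nets would need to be measured.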