Deltares / HYDROLIB-core

Core code around the I/O of the DHYDRO-suite
https://deltares.github.io/HYDROLIB-core/
MIT License
24 stars 4 forks source link

NaNs in pandas dataframes for optional fields hamper object construction #201

Open arthurvd opened 2 years ago

arthurvd commented 2 years ago

Describe the bug Various optional fields in the FM model input will be represented as nan when exported to a pandas DataFrame. When constructing a new object from such dataframe, unnecessary validation errors will occur, and the object is not created. (reported in #199)

To Reproduce Steps to reproduce the behavior:

  1. Run this script:
    
    from pathlib import Path
    import pandas as pd
    from hydrolib.core.io.crosssection.models import CrossDefModel
    from hydrolib.core.io.dimr.models import DIMR, FMComponent
    from hydrolib.core.io.mdu.models import FMModel

Specify paths

root = Path.cwd() modeldir = Path( "c:/checkouts/HYDROLIB-core_git/tests/data/input/e02/f152_1d2d_projectmodels_rhu/c04_DHydamo-MGB-initialisation/fm/" ) mdufile = "moergestels_broek.mdu"

fm = FMModel(modeldir / mdufile)

cross_def = pd.DataFrame([cs.dict for cs in fm.geometry.crossdeffile.definition])

definition = cross_def.to_dict("records")

crossdef_new = CrossDefModel(definition=cross_def.to_dict("records"))

2. At the last line, an exception will occur that follows from our custom enum validation, such as:

'float' object has no attribute 'lower' File "C:\checkouts\HYDROLIB-core_git\hydrolib\core\io\ini\util.py", line 48, in get_enum if entry.lower() == v.lower():

   (variable `v` is `nan` here)
3. And, when temporarily switching off this enum validation in `CrossDefModel`, then still the object creation gives validation errors follows from our custom enum validation, such as:

471 validation errors for CrossDefModel definition -> 0 -> circ_d0.300 -> frictionType value is not a valid enumeration member; permitted: 'Chezy', 'Manning', 'wallLawNikuradse', 'WhiteColebrook', 'StricklerNikuradse', 'Strickler', 'deBosBijkerk' (type=type_error.enum; enum_values=[<FrictionType.chezy: 'Chezy'>, <FrictionType.manning: 'Manning'>, <FrictionType.walllawnikuradse: 'wallLawNikuradse'>, <FrictionType.whitecolebrook: 'WhiteColebrook'>, <FrictionType.stricklernikuradse: 'StricklerNikuradse'>, <FrictionType.strickler: 'Strickler'>, <FrictionType.debosbijkerk: 'deBosBijkerk'>])



**Expected behavior**
Optional fields read from input may be represented by nan in pandas DataFrame, but subsequent object creation must directly be possible via `CrossDefModel(definition=cross_def.to_dict("records"))`.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Version info (please complete the following information):**
 - OS: [e.g. Windows]
 - Version [e.g. 0.1.5]

**Additional context**
Add any other context about the problem here.
evetion commented 2 years ago

This behaviour is warned for in https://deltares.github.io/HYDROLIB-core/0.2.0/tutorials/dataframe_handling/, although it is not solved in that example.

Fields that are None are automatically converted to NaN by pandas so the column is all numbers, and thus fast to operate on. I'd say it's up to the user when to convert these NaNs (they could be valid).

This is easily done by something like this:

cross_def.where(pd.notnull(cross_def),None).to_dict()

Or, if the user isn't concerned about performance, you create a DataFrame of objects

pd.DataFrame([cs.__dict__ for cs in fm.geometry.crossdeffile.definition], dtype=object)

In essence this is caused by pandas only using 1 value for missing data (NaN) for performance.

evetion commented 2 years ago

The above is a workaround, to fully fix this I propose:

Note that DataFrames are not needed for (temporary) copies of objects, you can just use object.copy(deep=True) (see https://pydantic-docs.helpmanual.io/usage/exporting_models/#modelcopy).

arthurvd commented 2 years ago

@LisaWeijers: can you let us know whether completely skipping the dataframes, and directly making a true copy of the CrossDefModel is also working well for you?, so using @evetion's suggestion above:

Note that DataFrames are not needed for (temporary) copies of objects, you can just use object.copy(deep=True) (see https://pydantic-docs.helpmanual.io/usage/exporting_models/#modelcopy).

arthurvd commented 2 years ago

This is easily done by something like this:

cross_def.where(pd.notnull(cross_def),None).to_dict()

In #255 I discovered that some scalar float columns were not accepting None, so first the dataframe must be cast into object type, as follows:

df2=df_structures.astype(object)
df2.where(pd.notnull(df2),None, inplace=True)