VForWaTer / metacatalog

Modular metadata management platform for environmental data.
https://vforwater.github.io/metacatalog
GNU General Public License v3.0

Exported array data of type decimal.Decimal #147

Closed AlexDo1 closed 3 years ago

AlexDo1 commented 3 years ago

I just exported eddy data from metacatalog. To check the data, I wanted to calculate some values such as the minimum with edat['u'].min(), which leads to the following error: InvalidOperation: [<class 'decimal.InvalidOperation'>]

All values in the exported data frame are of type decimal.Decimal. It seems that pandas cannot perform operations like .min() and .max() on this data type.

I would have to convert the series to type float to calculate min and max.
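A minimal sketch of that workaround, assuming a column that holds decimal.Decimal values. The sample values are made up, and edat here is only a stand-in for the exported data frame:

```python
from decimal import Decimal

import pandas as pd

# Hypothetical stand-in for the exported data frame; values are made up.
edat = pd.DataFrame({"u": [Decimal("1.5"), Decimal("0.25"), Decimal("3.0")]})

# Decimal objects are stored in a generic object column.
print(edat["u"].dtype)  # object

# Convert the column to float before computing statistics.
u_float = edat["u"].astype(float)
u_min = u_float.min()
u_max = u_float.max()
print(u_float.dtype, u_min, u_max)
```

The conversion has to be repeated for every column and every export, which is why a fix inside metacatalog seems preferable.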

I don't think this is ideal, as metacatalog should work smoothly with pandas.

dtypes for the imported data in metacatalog are defined here: https://github.com/VForWaTer/metacatalog/blob/15ee52f0e7b964e0b1df19132696cbe06a208bbe/metacatalog/ext/io/importer.py#L126-L131

Could we use a datatype like 'data': ARRAY(sa.NUMERIC) to solve this issue?

mmaelicke commented 3 years ago

Yes. That is a good spot. Great!

Does switching to Numeric solve the issue?

AlexDo1 commented 3 years ago

I just switched to Numeric and it does not solve the issue; the datatype is still Decimal.

Should we define the dtypes in reader.py like we did in importer.py?

mmaelicke commented 3 years ago

Yes. As far as I know, Decimal is the only problem here; the DateTime should convert smoothly. I remember once finding a conversion function in sqlalchemy or pandas that does exactly this, but I can't recall which one...

AlexDo1 commented 3 years ago

The problem is solved by adding the dtype parameter to pd.DataFrame() here: https://github.com/VForWaTer/metacatalog/blob/15ee52f0e7b964e0b1df19132696cbe06a208bbe/metacatalog/ext/io/reader.py#L64

df = pd.DataFrame(data=raw, columns=col_names, dtype=np.float64, index=df_sql.index)

This way, the columns of the exported DataFrame are of type float64.
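The effect of the dtype argument can be checked in isolation. The following is a sketch only: raw, col_names and index are made-up stand-ins for the objects built in reader.py, not the actual metacatalog code.

```python
from decimal import Decimal

import numpy as np
import pandas as pd

# Made-up stand-ins for the objects used in reader.py.
raw = [
    [Decimal("1.5"), Decimal("2")],
    [Decimal("0.25"), Decimal("4")],
]
col_names = ["u", "v"]
index = pd.date_range("2020-01-01", periods=2, freq="D")

# Without dtype, the Decimal values end up in object columns.
df_obj = pd.DataFrame(data=raw, columns=col_names, index=index)

# With dtype=np.float64, every column is cast to float64 on construction.
df = pd.DataFrame(data=raw, columns=col_names, dtype=np.float64, index=index)

print(df_obj.dtypes.tolist())
print(df.dtypes.tolist())
print(df["u"].min(), df["u"].max())
```

Note that the cast applies to all columns at once, so this only works when every column is numeric.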

mmaelicke commented 3 years ago

Alright, then let's go with this. It will convert integer-based fields as well and take more space than necessary in those cases, but that does not really matter. If we run into performance issues, we can come back to this issue.
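To illustrate the trade-off: whole-number values stored as float64 occupy 8 bytes each. Should the extra space ever matter, one possible follow-up (not something decided in this thread) would be to downcast such columns after export, for example with pd.to_numeric. The column name and values below are made up.

```python
import numpy as np
import pandas as pd

# Made-up integer-valued column as it would look after the float64 cast.
counts = pd.Series([1, 2, 3, 4], dtype=np.float64, name="counts")

# downcast="integer" picks the smallest signed integer dtype that can hold
# the values, provided every value is a whole number.
compact = pd.to_numeric(counts, downcast="integer")

print(counts.dtype, counts.to_numpy().nbytes)
print(compact.dtype, compact.to_numpy().nbytes)
```

This would only help for columns without missing values, since NaN cannot be represented in a NumPy integer dtype.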