VForWaTer / metacatalog

Modular metadata management platform for environmental data.
https://vforwater.github.io/metacatalog
GNU General Public License v3.0

Generic Table for Array-Type Data #124

Closed mmaelicke closed 3 years ago

mmaelicke commented 3 years ago

@AlexDo1 please check out this example I quickly wrote together: https://gist.github.com/mmaelicke/fe9d18c766424db34bb0177a619bf5e0

Locally, you can import the install function, create a new database called test with a user test (password test), and run the install function.

If you install fire (pip install fire), you can run it like this:

python array_test.py upload --dims=3 --n=100

and download the data using the ID printed out by the command above as:

python array_test.py read --meta-id=1

This solves the problem of storing ND data in a single table (and other tables of the same structure), and it solves the problem of storing the mapping.

For the column map there are several possibilities (a small sketch of option 2's default naming follows the list):

  1. The colmap would become a mandatory field on the Datasource.
  2. We map the columns by default to value and [value1, value2, ...], but if the Detail _col_mapping exists, we use that instead.
  3. The colmapping is an optional field on Datasource that replaces the default behavior from 2.

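For illustration, a minimal sketch of the default naming from option 2; the function name and signature are just assumptions, not part of metacatalog:

```python
def default_column_names(n_dims: int) -> list:
    """Return the default column names: 'value' for 1D data, 'value1'..'valueN' otherwise."""
    if n_dims == 1:
        return ['value']
    return [f'value{i + 1}' for i in range(n_dims)]

# default_column_names(1) -> ['value']
# default_column_names(3) -> ['value1', 'value2', 'value3']
```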
What do you think? Is this something I should do, or do you want to go for it?

AlexDo1 commented 3 years ago

The script you wrote works and returns the 3D data with column names. I would like to work on it as I think it would be a good exercise, but it might take some time and we might need to talk about it again if I get stuck somewhere.

  1. I think it would not be bad if the colmap was a mandatory field on the Datasource, as this would perhaps lead to some consistency and clarity, and there would always be a definition of the column names.
  2. This would be more flexible, but the column names should always be known, so it should also not be a problem to simply specify these column names in the Datasource.
  3. So if the colmapping is defined in the Datasource, we use this information; if not, we check whether the key _colmapping exists in the details and, if it does, we use that. If the colmapping information does not exist anywhere, we use value and [value1, value2, ...] again? I think this is not a very consistent behavior, because sometimes we get column names when we export the data, and sometimes we don't (a sketch of this lookup order follows below).

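To make the lookup order from point 3 explicit, here is a rough sketch; the parameter names and the '_colmapping' handling are assumptions for illustration, not the actual metacatalog API:

```python
def resolve_column_names(datasource_colmap, details: dict, n_dims: int) -> list:
    """Resolve column names in the order: Datasource -> details['_colmapping'] -> defaults."""
    # 1. explicit mapping stored on the Datasource (if any)
    if datasource_colmap:
        return list(datasource_colmap)

    # 2. fall back to the optional '_colmapping' key in the details
    if '_colmapping' in details:
        return list(details['_colmapping'])

    # 3. last resort: the generic default names
    return ['value'] if n_dims == 1 else [f'value{i + 1}' for i in range(n_dims)]
```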
I think in general I am more of a fan of using the Datasource for the colmapping information than the details, since the column names are clearly related to the datasource. The optional use of the _colmapping key in the details feels a bit like a workaround and it might be harder for new users to find this option in the first place.

One problem with letting the user decide freely which column names to use could be that e.g. temperature data could sometimes have different column names like T, T_degC, T[K] and so on.

mmaelicke commented 3 years ago

> The script you wrote works and returns the 3D data with column names. I would like to work on it as I think it would be a good exercise, but it might take some time and we might need to talk about it again if I get stuck somewhere.

Great! Then you can just go ahead and create a new branch for this implementation. Try to create it asap and always push your commits to the branch; that makes it easier for me to track the progress and help where necessary.

> [...] I think this is not a very consistent behavior, because sometimes we get column names when we export the data, and sometimes we don't.

I agree. What we need to keep in mind is that the revision of the database needs to somehow process the data from existing datasources and their metadata. Maybe we need a utils submodule that implements a function to derive the column names from the Variable, or something along those lines. This is needed to update existing databases and the data in there; we can't just drop and reinstall.
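To illustrate the idea, a sketch of such a helper; the module location, function name, and the use of Variable.name are assumptions, not existing metacatalog code:

```python
# hypothetical helper for a database revision, e.g. in a metacatalog utils submodule
def derive_column_names(variable, n_columns: int) -> list:
    """Derive column names for an existing datasource from its Variable.

    Meant for updating existing databases where no explicit mapping was stored yet.
    """
    # a single data column: just use the variable name, e.g. 'air_temperature'
    if n_columns == 1:
        return [variable.name]
    # several data columns: suffix the variable name with an index
    return [f'{variable.name}_{i + 1}' for i in range(n_columns)]
```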

> I think in general I am more of a fan of using the Datasource for the colmapping information than the details, since the column names are clearly related to the datasource. The optional use of the _colmapping key in the details feels a bit like a workaround and it might be harder for new users to find this option in the first place.

Agree. Then we create a new field called data_column_names. Keep in mind that not all data in metacatalog is necessarily stored in the very same database, or is of a column type. So maybe data_dimension_names is a more general description. @AlexDo1, even better suggestions are more than welcome.
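As a minimal sketch of what such a field could look like, assuming the Datasource model is a SQLAlchemy declarative model (the stripped-down model below is illustrative only; whether the column is an ARRAY or JSON type is still an open design choice):

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Datasource(Base):
    """Stripped-down stand-in for metacatalog's Datasource model."""
    __tablename__ = 'datasources'

    id = Column(Integer, primary_key=True)
    # new: optional list of column / dimension names for the stored data
    data_column_names = Column(ARRAY(String), nullable=True)
```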

> One problem with letting the user decide freely which column names to use could be that e.g. temperature data could sometimes have different column names like T, T_degC, T[K] and so on.

I would argue that this is intended behavior. The Entry references a Variable anyway, which is mandatory, so the information about what the data represents is stored in any case. The idea of the colmapping is that the data is returned pretty much as it was passed. Another, maybe better, path would be to append a new column to Variable that stores the default column mapping for each variable. The importer function can then use this and store it on the Datasource by default. With the **kwargs, the user has the possibility to overwrite this behavior and set a use_column_names=True argument on Entry.import_data, which will force the column names from the dataframe to be used over the default names from the Variable. The Entry can do this automatically if the length of Variable.column_name and the number of columns in the dataframe do not match; then the column names from the dataframe are used anyway. This way the default column name for the current wind speed would be wind_speed. For eddy and other 3D wind products, the class will use the dataframe names, as the number of columns does not match.
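As a rough sketch of that behavior (the names use_column_names and variable_names are assumptions for illustration, not the final API):

```python
def choose_column_names(variable_names, dataframe_columns, use_column_names=False):
    """Decide which column names are stored with the Datasource on import.

    variable_names:    default names taken from Variable.column_name
    dataframe_columns: columns of the dataframe passed to Entry.import_data
    """
    # explicit override requested by the user, or no defaults defined on the Variable
    if use_column_names or not variable_names:
        return list(dataframe_columns)
    # lengths do not match (e.g. a 3D wind product vs. a single 'wind_speed' default):
    # fall back to the dataframe columns automatically
    if len(variable_names) != len(dataframe_columns):
        return list(dataframe_columns)
    # default case: use the names defined on the Variable, e.g. ['wind_speed']
    return list(variable_names)
```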