DOV-Vlaanderen / pydov

Python package to retrieve data from Databank Ondergrond Vlaanderen (DOV)
https://pydov.readthedocs.io/en/latest/
MIT License

First example/idea of the aimed functionality, setup #1

Closed stijnvanhoey closed 6 years ago

stijnvanhoey commented 7 years ago

As a functionality, the user could do something like this (the naming should be improved to better fit the naming in the groundwater domain):

import dov_downloader as dov
dov.download(list_of_wells).subset_period('2000', '2007').to_csv("name_file.csv")

(in words: download my list of wells, filter that specific period and write everything into a csv-file)

Basically, there are 3 parts in this setup:

  1. download, i.e. the extraction part: downloading data based on a list of stations. This part could be extended with more powerful download_* functions, e.g. download_from_boundingbox(), download_from_aquifer(), ... These extensions of the regular download will always require some additional service calls, but will end up with a list of stations and then use the download function.
  2. subset_*, i.e. the filter part: this should provide some straightforward functions to filter the downloaded data set. When using pandas DataFrames as the basic data type to store the data (see further), a lot of options will be available.
  3. to_*, i.e. the conversion part: the data is stored or exported to a file format that is useful for the user. to_csv/to_excel are examples that are already available, but the advantage of this package would lie in more domain-specific export functionalities, e.g. to_modflow(), to_menyanthes(), to_swap().
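The three parts above can be sketched roughly as follows. This is only an illustration of the chaining idea, not an implementation: the WellData class, the well names and the generated values are all made up, and download() builds placeholder data instead of calling any DOV service.

```python
import io

import pandas as pd


class WellData:
    """Hypothetical container wrapping a DataFrame of water levels."""

    def __init__(self, frame: pd.DataFrame):
        self.frame = frame

    def subset_period(self, start: str, end: str) -> "WellData":
        # Filter part: label-based slicing on a DatetimeIndex includes both ends.
        return WellData(self.frame.loc[start:end])

    def to_csv(self, target) -> None:
        # Conversion part: delegate to pandas.
        self.frame.to_csv(target)


def download(wells) -> WellData:
    """Extraction part. Placeholder data only, no DOV service is contacted."""
    index = pd.date_range("1998-01-01", "2009-12-01", freq="MS")
    data = {well: list(range(len(index))) for well in wells}
    return WellData(pd.DataFrame(data, index=index))


def download_from_boundingbox(xmin, ymin, xmax, ymax) -> WellData:
    # An extended download would first resolve the wells inside the box
    # (service call omitted here) and then reuse the regular download().
    wells = ["well_a", "well_b"]  # placeholder result of that lookup
    return download(wells)


subset = download(["well_a", "well_b"]).subset_period("2000", "2007")
buffer = io.StringIO()
subset.to_csv(buffer)
```

Returning a wrapper object from every step is what makes the one-line chain possible; domain-specific exports such as to_modflow() would be further methods on the same object.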

As we're dealing with time series, using Pandas DataFrames as the data type provides a lot of built-in options. When needed, we can make a new class inheriting from pd.DataFrame to handle some additional metadata. Multiple stations can be handled by having a MultiIndex as column headers. With a DatetimeIndex as the row labels, we have all the Pandas data handling options available, like resampling (daily/monthly/... mean values) and slicing.
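A small sketch of what this could look like, assuming made-up station names and values. The DovFrame subclass and its "source" attribute are hypothetical and only show the pandas subclassing hooks (_metadata and _constructor).

```python
import numpy as np
import pandas as pd

# Rows: DatetimeIndex; columns: MultiIndex of (station, variable).
index = pd.date_range("2000-01-01", periods=60, freq="D")
columns = pd.MultiIndex.from_product(
    [["well_a", "well_b"], ["level"]], names=["station", "variable"]
)
levels = pd.DataFrame(
    np.arange(120, dtype=float).reshape(60, 2), index=index, columns=columns
)

monthly_mean = levels.resample("MS").mean()  # monthly mean values
january = levels.loc["2000-01"]              # slice by partial date string
well_a = levels["well_a"]                    # select a single station


class DovFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass carrying extra metadata."""

    _metadata = ["source"]  # attribute names pandas should treat as metadata

    @property
    def _constructor(self):
        # Make pandas operations return DovFrame instead of a plain DataFrame.
        return DovFrame


frame = DovFrame(levels)
frame.source = "DOV"
```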

Since we will have the XML format as such (always a complete time series) as the stable source for data, I would propose an xml_to_df conversion function that converts the XML to a Pandas DataFrame, as a basic function in direct relation with the other basic functionality, download. These two functions (xml_to_df and download) could be the first milestone to implement. Then, more advanced download functions and more advanced export functions can be built on top of this.
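A minimal sketch of such an xml_to_df function, using only the standard library parser. The XML layout below (well, observation, date, level) is invented for the example; the actual DOV XML schema is not described in this thread and will differ.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up XML layout, used only to demonstrate the conversion.
SAMPLE_XML = """
<well id="well_a">
  <observation><date>2000-01-01</date><level>1.20</level></observation>
  <observation><date>2000-01-02</date><level>1.25</level></observation>
</well>
"""


def xml_to_df(xml_text: str) -> pd.DataFrame:
    """Convert one well's XML into a DataFrame indexed by date."""
    root = ET.fromstring(xml_text.strip())
    records = [
        {"date": obs.findtext("date"), "level": float(obs.findtext("level"))}
        for obs in root.iter("observation")
    ]
    frame = pd.DataFrame.from_records(records)
    frame["date"] = pd.to_datetime(frame["date"])
    # Name the value column after the well so several wells can be joined later.
    return frame.set_index("date").rename(columns={"level": root.get("id")})


df = xml_to_df(SAMPLE_XML)
```

A download function could then fetch the XML for each station and pass the text to xml_to_df, keeping the parsing testable without network access.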

johanvdw commented 7 years ago

We cannot convert the data of one screen (filter) to a single dataframe. If we have a list of screens, I think we actually have three dataframes: one with the screen properties (id, x, y, ...), a second one with water level observations (date + level + some other fields) and a third one with groundwater quality data.

I agree using pandas is the way forward.

stijnvanhoey commented 7 years ago

Indeed, I agree that we cannot put all the information in one dataframe, but will need a class with the different dataframes as attributes. See the setup branch on my fork, https://github.com/stijnvanhoey/pydov/blob/setup/pydov/dovseries.py, which can be interpreted as a first draft implementation of this concept. I have to check the new data format and adapt the code next week.

johanvdw commented 7 years ago

Ok - I was not aware of that branch. Perhaps one comment: we had better call the class DovGroundwater/DovGrondwater, as we will be adding similar classes for other object types.
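To make the idea concrete, a class holding the three dataframes as attributes might look like the sketch below. This is not the code from the linked branch; the column names (screen_id, x, y, level) and the levels_for helper are assumptions for illustration.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass(eq=False)  # skip generated __eq__, DataFrames do not compare to a bool
class DovGroundwater:
    """Hypothetical container with one DataFrame per kind of data."""

    screens: pd.DataFrame = field(default_factory=pd.DataFrame)       # id, x, y, ...
    water_levels: pd.DataFrame = field(default_factory=pd.DataFrame)  # date, level, ...
    quality: pd.DataFrame = field(default_factory=pd.DataFrame)       # quality data

    def levels_for(self, screen_id: str) -> pd.DataFrame:
        # Select the water level observations belonging to one screen.
        mask = self.water_levels["screen_id"] == screen_id
        return self.water_levels[mask]


gw = DovGroundwater(
    screens=pd.DataFrame({"screen_id": ["s1", "s2"], "x": [0.0, 1.0], "y": [0.0, 1.0]}),
    water_levels=pd.DataFrame(
        {
            "screen_id": ["s1", "s1", "s2"],
            "date": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-01"]),
            "level": [1.20, 1.25, 3.40],
        }
    ),
)
s1_levels = gw.levels_for("s1")
```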

stijnvanhoey commented 6 years ago

Closing here, check #18 for further discussion