Open aptiko opened 4 years ago
Following a meeting today I'm making some more notes.
Above I have been using the terms "Variable type" (for what before was a "variable") and "Variable" (for a group of related time series). Because these terms are not finalized yet, and to be 100% clear on the meaning, I'm temporarily going to be using time series group or TSG for the group of related time series. A TSG essentially consists of several time series such as raw, checked, regularized, hourly, daily, and monthly. Processed time series (like discharge being derived from stage) shall be in a different TSG.
In addition, on-the-fly won't generally work, because the checked time series may have been corrected by a human.
So the first issue is how we shall name the TSG. Proposals so far include variable, process, and sensor/instrument.
The problem with variable is that it might confuse the user, who might expect "variable" to refer to something like "rainfall" rather than "rainfall from sensor 1". We'd need to rename what we currently call a "variable" to "variable type", which seems unintuitive. We wouldn't have this problem if we could ensure that each station has no more than one TSG per variable type, but while this is the most common case, it's not the only case.
Process has many problems. First, it is a highly technical mathematical term. Users not familiar with it would get confused. In addition, it would be inaccurate. Suppose you have two temperature sensors on a station. You'd then have two "processes". But there is actually only one process (the temperature), and the two sensors make two different series of measurements for that single process. Given that it's neither easily understandable nor accurate, I think it's a no-go.
Sensor/instrument has the problem that a processed TSG is not bound to a sensor or instrument. Let's say we have sensor "Stage sensor 1". This then has a stage TSG. If we use the term "sensor/instrument" for TSGs, then this stage TSG will be "Stage sensor 1". If we then process this and produce a discharge TSG, the discharge TSG can't possibly be "Stage sensor 1", since this is another TSG. Unless discharge goes in the same TSG as the stage. This could work in some cases (in fact I consider this in the initial description of this issue), but the general case is that a processed TSG might have many source TSGs. For example, you might have three stage measurement devices and derive a single discharge time series from the three stage time series (e.g. because in winter you need to use another one than in the summer).
One additional option that I propose is to use the term time series for the TSG, because raw, checked, regularized, hourly etc. are essentially different ways to look at the same data. So we call TSGs simply time series. The items that comprise the TSG (i.e. the raw, the checked, etc.) we can call time series versions; we can talk about "the raw version", "the checked version", "the regularized version", and the "hourly version" of a time series. It might not be 100% precise mathematically, but I think it's reasonably simple. The problem is that the word "version" is often used for other purposes. The visiting user won't come across the word "version": he'll be viewing a list of "time series" (i.e. TSGs), and when asking to download data, he'll be presented with a menu containing "raw", "checked", etc. However, the station administrator may need to see the word "version" (e.g. when specifying the source time series version and the target time series version of a processing), and the term will definitely be in the Python code, where it could be quite confusing. (Note: the problem of multiple meanings of "version" does not exist in Greek, where the most appropriate translation for this case seems to be εκδοχή.) Alternatives to the word "version": "view" (this is even worse, but I mention it for brainstorming), "form" (even worse than "view", if possible, because TimeseriesForm in the code would definitely mean a web form), "representation" (which I find unintuitive).
Some points to consider:
1. With respect to the "variable" notion of an observation, this could be replaced with the proper term "random variable", as is common in the literature.
2. The internal (coding) word for this is "derivative"; however, this term was already used externally by the domain experts, so I propose attaching a label (a suffix or prefix, i.e. an adjective) to disambiguate. In summary, for modelling purposes, and optionally for user purposes, you can use the notion _typeof/from. For instance, time aggregation (with an optional argument for the window) is of type_of aggregation because it operates on the time variable, while quantization operates on the value, etc. Some call it filtering in the broad sense.
3. You don't mention the problem you are trying to solve. As a matter of fact, this is a modelling issue. Django, with its native MVC characteristics, pushes hard towards adopting a model quickly, and this avoids a lot of problems. However, here you are (or were) in dire need of a UML use case description.
4. Respective usage/terminology from big G: [https://cloud.google.com/monitoring/api/v3/aggregation]
@dkalog While a meteorological or hydrological variable probably is a random variable, using the term "random variable" would probably confuse most users (and most programmers). When we visit a site with data we want to retrieve the data for some variables—we don't have a statistical context in mind at that time.
It's also not only a modelling problem (btw, I'm not familiar with UML, so I have no opinion on how useful it would be). It's also a problem of what we tell the user.
For example, here's an issue that just came up while @ad0v0 is improving the look of the pages: currently when you visit the detail page for a time series, it shows you a chart. If there is no chart, in some cases (not clear which) it shows the message "No data locally available!". This is confusing, of course—it should just say "No data". What if we want to make the message more human? "There is currently no data in this time series" is much better I think.
- There is currently no data in this time series (decent but not 100% accurate)
- There is currently no data for this variable (decent but can be confusing if there are two TSGs with the same variable type)
- There is currently no data for this process (quite confusing)
- There is currently no data for this sensor (very good but it doesn't work for processed TSGs)
Regarding the issue @ad0v0 ran into, I don't see it as a major problem, but in case you'd like to play with marginal issues: the first version, "There is currently no data in this time series" (decent but not 100% accurate), is good enough for the majority of users, as long as the Y axis reports "Time Series for ...". My previous reference to the issue was point 2, where you need some label (metadata), an adjective describing this derivative Timeseries.
I don't know why, but I loved the term TSG
Following a second meeting today, here's the decision:
The term "time series group" will be avoided on the UI. We can use "data" instead. The message that can be shown in place of the chart when the time series of the group have no data will be "no data". If we want something longer, "there is no data in this time series" might also work, even if not 100% accurate.
I've just reviewed this issue and... I don't think I'd have been able to contribute meaningfully to this discussion, as I may not properly understand the whole process behind the messages you guys wrote. Still, in general I think I've got the point: we're discussing the messages/text/cues that will be displayed to users and how to, so to say, "automate" the generation of these messages?
If you think I can be of any help to you or you need anything from my side, let me know!
Thanks @ad0v0. We discussed the messages that will be displayed to users, but the purpose isn't just the messages. Before going on and making substantial changes in the modeling, I wanted to be certain that we all understand what it is that we're modeling. Very often trying to find a good name for something results in redesigning the code.
Essentially these are design notes.
Right now we have a station (e.g. Agios Spyridonas), and the station contains time series (e.g. "Rainfall measured by sensor 1"). Each time series has a foreign key to a variable (e.g. "Rainfall").
First of all we change terminology. Instead of "time series", a station now has "variables". I.e. "Rainfall measured by sensor 1" is a variable (until now called a "time series"). Each variable shall refer to a variable type (until now called a "variable"). So "Rainfall" is a variable type.
The user will visit the station "Agios Spyridonas" and will be seeing a list of variables. In most cases, there will be a single variable with type "Rainfall". But if the station has, for example, two rainfall sensors, then it will have two "Rainfall" variables.
When the user visits the detail page for a variable, he will be seeing the chart etc., and he will have the option of downloading data. He is going to be given the option of downloading the raw data, or the automatically checked data, or the aggregated hourly, daily or monthly data. We call these different options time series. That is, "daily aggregated rainfall measured by sensor 1" is a time series that corresponds to the variable "rainfall measured by sensor 1".
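To make the proposed terminology concrete, here is a minimal sketch of the station / variable / variable type / time series relationships using plain Python dataclasses. All class and field names are hypothetical and chosen only for illustration; they are not the actual Enhydris models.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VariableType:
    """What used to be called a "variable", e.g. "Rainfall"."""
    name: str


@dataclass
class TimeSeries:
    """One downloadable option of a variable, e.g. "raw" or "daily"."""
    kind: str


@dataclass
class Variable:
    """What used to be called a "time series", e.g. "Rainfall measured by sensor 1"."""
    name: str
    variable_type: VariableType
    time_series: Dict[str, TimeSeries] = field(default_factory=dict)

    def download_options(self) -> List[str]:
        # The options offered when the user asks to download data
        return list(self.time_series)


@dataclass
class Station:
    name: str
    variables: List[Variable] = field(default_factory=list)

    def variables_of_type(self, type_name: str) -> List[Variable]:
        return [v for v in self.variables if v.variable_type.name == type_name]


# A station with two rainfall sensors has two "Rainfall" variables
rainfall = VariableType("Rainfall")
station = Station("Agios Spyridonas")
for sensor in (1, 2):
    variable = Variable(f"Rainfall measured by sensor {sensor}", rainfall)
    for kind in ("raw", "checked", "daily"):
        variable.time_series[kind] = TimeSeries(kind)
    station.variables.append(variable)
```

In this sketch "daily aggregated rainfall measured by sensor 1" is simply `station.variables[0].time_series["daily"]`.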
We may have different requirements for hydroscope, openhi, and openmeteo. For openhi, the raw time step is practically always less than an hour, and the required time series are raw, checked, regularized, hourly, daily, monthly. All these time series can be produced on the fly from raw.
For hydroscope, the raw time step is often daily. There's no checking or regularizing (but it doesn't do any harm to produce these, or to produce monthly).
Openmeteo must be able to do everything.
Possibly a good idea is to keep Enhydris simple, that is, for each variable to have a single "raw" time series. (This means that Enhydris needn't have the notion of "time series" or "raw", since there will only ever be one of those.) Like it is today.
At the moment the parameters for automatic checking must be specified for each variable. So, if a variable is to be automatically checked, the user must specify autocheck parameters in the admin—essentially what he already does today, but without the need to specify a target time series. If there are many checks, they can be done sequentially; first range checking, then time consistency.
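As a rough illustration of running several checks sequentially, the sketch below chains a range check and a simple time consistency check over a list of (timestamp, value) pairs. The function names, the parameters and the rule "a rejected value becomes None" are all assumptions made for this example, not the way enhydris-autoprocess actually implements checking.

```python
from datetime import datetime, timedelta
from functools import partial


def range_check(records, low, high):
    """Replace values outside [low, high] with None (hypothetical rule)."""
    return [
        (t, v if v is not None and low <= v <= high else None)
        for t, v in records
    ]


def time_consistency_check(records, max_change):
    """Reject values that differ from the last accepted value by more than max_change."""
    result = []
    previous = None
    for t, v in records:
        if v is not None and previous is not None and abs(v - previous) > max_change:
            result.append((t, None))
            continue
        result.append((t, v))
        if v is not None:
            previous = v
    return result


def autocheck(records, checks):
    """Apply the checks one after the other; each sees the previous one's output."""
    for check in checks:
        records = check(records)
    return records


start = datetime(2021, 1, 1)
raw = [
    (start + i * timedelta(minutes=10), value)
    for i, value in enumerate([10.0, 10.5, 999.0, 25.0, 11.0])
]
checked = autocheck(
    raw,
    [
        partial(range_check, low=-40.0, high=60.0),       # first range checking
        partial(time_consistency_check, max_change=5.0),  # then time consistency
    ],
)
checked_values = [v for _, v in checked]
```

Here 999.0 is rejected by the range check and 25.0 by the time consistency check, so `checked_values` is `[10.0, 10.5, None, None, 11.0]`.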
Regularization and aggregation to daily and monthly can always be available (hardwired), at least when enhydris-autoprocess is installed. The starting point for regularization will be the checked time series. If checking parameters have not been specified, checked, regularized and aggregated time series will not be available.
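A small sketch of what producing the regularized and daily time series from the checked one could look like. It is deliberately naive: regularization just snaps each timestamp to the nearest multiple of the time step, and the daily value is a plain sum that becomes None if any value of that day is missing. Both rules, and the function names, are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def regularize(records, step):
    """Snap each timestamp to the nearest multiple of step counted from midnight."""
    result = []
    for t, v in records:
        midnight = t.replace(hour=0, minute=0, second=0, microsecond=0)
        n = round((t - midnight) / step)
        result.append((midnight + n * step, v))
    return result


def aggregate_daily(records):
    """Sum the values of each day; a day with any missing value yields None."""
    days = defaultdict(list)
    for t, v in records:
        days[t.date()].append(v)
    return {
        day: None if any(v is None for v in values) else sum(values)
        for day, values in days.items()
    }


checked = [
    (datetime(2021, 1, 1, 0, 9, 58), 0.5),
    (datetime(2021, 1, 1, 0, 20, 3), 1.5),
    (datetime(2021, 1, 2, 0, 10, 1), 2.0),
]
regularized = regularize(checked, timedelta(minutes=10))
daily = aggregate_daily(regularized)
```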
In Enhydris, variables will have a single "download data" button (as it is today); when enhydris-autoprocess is installed, it can modify/extend the button to provide more options as to exactly which time series to download. It will extend the button only if checking parameters have been provided.
The TimeseriesData API view must also be extended. It needs a "timeseries=XXX" parameter to specify which time series to get (the URL for the download data request might be the only place where the term "time series" appears to the user). If the parameter is unspecified, the raw time series shall be returned (exactly as if enhydris-autoprocess were not installed).
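The parameter handling could look roughly like the sketch below, which uses only the standard library rather than the real view. The function names, the list of option names and the choice to raise an error for an unavailable time series are assumptions; only the "timeseries" parameter and the default to raw come from the description above.

```python
from urllib.parse import parse_qs

BASIC = ("raw",)
EXTENDED = ("raw", "checked", "regularized", "hourly", "daily", "monthly")


def available_time_series(autoprocess_installed, has_check_parameters):
    """The extra options exist only with autoprocess and checking parameters."""
    if autoprocess_installed and has_check_parameters:
        return EXTENDED
    return BASIC


def resolve_time_series(query_string, available):
    """Return the requested time series, defaulting to "raw" when unspecified."""
    params = parse_qs(query_string)
    requested = params.get("timeseries", ["raw"])[0]
    if requested not in available:
        raise ValueError(f"time series {requested!r} is not available")
    return requested


options = available_time_series(autoprocess_installed=True, has_check_parameters=True)
default_choice = resolve_time_series("", options)
daily_choice = resolve_time_series("timeseries=daily", options)
```

With this scheme a request without the parameter behaves the same whether or not the extra time series exist.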
The simplest way to go with curve interpolation is probably also on the fly. The source time series for curve interpolation will be the regularized time series. For a "stage" variable, the "download data" button can list the following options:
The problem this scheme has is that it does not provide for manual corrections. But maybe it's acceptable for a start. Corrections can be provided in the future by modifying the enhydris_timeseriesdata database table and adding some kind of "manually corrected value" column.
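One way the "manually corrected value" column could be read back is sketched below, with an in-memory SQLite table standing in for the real database. The table name, the column names and the rule "use the corrected value when present, otherwise the stored value" are assumptions for illustration, not the actual enhydris_timeseriesdata schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE timeseriesdata_sketch (
        timestamp TEXT PRIMARY KEY,
        value REAL,
        corrected_value REAL  -- hypothetical "manually corrected value" column
    )
    """
)
conn.executemany(
    "INSERT INTO timeseriesdata_sketch VALUES (?, ?, ?)",
    [
        ("2021-01-01 00:00", 1.0, None),  # no correction: the stored value is used
        ("2021-01-01 00:10", 99.0, 2.5),  # corrected by a human
    ],
)

# COALESCE picks the correction when there is one, otherwise the stored value
rows = conn.execute(
    """
    SELECT timestamp, COALESCE(corrected_value, value)
    FROM timeseriesdata_sketch
    ORDER BY timestamp
    """
).fetchall()
```

The stored value is never overwritten, so the checked and aggregated time series could still be produced on the fly, starting from the corrected values.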