Thanks @AndyClifton for posing the question.
I understand the benefits of a self-describing file, so I am interested in exploring this.
I suppose I have two thoughts so far;
Cheers,
Re your questions...
Sorry, what I mean when talking about space is not actual storage but transmitting the data, e.g. via a web API. Ideally you'd want it to be compact so it can be transferred quickly.
Formally, I don't believe we have any sort of recommendation on how to combine it into a single file. Prefixing the JSON header onto the CSV seems like one option, but parsing it would require a slightly more complicated parser than reading in a CSV or JSON individually. This does seem to be the format many OEMs are using as well.
Regarding how the data model (in JSON format) ties into the table of measurements (e.g. in CSV format), the natural way for them to link is through the data columns table (which also specifies the `stat`) as opposed to the measurement name, which does not specify the `stat`.
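For what it's worth, a minimal sketch of what a prefixed single file and its parser could look like. The delimiter line, the column names and the tiny metadata stub are illustrative assumptions, not part of the data model:

```python
import io
import json

import pandas as pd

# Hypothetical combined export: a JSON metadata header, a delimiter line,
# then the CSV timeseries body. The metadata stub below is heavily simplified
# and is not the full IEA Task 43 data model.
combined = """{"measurement_location": [{"name": "mast_001"}]}
---END OF METADATA---
timestamp,WS_80m_avg,WS_80m_sd
2023-01-01T00:00:00,7.2,0.9
2023-01-01T00:10:00,7.5,1.1
"""

def split_combined_file(text, delimiter="---END OF METADATA---"):
    """Split a combined export into its JSON metadata and CSV timeseries parts."""
    header, body = text.split(delimiter, 1)
    metadata = json.loads(header)
    data = pd.read_csv(io.StringIO(body.lstrip("\n")), parse_dates=["timestamp"])
    return metadata, data

metadata, data = split_combined_file(combined)
print(list(metadata), data.columns.tolist())
```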
@stephenholleran:
Sorry, what I mean when talking about space is not actual storage but transmitting the data, e.g. via a web API. Ideally you'd want it to be compact so it can be transferred quickly.
Understood. I think this depends on the use case. We only anticipate doing this bundling of data and metadata for specific purposes, and the main one is a user requesting an export of their data so they can use it in another application. Then it would be fine if a call was made and the user informed later that the data was ready for download. So, I don't think that speed is too much of a challenge. But I guess the problem could be that if the API call is fast enough, it would be tempting to use it more frequently.
Because this seems like a UI/UX issue, I would say it's worth setting out how the data model could include data, and worry about the implementation challenges separately?
@abohara: thanks for this.
Formally, I don't believe we have any sort of recommendation on how to combine it into a single file. Prefixing the JSON header onto the CSV seems like one option, but parsing it would require a slightly more complicated parser than reading in a CSV or JSON individually. This does seem to be the format many OEMs are using as well.
Regarding how the data model (in JSON format) ties into the table of measurements (e.g. in CSV format), the natural way for them to link is through the data columns table (which also specifies the `stat`) as opposed to the measurement name, which does not specify the `stat`.
Do you have an example of what this could look like?
@AndyClifton I know you're talking about lidars, but the Kintech Orbit 360 logger for met masts is a good example of metadata in JSON in the header and time series in the file's body. Still, such files need to be parsed into the IEA Task 43 format.
Sorry for the delay, I was at WESC last week.
From @abohara:
Regarding how the data model (in JSON format) ties into the table of measurements (e.g. in CSV format), the natural way for them to link is through the data columns table (which also specifies the `stat`) as opposed to the measurement name, which does not specify the `stat`.
This can work; however, it just makes working with the data a little trickier if there was a certain type of config change, for example an 80m anemometer being moved to 100m while it is still connected to Ch1. The Ch1 column in the data now means two different things (a wind speed at 80m for the first part and then a wind speed at 100m for the second part). The data model captures the config change, so all the info is there to break up that column and rename it something else. What we do is assemble the data taking these config changes into account and then use the measurement point name as the column heading. To get around the lack of a `stat` we just append the stat enum onto the end. I know it's not a great solution, but it works.
So there are two methods by which you can link the metadata to the timeseries data. The difference is just when the assembling of the data is done.
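A rough sketch of that assembly step, using simplified, illustrative config records and column names rather than the exact data model keys:

```python
import pandas as pd

# Raw logger export: Ch1 means different things before and after a config change.
raw = pd.DataFrame(
    {
        "timestamp": pd.date_range("2023-06-30 23:40", periods=4, freq="10min"),
        "Ch1_Avg": [7.2, 7.4, 8.1, 8.3],
    }
)

# Simplified stand-in for what the data model's logger config captures:
# which measurement point each logger column maps to, and over which dates.
configs = [
    {"column_name": "Ch1_Avg", "measurement_point": "WS_80m", "stat": "avg",
     "date_from": "2023-01-01", "date_to": "2023-07-01"},
    {"column_name": "Ch1_Avg", "measurement_point": "WS_100m", "stat": "avg",
     "date_from": "2023-07-01", "date_to": None},
]

pieces = []
for cfg in configs:
    mask = raw["timestamp"] >= pd.Timestamp(cfg["date_from"])
    if cfg["date_to"] is not None:
        mask &= raw["timestamp"] < pd.Timestamp(cfg["date_to"])
    # Measurement point name + stat enum as the assembled column heading.
    col = f"{cfg['measurement_point']}_{cfg['stat']}"
    piece = raw.loc[mask, ["timestamp", cfg["column_name"]]].rename(
        columns={cfg["column_name"]: col}
    )
    pieces.append(piece.set_index("timestamp"))

# One column per (measurement point, stat), split at the config change.
assembled = pd.concat(pieces).sort_index()
print(assembled)
```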
From @AndyClifton:
But I guess the problem could be that if the API call is fast enough, it would be tempting to use it more frequently.
Because this seems like a UI/UX issue, I would say it's worth setting out how the data model could include data, and worry about the implementation challenges separately?
People are impatient @AndyClifton, they'll want the data immediately 😄 . There are several ways to capture tabular data in JSON Schema, so we should consider data transfer speeds when picking one.
Of the possible JSON formats for storing table data, I would probably lean towards one of the formats that pandas outputs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html . I know we have used one of them in the past but I can't remember which. This would make it easy for the user to load the JSON back into a pandas dataframe or something similar.
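For illustration, two of the pandas orients that could be candidates (which one would suit best is still open, and the column names here are made up):

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"WS_80m_avg": [7.2, 7.5], "WS_80m_sd": [0.9, 1.1]},
    index=pd.to_datetime(["2023-01-01 00:00", "2023-01-01 00:10"]),
)

# "split" separates columns, index and values, so it is fairly compact;
# "table" embeds a JSON Table Schema with the data (self-describing but more verbose).
compact = df.to_json(orient="split", date_format="iso")
described = df.to_json(orient="table")

# Read the compact form back into a DataFrame.
df_again = pd.read_json(io.StringIO(compact), orient="split")
print(df_again)
```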
Filename and checksum would be possible options for keeping the original format while adding metadata. We've done that and it feels comfortable and retro-compatible.
However, if the data is updated then the link gets broken and the checksum needs to be updated.
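A minimal sketch of that filename + checksum pattern; the metadata key names used here are illustrative, not agreed data model fields:

```python
import hashlib
from pathlib import Path

def attach_file_reference(metadata: dict, data_file: Path) -> dict:
    """Record the timeseries filename and its SHA-256 checksum in the metadata."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    metadata["timeseries_files"] = [{"filename": data_file.name, "sha256": digest}]
    return metadata

def verify_file_reference(metadata: dict, data_file: Path) -> bool:
    """Return True if the data file still matches the checksum in the metadata."""
    recorded = metadata["timeseries_files"][0]["sha256"]
    return hashlib.sha256(data_file.read_bytes()).hexdigest() == recorded
```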
Hey there,
If you haven't encountered it yet, TimescaleDB might be useful as you have already linked a PostgreSQL schema. It's an extension to PostgreSQL that enables storing timeseries data in your Postgres database alongside your relational metadata, in a special table format called a Hypertable.
We're currently migrating to it at Mainstream Renewable Power from our legacy multiple-database system (one for timeseries & one for metadata). We're using Django, Django Rest Framework & psycopg2 as the glue for combining timeseries & metadata for exporting to various formats.
Thanks @rdmolony,
Yeah, that could be useful as an extension of the SQL create table statements.
Just a note of caution with timeseries databases. They are mostly geared towards retrieving the most recent time-based data, not what wind analysts want, which is the whole dataset in one go. They do shard the tables based on time, keeping the most recent in memory for fast retrieval, which is great if you want the last 5 days for data monitoring but maybe not so fast for 2 years of data. It all depends on what hardware you can throw at it (i.e. cost) and how you configure it too, and as you say TimescaleDB does give you extra benefits by being integrated into Postgres.
Thanks also for contributing in yesterday's workshop and it is great to hear that Mainstream are further aligning with the data model.
There's no such thing as a free lunch! As always, it depends on your use case!
If you're interested in our specific quirks so far with this tech, some notes -
- Our goal in migrating was simplifying our systems. We are still storing our timeseries readings in Microsoft SQL Server & our metadata in PostgreSQL & combining the two in pandas. It's messy.
- Rather than storing row-by-row as is typical in a relational database, these tables store in chunks & so are optimised for bulk reads. However, querying is not as fast as an in-memory DataFrame library like pandas because it doesn't load the entire query into memory. For the same reason it does scale to very large datasets. For 11 years of 50 sensor readings every 10s (on our self-hosted Windows machine with 32GB RAM | 2.7GHz Intel Xeon Platinum 8168 CPU), a simple select takes 24s & calibrating + selecting takes 1m10s.
- Compressed Hypertables assume that you're only ingesting recent data (the last month or week or day). To ingest an "older" data source you have to decompress chunks before insertion, which can be really slow when inserting, say, 10 years!
- Ideally I'd like to cache calibrated readings in another compressed hypertable; however, it's slow to update compressed hypertables so this isn't useful yet.
- Producing Windographer exports is surprisingly hard in Postgres/Python. There is the built-in function crosstab; however, this is very slow on exporting calibrated readings on top of the existing computation. As an alternative we can also stream over the query result in Python & pivot chunk by chunk - a bit messy (a rough sketch of that approach follows this list).
- In our case this compressed hypertable ingest constraint is "okay" for ingesting our logger data but not so convenient for 3rd-party data sources. To work around the query slowness on calibrated readings we run periodic jobs to dump calibrated readings to files.
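A rough sketch of that chunked stream-and-pivot approach, assuming a hypothetical long-format `calibrated_readings` table and connection string (not Mainstream's actual schema or code):

```python
import pandas as pd
import sqlalchemy

# Hypothetical long-format readings table: (timestamp, sensor_name, value).
engine = sqlalchemy.create_engine("postgresql+psycopg2://user:password@localhost/wind")
query = "SELECT timestamp, sensor_name, value FROM calibrated_readings ORDER BY timestamp"

pieces = []
for chunk in pd.read_sql(query, engine, chunksize=100_000):
    # Pivot each chunk from long to wide (one column per sensor) ...
    pieces.append(chunk.pivot_table(index="timestamp", columns="sensor_name", values="value"))

# ... then stitch the chunks together; timestamps split across chunk
# boundaries are merged by taking the first non-null value per column.
wide = pd.concat(pieces).groupby(level=0).first()
wide.to_csv("windographer_export.csv")
```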
EDIT 1: I looped in the TimescaleDB team for comment -
(@jonatas) Very interesting challenge! Thanks for mentioning TimescaleDB :timescale-mark: Also, I see great potential for continuous aggregates, avoiding touching old data at all. If data is computed to be accessed, a cagg can probably replace the raw data. Another advantage is that parallelization will always speed up the queries compared to regular PostgreSQL tables. Loving to learn from the thread.
EDIT 2: Calibrating generic readings approximately doubles the query time, as Postgres has to make two round trips for a SELECT *: one to fetch generic readings & another to recalibrate these readings like (readings.generic_reading - meta.logger_offset) * (meta.certificate_slope / meta.logger_slope) + meta.certificate_offset (see the TimescaleDB Slack discussion above for more info).
Thanks @rdmolony, brilliant info there on how you are solving this.
One comment, based on the fact that Windographer will very soon be able to import the data model: why try to prepare the data, i.e. calibrate it, before loading it into Windographer? In theory, Windographer should be able to apply the calibrations for you as the adjustments required should all be contained in the metadata. This could potentially save you the headache of trying to provide calibrated data. Just a thought, and we are digressing from the opening question about how the data model could accommodate the timeseries data. :) Sorry everyone else!
EDIT: This timescaleDB discussion is now moved to #227.
Also, @rdmolony, if you are interested in joining our Task 43 wind resource data model working group we would love to have you. Experience from someone who is implementing the data model would be very useful, and as you are a software engineer your perspective on changes going forward could be valuable. We have a 1-hour call every second Thursday at 4 pm Irish time. There will be no homework if you don't want any; it is mostly turn up and contribute within that 1 hour.
Moving to #227 @stephenholleran!
Sure, add me to the email thread or however these things are organised :)
Hi folks,
We're looking at using the data model as the basis for a single file that contains both metadata and measurements from a met mast or wind lidar.
The reason for this is to keep measurements together with metadata and so allow continuity when data are transferred from person to person, from process to process, or between organisations. In our experience, any approach that separates metadata from data runs the risk of losing the link. Our goal is to be able to create a single file that contains the data the user/process requires as an export from a file collection, database, web app or whatever.
As we understand it, currently the data model does not have a way to store time series data. (Thanks @stephenholleran for the patience / confirmation)
🤔 How would you extend / modify the data model to include time series data?