Open · pablojimpas opened this issue 2 years ago
Thanks for submitting this issue, @pablojimpas!
Maybe you've already seen the ClickHouse processor, which also inserts traffic in a SQL database. A Timescale processor wouldn't look much different.
I don't know all the ins and outs of TimescaleDB, so I'm not entirely sure what's the best way to design the database schema. Since you indicated that you can do more research, can you look into that?
We can take the schema for ClickHouse as a starting point, and find alternatives for ClickHouse specifics that TimescaleDB doesn't have or does differently:
* Storing RX metadata directly in the `tts_uplink_messages` table. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?
* Storing decoded payloads directly in the `tts_uplink_messages` table. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?
* Inserts into the "main" table automatically getting transformed and inserted into the `tts_uplink_message_decoded_payloads` and `tts_uplink_message_rx_metadata` tables. In ClickHouse that's done with a materialized view. It looks like TimescaleDB does this in a similar way.

Let me know if you can pick this up.
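The closest TimescaleDB analogue to a ClickHouse materialized view is a continuous aggregate. A hypothetical sketch (the table, view, and column names here are assumptions, not an agreed schema):

```sql
-- Continuous aggregate: incrementally maintained rollup of the main table.
CREATE MATERIALIZED VIEW tts_uplink_message_counts
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', received_at) AS bucket,
       device_id,
       count(*) AS uplinks
FROM tts_uplink_messages
GROUP BY bucket, device_id;
```

One caveat: continuous aggregates are for time-bucketed aggregation, not per-row fan-out; extracting each row's metadata into a separate table would more likely need a trigger or be done by the processor itself.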
> I don't know all the ins and outs of TimescaleDB, so I'm not entirely sure what's the best way to design the database schema. Since you indicated that you can do more research, can you look into that?
Me neither, but I'm currently evaluating the use of Timescale for an integration and it's quite interesting. In summary, as far as I know now, all the standard SQL features should work, since in the end it's just PostgreSQL; the interesting specific feature is that Timescale introduces the concept of "hypertables", which are optimized for time-series data. I'll continue learning about it this weekend; they have really good docs, so it shouldn't be too hard.
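To illustrate the hypertable concept, a minimal sketch (the table and column names are placeholders, not a proposed schema):

```sql
-- A regular PostgreSQL table...
CREATE TABLE uplinks (
    received_at TIMESTAMPTZ NOT NULL,
    device_id   TEXT        NOT NULL,
    payload     JSONB
);

-- ...converted into a hypertable, automatically partitioned
-- into time-based chunks on the timestamp column.
SELECT create_hypertable('uplinks', 'received_at');
```

After conversion the table is queried with ordinary SQL; TimescaleDB handles the chunking transparently.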
> We can take the schema for ClickHouse as a starting point, and find alternatives for ClickHouse specifics that TimescaleDB doesn't have or does differently:
>
> * Storing RX metadata directly in the `tts_uplink_messages` table. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?
> * Storing decoded payloads directly in the `tts_uplink_messages` table. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?
> * Inserts into the "main" table automatically getting transformed and inserted into the `tts_uplink_message_decoded_payloads` and `tts_uplink_message_rx_metadata` tables. In ClickHouse that's done with a materialized view. It looks like TimescaleDB does this in a similar way.
>
> Let me know if you can pick this up.
I don't know the specifics of ClickHouse either; it would be easier for me to port this if it were plain SQL. I will investigate further and see if I can make a solution work. I'll let you know what my findings are and contribute the implementation if I'm successful... I can definitely use some help though :)
FYI: So far I've tried using the data type `JSONB` for the columns `correlation_ids`, `rx_metadata` and `decoded_payload`, and it seems to do the trick. I have a pretty basic example working (not comparable to the ClickHouse one), but if you want I can make a PR to add it.
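A minimal sketch of what that table might look like (hypothetical; the timestamp and `device_id` columns are assumptions, only the three `JSONB` columns are from the description above):

```sql
CREATE TABLE tts_uplink_messages (
    received_at     TIMESTAMPTZ NOT NULL,
    device_id       TEXT,
    correlation_ids JSONB,
    rx_metadata     JSONB,
    decoded_payload JSONB
);
```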
Thanks for looking into this. I have a couple of thoughts on it:

1. What do you think about using the `character varying[]` type for the correlation IDs instead of `JSONB`?
2. How efficient is it to run queries over JSONB `rx_metadata` compared to having the processor insert the RX metadata into a separate table and querying from there? Or do you plan to use a continuous query to extract the metadata from JSON into a separate table?
3. Same question for `decoded_payload`.
4. One concern I'd have with using `JSONB` for `decoded_payload` is related to encoding:

   > RFC 7159 permits JSON strings to contain Unicode escape sequences denoted by `\uXXXX`. In the input function for the `json` type, Unicode escapes are allowed regardless of the database encoding, and are checked only for syntactic correctness (that is, that four hex digits follow `\u`). However, the input function for `jsonb` is stricter: it disallows Unicode escapes for characters that cannot be represented in the database encoding. The `jsonb` type also rejects `\u0000` (because that cannot be represented in PostgreSQL's text type), and it insists that any use of Unicode surrogate pairs to designate characters outside the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes are converted to the equivalent single character for storage; this includes folding surrogate pairs into a single character.

   (source: https://www.postgresql.org/docs/current/datatype-json.html)
It's not uncommon for (decoded) payloads to contain `\u0000`, so if we used `JSONB` we wouldn't be able to store those payloads. If TimescaleDB is just as happy ingesting `JSON`, I think it's better to use that (at least for `decoded_payload`).

Note that this may not be a big issue if the processor (like the ClickHouse processor) filters out non-numeric values.
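The difference can be seen directly in psql (a sketch based on PostgreSQL's documented behavior; not tested against a specific TimescaleDB version):

```sql
-- The json type only checks the escape syntactically:
SELECT '{"payload": "\u0000"}'::json;   -- accepted, stored as entered

-- The jsonb type rejects it, because \u0000 cannot be
-- represented in PostgreSQL's text type:
SELECT '{"payload": "\u0000"}'::jsonb;  -- ERROR: unsupported Unicode escape sequence
```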
> What do you think about using the `character varying[]` type for the correlation IDs instead of `JSONB`?
You're right, using an array would be a better fit for that field.
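With a `character varying[]` column, membership checks use the native array operators rather than JSON functions. A hypothetical fragment (table and column names as discussed above are assumptions, and the ID is a placeholder):

```sql
SELECT *
FROM tts_uplink_messages
WHERE 'some-correlation-id' = ANY (correlation_ids);
```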
> How efficient is it to run queries over JSONB `rx_metadata` compared to having the processor insert the RX metadata into a separate table and querying from there? Or do you plan to use a continuous query to extract the metadata from JSON into a separate table?
>
> Same question for `decoded_payload`.
I haven't run any benchmarks. I know the `JSONB` type adds some overhead on inserts, but then it's supposed to be quite efficient for querying; you can even create indexes on it. However, I think this wouldn't be an issue for my use case at least: I plan to store the decoded payload as is, then query the table to get the whole `JSONB` field, and the business logic will figure out which JSON fields to look for and how to interpret them (I don't want to put that logic inside a benthos processor).
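If indexed queries over the payload ever become necessary, `JSONB` supports GIN indexes and containment queries. A sketch (the table and column names as discussed above are assumptions):

```sql
-- Index the JSONB column for containment queries.
CREATE INDEX idx_decoded_payload
    ON tts_uplink_messages USING GIN (decoded_payload);

-- Find rows whose payload contains a matching key/value pair.
SELECT received_at, decoded_payload
FROM tts_uplink_messages
WHERE decoded_payload @> '{"soilPH": 5.83}';
```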
> 4. It's not uncommon for (decoded) payloads to contain `\u0000`, so if we'd be using `JSONB` we wouldn't be able to store those payloads. If TimescaleDB would be just as happy ingesting `JSON` I think it's better to use that (at least for `decoded_payload`). Note that this may not be a big issue if the processor (like the ClickHouse processor) filters out non-numeric values.
I'm not sure I understand this issue; so far I haven't run into problems with encoding. The decoded payloads that I've tested so far look like this:
```json
{
  "soilK": 74,
  "soilN": 22,
  "soilP": 29,
  "soilEC": 306,
  "soilPH": 5.83,
  "soilMoisture": 16.55,
  "soilTemperature": 23.19
}
```
I'm not doing any fancy processing in the benthos processor like in the case of ClickHouse; I'm just storing the raw decoded payload with:

```
uplink_message.decoded_payload.string()
```
Summary
It would be great to support an option to write uplinks to a time-series database such as TimescaleDB. TimescaleDB is just an extension of PostgreSQL, so all the pieces needed are in place, since Benthos' `sql_insert` already supports a `postgres` driver option.

Use Case
This will be useful for almost any IoT use case concerned with monitoring, since a time-series database has some nice properties for such scenarios.
Implementation
No response
Testing
No response