TheThingsNetwork / lorawan-stack-application-cookbook

A guide for building an application or integration for The Things Stack. Work in progress. 🚧

Write to TimescaleDB #9

Open pablojimpas opened 2 years ago

pablojimpas commented 2 years ago

Summary

It would be great to support an option to write uplinks to a time-series database such as TimescaleDB.

TimescaleDB is just an extension of PostgreSQL, so all the pieces needed are already in place: Benthos' `sql_insert` already supports a `postgres` driver option.
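
For reference, enabling TimescaleDB in an existing PostgreSQL database is a one-liner (assuming the TimescaleDB packages are installed on the server), and everything else keeps behaving like regular PostgreSQL:

```sql
-- TimescaleDB is enabled per database like any other PostgreSQL extension.
CREATE EXTENSION IF NOT EXISTS timescaledb;
```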

Use Case

This will be useful for almost any IoT use case concerned with monitoring, since a time-series database has some nice properties for such scenarios.

Implementation

No response

Testing

No response


htdvisser commented 2 years ago

Thanks for submitting this issue, @pablojimpas!

Maybe you've already seen the ClickHouse processor, which also inserts traffic into a SQL database. A Timescale processor wouldn't look much different.

I don't know all the ins and outs of TimescaleDB, so I'm not entirely sure what's the best way to design the database schema. Since you indicated that you can do more research, can you look into that?

We can take the schema for ClickHouse as a starting point, and find alternatives for ClickHouse specifics that TimescaleDB doesn't have or does differently:

* Storing RX metadata directly in the `tts_uplink_messages` table. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?

* Storing decoded payloads directly in the `tts_uplink_messages` table. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?

* Inserts into the "main" table automatically getting transformed and inserted into the `tts_uplink_message_decoded_payloads` and `tts_uplink_message_rx_metadata` tables. In ClickHouse that's done with a materialized view. It looks like TimescaleDB does this in a similar way.

Let me know if you can pick this up.

pablojimpas commented 2 years ago

I don't know all the ins and outs of TimescaleDB, so I'm not entirely sure what's the best way to design the database schema. Since you indicated that you can do more research, can you look into that?

Me neither, but I'm currently evaluating Timescale for an integration and it's quite interesting. In summary, as far as I know now, all the standard SQL features should work since in the end it's just PostgreSQL; the interesting Timescale-specific feature is the concept of "hypertables", which are optimized for time-series data. I'll continue learning about it this weekend; they have really good docs, so it shouldn't be too hard.
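
Roughly, a hypertable is created from a regular PostgreSQL table like this (table and column names here are just illustrative, not a proposed schema):

```sql
-- Plain PostgreSQL table; nothing TimescaleDB-specific yet.
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    device_id   TEXT NOT NULL,
    temperature DOUBLE PRECISION
);

-- Turn it into a hypertable, automatically partitioned into time-based chunks.
SELECT create_hypertable('sensor_readings', 'time');
```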

We can take the schema for ClickHouse as a starting point, and find alternatives for ClickHouse specifics that TimescaleDB doesn't have or does differently:

* Storing RX metadata directly in the `tts_uplink_messages` table. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?

* Storing decoded payloads directly in the `tts_uplink_messages` table. In ClickHouse we use Arrays for that, maybe in TimescaleDB that should be JSON?

* Inserts into the "main" table automatically getting transformed and inserted into the `tts_uplink_message_decoded_payloads` and `tts_uplink_message_rx_metadata` tables. In ClickHouse that's done with a materialized view. It looks like TimescaleDB does this in a similar way.

Let me know if you can pick this up.

I don't know the specifics of ClickHouse either; it would be easier for me to port this if it were plain SQL. I will investigate further and see if I can make a solution work. I'll let you know what my findings are and contribute the implementation if I'm successful... I can definitely use some help though :)
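
On the materialized view point specifically: since a hypertable is still a regular PostgreSQL table, one option could be a plain INSERT trigger that fans the RX metadata out into a separate table. This is only a rough sketch, assuming the metadata ends up stored as a JSONB array and using made-up table/column names plus the usual TTS `rx_metadata` fields:

```sql
-- Hypothetical fan-out trigger: copy each RX metadata entry from the JSONB
-- array on the "main" table into its own row on every insert.
CREATE OR REPLACE FUNCTION fan_out_rx_metadata() RETURNS trigger AS $$
BEGIN
    INSERT INTO tts_uplink_message_rx_metadata
        (received_at, device_id, gateway_id, rssi, snr)
    SELECT NEW.received_at,
           NEW.device_id,
           md -> 'gateway_ids' ->> 'gateway_id',
           (md ->> 'rssi')::numeric,
           (md ->> 'snr')::numeric
    FROM jsonb_array_elements(NEW.rx_metadata) AS md;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fan_out_rx_metadata_trigger
    AFTER INSERT ON tts_uplink_messages
    FOR EACH ROW EXECUTE FUNCTION fan_out_rx_metadata();
```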

pablojimpas commented 2 years ago

FYI: So far I've tried using the JSONB data type for the columns correlation_ids, rx_metadata and decoded_payload, and it seems to do the trick. I have a pretty basic example working (not comparable to the ClickHouse one), but if you want I can make a PR to add it.
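
The experiment looks roughly like this (trimmed down and with assumed column names, so take it as a sketch rather than a final schema):

```sql
-- JSONB for the nested/variable parts, plain columns for the rest,
-- and a hypertable partitioned on the receive time.
CREATE TABLE tts_uplink_messages (
    received_at     TIMESTAMPTZ NOT NULL,
    application_id  TEXT NOT NULL,
    device_id       TEXT NOT NULL,
    f_port          INTEGER,
    correlation_ids JSONB,
    rx_metadata     JSONB,
    decoded_payload JSONB
);

SELECT create_hypertable('tts_uplink_messages', 'received_at');
```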

htdvisser commented 2 years ago

Thanks for looking into this. I have a couple of thoughts on it:

  1. What do you think about using character varying[] type for the correlation IDs instead of JSONB?

  2. How efficient is it to run queries over JSONB rx_metadata compared to having the processor insert the RX metadata to a separate table and querying from there? Or do you plan to use a continuous query to extract the metadata from JSON into a separate table?

  3. Same question for decoded_payload.

  4. One concern I'd have with using JSONB for decoded_payload is related to encoding:

    RFC 7159 permits JSON strings to contain Unicode escape sequences denoted by \uXXXX. In the input function for the json type, Unicode escapes are allowed regardless of the database encoding, and are checked only for syntactic correctness (that is, that four hex digits follow \u). However, the input function for jsonb is stricter: it disallows Unicode escapes for characters that cannot be represented in the database encoding. The jsonb type also rejects \u0000 (because that cannot be represented in PostgreSQL's text type), and it insists that any use of Unicode surrogate pairs to designate characters outside the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes are converted to the equivalent single character for storage; this includes folding surrogate pairs into a single character.

    (source: https://www.postgresql.org/docs/current/datatype-json.html)

    It's not uncommon for (decoded) payloads to contain \u0000, so if we used JSONB we wouldn't be able to store those payloads. If TimescaleDB is just as happy ingesting JSON, I think it's better to use that (at least for decoded_payload).

    Note that this may not be a big issue if the processor (like the ClickHouse processor) filters out non-numeric values (a quick check of the jsonb behaviour is sketched below).
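
To make the difference concrete, here's a quick check in psql, following the documented behaviour quoted above:

```sql
-- Accepted: json only validates syntax and stores the text as-is.
SELECT '{"reading": "\u0000"}'::json;

-- Rejected: jsonb decodes escapes on input and cannot represent \u0000,
-- so this fails with an "unsupported Unicode escape sequence" error.
SELECT '{"reading": "\u0000"}'::jsonb;
```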

pablojimpas commented 2 years ago

  1. What do you think about using character varying[] type for the correlation IDs instead of JSONB?

You're right, using an array would be a better fit for that field.

  2. How efficient is it to run queries over JSONB rx_metadata compared to having the processor insert the RX metadata to a separate table and querying from there? Or do you plan to use a continuous query to extract the metadata from JSON into a separate table?

  3. Same question for decoded_payload.

I haven't run any benchmarks. I know the JSONB type adds some overhead on inserts, but after that it's supposed to be quite efficient, and you can even create indexes on it. However, I don't think this would be an issue for my use case at least: I plan to store the decoded payload as-is, then query the table to get the whole JSONB field, and the business logic will figure out which JSON fields to look for and how to interpret them (I don't want to put that logic inside a Benthos processor).
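
For example, a GIN index over the JSONB column keeps containment and key-existence queries reasonably fast (column names as in my sketch above, and the queried field is made up):

```sql
-- Generic GIN index over the whole decoded payload; supports the @> and ? operators.
CREATE INDEX tts_uplink_messages_decoded_payload_idx
    ON tts_uplink_messages USING GIN (decoded_payload);

-- Example containment query that can use the index.
SELECT received_at, device_id
FROM tts_uplink_messages
WHERE decoded_payload @> '{"alarm": true}';
```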

4. It's not uncommon for (decoded) payloads to contain \u0000, so if we used JSONB we wouldn't be able to store those payloads. If TimescaleDB is just as happy ingesting JSON, I think it's better to use that (at least for decoded_payload). Note that this may not be a big issue if the processor (like the ClickHouse processor) filters out non-numeric values.

I'm not sure I understand this issue; so far I haven't run into any problems with encoding. The decoded payloads that I've tested so far look like this:

```json
{
  "soilK": 74,
  "soilN": 22,
  "soilP": 29,
  "soilEC": 306,
  "soilPH": 5.83,
  "soilMoisture": 16.55,
  "soilTemperature": 23.19
}
```

I'm not doing any fancy processing in the Benthos pipeline like in the ClickHouse case; I'm just storing the raw decoded payload with `uplink_message.decoded_payload.string()`.
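
On the query side, the business logic can then pull individual fields straight out of the stored JSONB, for example (again assuming the column names from my sketch above; time_bucket is TimescaleDB's bucketing function):

```sql
-- Hourly average soil moisture per device, read directly from the JSONB column.
SELECT time_bucket('1 hour', received_at) AS bucket,
       device_id,
       avg((decoded_payload ->> 'soilMoisture')::double precision) AS avg_soil_moisture
FROM tts_uplink_messages
WHERE received_at > now() - INTERVAL '1 day'
GROUP BY bucket, device_id
ORDER BY bucket, device_id;
```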