eclipse-ditto / ditto

Eclipse Ditto™: Digital Twin framework of Eclipse IoT - main repository
https://eclipse.dev/ditto/
Eclipse Public License 2.0

Historical data from Ditto #545

Closed ghost closed 2 years ago

ghost commented 4 years ago

We are happy users of Ditto and I'd like to share how we use it to see if there is value in extending Ditto to support historical data. (I do not want to imply any desired timeline.)

We use Ditto to provide the last state of devices. Because Ditto does not make historical data of devices available, we run an InfluxDB with a custom service to feed data from Ditto into InfluxDB and to provide the data from InfluxDB to other services, e.g. all features' values since the device was added.
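For illustration, a minimal sketch of what such a feeding service could look like, assuming the Ditto Java client (1.x method names) and the influxdb-java client; the measurement name `thing_events` is made up, and the client/InfluxDB connection setup as well as database selection are omitted:

```java
import java.util.concurrent.TimeUnit;

import org.eclipse.ditto.client.DittoClient;
import org.influxdb.InfluxDB;
import org.influxdb.dto.Point;

// Minimal sketch of a Ditto -> InfluxDB bridge (connection setup omitted).
// Method names assume the 1.x Ditto Java client; "thing_events" is hypothetical.
public final class HistoryBridge {

    public static void start(final DittoClient ditto, final InfluxDB influx) {
        // subscribe to twin change events via the WebSocket connection
        ditto.twin().startConsumption().toCompletableFuture().join();

        ditto.twin().registerForThingChanges("influx-forwarder", change -> {
            final Point point = Point.measurement("thing_events")
                    .tag("thingId", change.getThingEntityId().toString())
                    .tag("path", change.getPath().toString())
                    // store the changed JSON value as a string field
                    .addField("value", change.getValue().map(Object::toString).orElse("null"))
                    .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                    .build();
            influx.write(point);
        });
    }
}
```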

I wonder whether other users of Ditto do similar things and whether it would be desirable to make historical data available directly from Ditto.

I am aware of #174 where @mhumoglu showed interest in such functionality and where @thjaeckle mentioned that historical values are available in the database.

thjaeckle commented 4 years ago

Hi Sebastian.

Indeed, the historical values can be available in the database. By default they are cleaned up, unless configured otherwise via the environment variable PERSISTENCE_CLEANUP_ENABLED available in the "concierge" service: https://github.com/eclipse/ditto/blob/master/services/concierge/starter/src/main/resources/concierge.conf#L78

The historical values are, however, a "side effect" of the persistence approach Ditto uses. It uses Akka Persistence, which itself is built on the Event Sourcing pattern. With that pattern only inserts are done to the database, never updates - that way all changes to e.g. "things" managed in Ditto are "delta changes" persisted as separate documents in MongoDB.

So if these old events are never deleted, the complete history may be available. As said in #174, however, Ditto does not have an API to retrieve those historical values.

There are some challenges in order to get this right:

So there are quite a few things to do. And to be honest: the team does not have that on the Ditto agenda, so we would rely on contributions.

And I don't know if this would really help you solve your problem. It would not at all be comparable with a time series database like InfluxDB. The most you would get from such an API is a stream of change events for one or several things - without any aggregation or statistical calculation. Is that something which would help you?

What probably could be interesting for you is the recently added "HTTP Push" connectivity feature (https://www.eclipse.org/ditto/2019-10-17-http-connectivity.html), which lets you invoke a custom HTTP endpoint e.g. for each "twin event" that occurred. In combination with JavaScript-based payload transformation you could probably get rid of your "custom service" between Ditto and InfluxDB and directly push data in a normalized form via HTTP to the InfluxDB API.

demetz commented 4 years ago

@selobosi we are also currently looking into a) how to feed data to Influx for persistence and b) how to extract it again while considering the policy.

In my opinion, the two biggest issues are:

thjaeckle commented 4 years ago

Just yesterday I was watching an intro video on Apache IoTDB, a new project intended for storing time series data suited to IoT use cases. I don't believe that e.g. InfluxDB or any other conventional time series database is a good fit for massive sensor data.

IoTDB might be worth looking into.

@demetz I think somehow making the policy-based authorization work with time series data would be the hardest thing to solve.

What I once thought of: just store the "read subjects" of a thing event as tags on each time series document. The "read subjects" are the list of policy subjects allowed to read an event (e.g. a thing modification) and are already available for each event in Ditto. That way a query to the time series database could always include a "where clause" selecting the "auth subjects" (authenticated subjects, e.g. a username), and the time series database would only return authorized datasets.
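To make that idea a bit more concrete, a rough sketch (again assuming influxdb-java; the measurement and tag names are hypothetical) could store the read subjects as a tag on write and filter on the authenticated subject in every query:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.influxdb.InfluxDB;
import org.influxdb.dto.Point;
import org.influxdb.dto.Query;
import org.influxdb.dto.QueryResult;

// Sketch: persist an event's "read subjects" as a tag and let every query
// filter on the authenticated subject, so the TS database only returns
// authorized datasets. Names ("thing_events", "readSubjects") are hypothetical.
public final class ReadSubjectTagging {

    static void writeEvent(final InfluxDB influx, final String thingId,
            final List<String> readSubjects, final double value) {
        final Point point = Point.measurement("thing_events")
                .tag("thingId", thingId)
                .tag("readSubjects", String.join(",", readSubjects))
                .addField("value", value)
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                .build();
        influx.write(point);
    }

    static QueryResult queryForSubject(final InfluxDB influx, final String database,
            final String thingId, final String authSubject) {
        // the WHERE clause restricts the result to events the subject may read
        // (illustration only - input handling is not hardened here)
        final String influxQl = String.format(
                "SELECT * FROM thing_events WHERE \"thingId\" = '%s' AND \"readSubjects\" =~ /%s/",
                thingId, authSubject);
        return influx.query(new Query(influxQl, database));
    }
}
```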

That approach would however require Ditto to have a deep understanding of and integration with a time series database. That is currently not on the project's agenda, but we're open for contributions. :wink: But first a well-suited time series database would have to be agreed on.

jixuan1989 commented 4 years ago

Hi, I come from the IoTDB community. Lukas Ott and Julian Feinhauer told me about Ditto.

Indeed, in one of our deployments we tried to use an IoTDB instance to manage ~600 devices with 6,000 metrics each (i.e., 3.6M time series in total) at 5 Hz data ingestion per time series (i.e., 3.6M × 5 points per second). Actually, we also tried 800 devices × 8,000 metrics, and it also works. All metrics generate float or boolean data.

However, we used a server with 128 GB memory (-Xmx100g), an Intel Xeon E5 CPU, and an HDD. (Using such hardware, IoTDB can reach an ingestion throughput of ~30 million points per second using the batch-write API.)

We also tried to manage time series with a high collection frequency, e.g., 500 kHz per time series. If we use the batch-write API, e.g., writing 100 points per API call, we can support such high-frequency data collection.
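For reference, batched writes with the IoTDB Java Session API look roughly like this (a sketch based on recent IoTDB versions - method names have changed between releases, and the device/measurement names are made up):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.iotdb.session.Session;
import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;
import org.apache.iotdb.tsfile.write.record.Tablet;
import org.apache.iotdb.tsfile.write.schema.MeasurementSchema;

// Sketch of batched writes with the IoTDB Session API; "root.demo.device1"
// and the measurements are hypothetical examples.
public final class IoTDBBatchWriteExample {

    public static void main(final String[] args) throws Exception {
        final Session session = new Session("127.0.0.1", 6667, "root", "root");
        session.open();

        final List<MeasurementSchema> schemas = Arrays.asList(
                new MeasurementSchema("temperature", TSDataType.FLOAT),
                new MeasurementSchema("online", TSDataType.BOOLEAN));

        // buffer up to 100 rows and send them in a single insertTablet call
        final Tablet tablet = new Tablet("root.demo.device1", schemas, 100);

        for (long t = 0; t < 100; t++) {
            final int row = tablet.rowSize++;
            tablet.addTimestamp(row, t);
            tablet.addValue("temperature", row, 20.0f + row);
            tablet.addValue("online", row, true);
        }
        session.insertTablet(tablet);   // one RPC for the whole batch
        tablet.reset();

        session.close();
    }
}
```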

Of course IoTDB can also run on a Raspberry Pi, but I did not test how many time series it can support on such a 2 GB memory machine (maybe 20K–100K time series is fine).

I do not know which scenarios Ditto wants to support and which kind of hardware Ditto is meant to run on. I hope the above info helps to evaluate whether IoTDB is suitable.

Besides, IoTDB supports good query performance for time range queries. (Indeed, we are improving our query performance and may finish that in 0.10 or 0.11.) A known issue is that if there is too much data in IoTDB's memory, some kinds of queries may be slower on that in-memory data than on data on disk (very strange, right? It is a design problem and may be solved in the future).

One thing you need to know is that IoTDB's schema is quite different from InfluxDB's: we have a concept of a Storage Group, which heavily impacts write performance and memory cost.

A good thing is that IoTDB is written in Java, and we are glad to find that Ditto is also written in Java, which can make the integration easier.

thjaeckle commented 4 years ago

I think it could really be beneficial to look into the integration of IoTDB into Ditto for historical data.

However, I don't have the time to work on an implementation or PoC for that, as historical data is not on our immediate open-source or commercial agenda for Ditto. I can support when questions arise and maybe do a little PoC during my "lab day" time, which we get from time to time to experiment with new technology, but that won't be sufficient to turn this into a full-fledged feature of Ditto.

As a starting point when someone wants to proceed with the evaluation:

JulianFeinauer commented 4 years ago

Hi, just moving over from #592 (thanks for the hint @thjaeckle). Regarding your last question on #592: there currently is no equivalent to "tags" (as e.g. InfluxDB uses them) to describe authorization. Currently this would be done on the series level, not on single points. For many use cases I know of, that seems reasonable enough. Perhaps, if I find some time in the next days, I could try to spin up a simple example "gateway" which translates the journal events to an IoTDB instance.

JulianFeinauer commented 4 years ago

Just a short update here. I am currently running a working example that combines IoTDB as a "history" with Ditto; see the attached architecture diagram.

The major drawback is the management of different users at different points (MQTT, Ditto, IoTDB). So I would be highly interested in creating a service in Ditto which "proxies" all time series requests and applies its security context / policy.

@thjaeckle could you see something like that as a Ditto feature? And if so, what would be a good starting point from your perspective? @jixuan1989 I will discuss this with you as well!

w4tsn commented 4 years ago

We also have a history connected to our things, although we use different tooling.

[ditto-historian diagram]

The components involved are Eclipse Vorto, a micro-service serving the two purposes of storing and retrieving history, InfluxDB, an MQTT 5 connector, and a special, configurable proxy serving the actual data from the database.

In our solution the history is managed through a micro-service connected to Ditto via WebSockets, which has two main purposes: receive incoming changes to the digital twin and store them in InfluxDB, and listen to Ditto messages to retrieve data from InfluxDB.

In order to not send big historical queries through Ditto (which fails due to limits, is slow due to missing compression, etc.), the historian places a query in a configurable proxy with access to InfluxDB and returns only a temporary link with a status code of 303. This way the client can request history under its policy restrictions through Ditto while getting high-performance queries (almost) directly from InfluxDB.
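A very reduced sketch of that temporary-link pattern (not the actual proxy described above - all endpoints and names are hypothetical, and only the JDK's built-in HTTP server is used) could look like this:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;

// Hypothetical sketch of the "temporary link" pattern: the historian registers
// an InfluxDB query under a one-time token and the client is redirected (303)
// to /results/{token}, served by this proxy. Executing the query is stubbed out.
public final class HistoryProxy {

    private final ConcurrentMap<String, String> pendingQueries = new ConcurrentHashMap<>();

    // called by the historian after it has checked the Ditto policy
    public String register(final String influxQuery) {
        final String token = UUID.randomUUID().toString();
        pendingQueries.put(token, influxQuery);
        return "/results/" + token;   // path the 303 Location header would point to
    }

    public void start() throws IOException {
        final HttpServer server = HttpServer.create(new InetSocketAddress(8085), 0);
        server.createContext("/results/", this::serveResult);
        server.start();
    }

    private void serveResult(final HttpExchange exchange) throws IOException {
        final String token = exchange.getRequestURI().getPath().substring("/results/".length());
        final String query = pendingQueries.remove(token);   // one-time use
        if (query == null) {
            exchange.sendResponseHeaders(404, -1);
            exchange.close();
            return;
        }
        // here the real proxy would run the query against InfluxDB
        final byte[] body = ("result of: " + query).getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().add("Content-Type", "text/plain");
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }
}
```

The historian would then answer the client's history request through Ditto with a 303 whose Location header points at the registered path, instead of relaying the InfluxDB payload itself.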

Access control is managed mainly through Ditto policies. The policy controls what the historian may store through permissions on features:/. Any client going through Ditto is restricted via policies in what history it may retrieve. The historian stores incoming twin changes with their thing ID, namespace, and property path. With the explicit and implicit information known from the WebSocket connection, the historian can handle requests as intended by the policies. E.g. the thing ID and property path are explicit, while where the Ditto policy allows a subject to send messages to is implicit. For deeper application of policies we may incorporate the Policy API and even read policy objects from Ditto. MQTT 5 allows us to restrict gateways via their client IDs to only write data to the things they should, which is not possible with MQTT 3.

We try to route every twin-related access through Ditto so we can rely on Ditto's policy framework: either we query features from a thing, or we get additional functionality like the history from additional micro-services answering Ditto messages.

I hope this already helps with the issue and gives another idea about the practical usage of history together with Ditto :)

JulianFeinauer commented 4 years ago

Thanks for the detailed write-up @w4tsn! Your setup looks well thought out. One question that remains for me is what kind of messages you use and how you restrict them. Do you use Ditto messages with arbitrary payloads?

thjaeckle commented 4 years ago

@w4tsn

the historian places a query in a configurable proxy with access to InfluxDB and returns only a temporary link with a status code of 303. This way the client can request history under its policy restrictions through Ditto while getting high-performance queries (almost) directly from InfluxDB.

That is really clever. Kudos to that solution.

Could Ditto make this easier in any way? Basically what @JulianFeinauer thought about as well, e.g. providing a specific history API.

Relaying the payload through Ditto is really something which should be prevented here, so the redirect to a temporary URL is great.

It would be great to document this as some kind of "pattern" or best practice for integrating third-party aspects of a digital twin via Ditto while still ensuring authorization.

JulianFeinauer commented 4 years ago

@thjaeckle @w4tsn indeed. This would be one idea. Another idea could be to add an API in Ditto which allows others to use Ditto's auth mechanism to check specific requests.

thjaeckle commented 4 years ago

@JulianFeinauer rather than adding such an API via HTTP, I would recommend using the policy and the PolicyEnforcer and evaluating the outcome locally, without a remote request. That is already possible; the PolicyEnforcers are part of the Java API: https://github.com/eclipse/ditto/blob/master/model/enforcers/src/main/java/org/eclipse/ditto/model/enforcers/PolicyEnforcers.java

You just provide a Policy (e.g. retrieved via HTTP) and then call methods like hasUnrestrictedPermissions - even for resources not known to Ditto and with other permissions in addition to READ and WRITE.
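A minimal sketch of such a local check, assuming the Ditto 1.x Java packages - the resource type "history", the resource path, and the subject are made up for illustration:

```java
import org.eclipse.ditto.model.base.auth.AuthorizationContext;
import org.eclipse.ditto.model.base.auth.AuthorizationModelFactory;
import org.eclipse.ditto.model.base.auth.AuthorizationSubject;
import org.eclipse.ditto.model.enforcers.Enforcer;
import org.eclipse.ditto.model.enforcers.PolicyEnforcers;
import org.eclipse.ditto.model.policies.Permissions;
import org.eclipse.ditto.model.policies.Policy;
import org.eclipse.ditto.model.policies.ResourceKey;

// Sketch of local policy enforcement with the Ditto Java API (1.x package names).
// The policy is assumed to have been retrieved e.g. via HTTP; the resource type
// "history", the path and the subject ID are hypothetical.
public final class LocalHistoryAuthorization {

    static boolean mayReadHistory(final Policy policy, final String authSubjectId) {
        final Enforcer enforcer = PolicyEnforcers.defaultEvaluator(policy);

        final AuthorizationContext authContext = AuthorizationModelFactory.newAuthContext(
                AuthorizationSubject.newInstance(authSubjectId));

        // "history" is not a resource type Ditto knows - the enforcer does not care
        final ResourceKey resource = ResourceKey.newInstance("history", "/features/temperature");

        return enforcer.hasUnrestrictedPermissions(resource, authContext,
                Permissions.newInstance("READ"));
    }
}
```

As mentioned above, the same call works with permissions other than READ and WRITE.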

JulianFeinauer commented 4 years ago

@thjaeckle this is a neat idea, indeed. So one could fetch a Policy and check against existing subjects to see whether they should be allowed to do... things or not?

thjaeckle commented 4 years ago

@JulianFeinauer yes, exactly. The same API is used in the Ditto backend 😉

ffendt commented 4 years ago

Just a thought about the history of things and the policy enforcement for it: Do I want the user to only see history of a thing from timestamp x to timestamp y? Do I want to allow the user to see history of a thing for different periods (E.g. x1 to y1, x2 to y2, ...)?

From my experience I would affirm both questions. A simple solution would be to use the policy at timestamp x when getting the data of a thing at timestamp x (e.g. simplified by adding the read subjects to the historical entries of things, as @thjaeckle assumed). Users will, however, only be able to see the history from the point at which they were added to the policy. Maybe you would also want them to see history from before that point in time. That's where we would probably need to enhance the policy structure.

thjaeckle commented 4 years ago

@ffendt I assumed (for simplicity) that the current policy allows a user to read the complete history or nothing.

If we want to apply the policy also to historical records, I guess the best way would be to add this information to the time series database when inserting the data, and to build a query which includes the user "subject" performing the query. But as I understood, e.g. IoTDB does not have means to insert additional "tags" where such information could be put.

JulianFeinauer commented 4 years ago

I agree with @ffendt that this would be the best solution. And yes, @thjaeckle is right. But one could also mimic that with an additional layer which creates a separate IoTDB series for each policy... then this would work.

w4tsn commented 4 years ago

Thanks @thjaeckle, that means a lot to me. I could write a first draft down in a PR and you may move it to a section of your choosing in the docs and enrich it with more details from your perspective. I wanted to write something for my blog anyways :)

Regarding an enhancement of Ditto's API, I'll have a coffee over it and ask my frontend engineers whether they have any 'desires' in that regard :+1:

Regarding policies and historical access, here is my current perspective:

Writing policy information as metadata onto a series is often not (easily) reversible and probably also not desired. It implies that a specific access state is known at write time of the data point and will never change in the future. So in order to later change the access permissions of a series, one would have to change the subjects of 'historical policies'. That means we'd have to store the history of policies, but allow changing their subjects. Correct me if I'm wandering off somewhere... @JulianFeinauer I think with additional series in IoTDB it's the same? I'm not familiar with IoTDB's mechanics, so I struggle to understand the implications of separate series per policy. Maybe you could elaborate on that a bit more?

Having policies with an optional timestamp extension for historical data feels much more natural to me, since the policy then is always a current perspective, which may grant/revoke access to the history from a somewhat elevated position.

@ffendt in our scenario, currently, a user either has access to the history or not. There are currently no plans to further restrict this on a per-timestamp basis, but who knows when the next feature request comes in. I'm open to ideas though, and besides, this topic is quite fascinating.

Oh, and regarding the messages we use @JulianFeinauer:

Indeed we use Ditto messages over the live channel with arbitrary payloads, which are captured, processed, and answered by the micro-service. The payloads (as JSON) are what actually defines the parameters of the query (without the tags, database names, and measurements required for authorization). @thjaeckle I suppose that this is one of the things that would be moved to an HTTP API, instead of using loose payloads. The frontend team misses the Swagger docs and RESTiness when using our Ditto messages APIs :sweat_smile:

thjaeckle commented 4 years ago

@w4tsn

I could write a first draft down in a PR and you may move it to a section of your choosing in the docs and enrich it with more details from your perspective. I wanted to write something for my blog anyways :)

That would be great 👍

@thjaeckle I suppose that this is one of the things that would be moved to an HTTP API, instead of using loose payloads. The frontend team misses the Swagger docs and RESTiness when using our Ditto messages APIs 😅

Maybe we can come up with a descriptive approach where a piece of configuration would define which HTTP APIs are provided, basically as a kind of "facade" in front of the messages HTTP endpoints.

JulianFeinauer commented 4 years ago

I started a discussion for the IoTDB approach (in parallel to what @w4tsn suggested). We already had a short discussion that one could extend the policy framework to also have a "READ_HISTORY" permission flag. As @thjaeckle suggested, one could use an IoTDB interceptor together with the PolicyEnforcer to reach the goal via native IoTDB clients.

See here for the IoTDB Issue: https://issues.apache.org/jira/browse/IOTDB-648

jixuan1989 commented 4 years ago

Hi, I came here by tracking IOTDB-648...

Do I want the user to only see history of a thing from timestamp x to timestamp y?

But as I understood, e.g. IoTDB does not have means to insert additional "tags" where such information could be put.

Yes, IoTDB currently does not support adding additional "tags" to a part of the data in a series (e.g., adding a tag "alert" to the data from timestamp x to y in series A). But I have already been considering that feature, not for permissions, but for adding useful information to series data to enrich the query semantics. So, if you have more requirements regarding tags on sub-series, please share them with me and let's make this feature more useful.

I agree with using Ditto to control permissions. Actually, IoTDB itself has permission control, i.e., allowing a user to read or write a given time series. However, a database user is totally different from a terminal user (or client). A database usually has several users (DBA, user1, user2, etc.), while we may have millions of terminal users (clients).

w4tsn commented 4 years ago

@thjaeckle

Maybe we can come up with a descriptive approach where a piece of configuration would define which HTTP APIs are provided, basically as a kind of "facade" in front of the messages HTTP endpoints.

I've spoken to some frontend developers and they really love the idea. I'm also getting pretty excited about it.

My current understanding is that my micro-service defines the message types / topics it is able to process and registers these with Ditto, so that a dynamic Swagger doc is updated / added showing new REST endpoints for certain things. Both updated REST API endpoints and a corresponding Swagger doc would be really neat.

Currently the micro-service just opens a WebSocket connection and handles all the API logic locally, i.e. which messages are actually accepted and processed. By the design of the WebSocket and policy it receives all messages it is allowed to read, and Ditto currently has no idea what the service is capable of. Right now I communicate the functionality to the developers on a separate channel.

I've opened issue #682 to track this idea.

JulianFeinauer commented 4 years ago

Another route that I am driving forward is to use the native IoTDB interface. I have already implemented an OpenID Connect integration for IoTDB for login / authentication, and if I used a Ditto PolicyEnforcer-based authorization module, you would be able to work normally against an IoTDB instance, provided you use the same OIDC provider as you use for Ditto.

w4tsn commented 3 years ago

In the meantime we also have a FastAPI backend that uses the requester's token (from the same Keycloak instance used with Ditto) to make a request to Ditto in the name of that user, in order to see whether a request is allowed before acting on other things. This is kind of an indirect check and also not as performant as doing the check locally. If I recall correctly, that's what @BobClaerhout is doing as well, or at least has talked about in the past.

@thjaeckle it would be great if the policy enforcer had an implementation in the client libraries (for JavaScript in this case, or e.g. for Python if a Python client appears some day). This implementation should then allow retrieving (and caching) the policies to do local enforcement.

An HTTP API specifically for checking authorization on certain properties would at least make it more explicit than just making a request in the name of one user. Also, elevated services currently still need to either impersonate another identity or fiddle with policies manually, which would be resolved more easily with that.

thjaeckle commented 3 years ago

@thjaeckle it would be great if the policy enforcer had an implementation in the client libraries (for JavaScript in this case, or e.g. for Python if a Python client appears some day). This implementation should then allow retrieving (and caching) the policies to do local enforcement.

For Java, this functionality exists. For other SDKs, we're open to pull requests - as those are mainly driven by the community, we're most likely not going to drive that ourselves. FYI: there recently was an initial contribution of a Python SDK: https://github.com/eclipse/ditto-clients-python

thjaeckle commented 2 years ago

I think it makes sense to close this issue, as natively supporting historical data is not among Ditto's goals. Integration with other time series databases is IMO the better approach - but as this has not been driven by anyone in the last years, I will close the issue.

thjaeckle commented 1 year ago

Accessing historical data was until recently out of scope for Ditto - now, with issue #1498, this will be added in Ditto 3.2.0.