Add `data_policy` field in `device`

oscgonfer commented 4 months ago

Our database is licensed under https://opendatacommons.org/licenses/odbl/

However, we have only one way for the user to control their data by the is_private field. We would like to improve that by adding some additional fields:

precise_location: bool whether or not the location is to be stored with X number of digits precision or not
allow_forwarding: bool for users to allow data to be forwarded or not to other platforms (data aggregators)

`allow_forwarding`

This is linked to https://github.com/fablabbcn/smartcitizen-api/issues/226, and the idea behind it is that there are various projects in which we are involved in that require having devices data in other servers (servers from other projects). Some of these projects have devices that Fab Lab Barcelona handles, but some other have devices that anyone (i.e. any user) handles, for whom we should not take the decision of actively forwarding their data. In this sense, I define actively as Fab Lab BCN sending device data directly elsewhere, instead of a 3rd party requesting data openly via the API.

For the particular case of "us forwarding data": the allow_forwarding would potentially require us to list those 3rd parties and informing what those 3rd parties would do with the data. To avoid issues, we could specify that we only actively forward data that is marked as forwardeable to specific data aggregators.

`precise_location`

Regarding location, we currently keep high resolution location, as defined by the user by selecting it on the map. We do not actively update that location, nor we have any other means to determine it besides the user selection on the map. However, it would be a good policy to avoid issues by not storing high resolution location data at all, and by default, storing 3-4 digits only. However, this comes in a bit late, and we may affect how people are using it now. Although not ideal, we could have an opt-in field which would keep high resolution location. For all existing devices, this could be opted-in and users could choose to loose that precision by opting out. For new devices, we could have it low-precision by default, and users could opt-in.

The device serialisation could read:

device:
    data_policy:
        private: bool (False by default)
        precise_location: bool (False by default)
        allow_forwarding: bool (False by default)

This is a very interesting issue, which can be discussed lengthy. I would really appreciate some contribution and other thoughts, @pral2a and @timcowlishaw

oscgonfer commented 3 months ago

Re. the forwarding field, and on the actual implementation of what's needed.

The redis queue data-received (used for push notifications) should contain everything needed for making the forwarding available at platform level.

On a previous take, I was considering bridging the mqtt broker, but this approach is flawed, as we are missing so much information from the device (even the id). This would make us have to create another connection to the database, or keep all the information needed for the forwarding in a separate location (and update it), which doesn't make any sense.

In this sense, I think it makes more sense to think of a small service that can subscribe to the redis queue and send whatever has allow_forwarding = true to a queue of workers for transforming the payloads in case it's necessary. Those workers can then forward that data to another queue (a results store) and a task can subscribe to that for actual forwarding. The image below shows the new blocks in yellow

imagen

In smartcitizen-flows I have a couple of very dirty prototypes of how those conversions would work for the projects we currently have. There is also a prototype of how those workers can be implemented with celery, given that all scdata is written in python and it's based on that. If we think this part well, we can maybe make it so that those workers also process the data of incoming.

pral2a commented 3 months ago

Just adding to the conversation. Notice that the suggested approach favours existing solutions and tries to reduce complexity. There might be other, more standard solutions, yet they will imply more significant architectural changes. See #265

Ingestion Flows summary

SmartCitizen - Ingestion Flows (2)

Storer Class functions

See source

Updates Device model
Stores timeseries to Kairos by submitting a partial of the Device data to the Redis Queue telnet_queue
Publishes the full Device model to the Redis Queue data_received for other services to consume the data. Currently only used by the push service, that uses Socket.IO to openly stream (public) data over WebSocket at i.e. ws.smartcitizen.me.

Proposals

General Forwarding

The push service already provides a general forwarding method when the forwarded part doesn't expect a customized message. Replacing the push service and even the data_received queue with the mqtt service could increase reliability (trusted broker stability) and maintenance (less code to maintain).

The broker supports MQTT (and MQTT over WebSocket).
The broker AAA features could manage access to each receiving platform. Credentials can be manually granted or programmatically over the broker REST API. New features introduced for device AAA will also benefit data consumers / forwarded platforms using this method.
Forwarded services unable to subscribe to the broker could use the Data Bridge feature in EMQX. However, each forwarded service must be manually configured or parametrized using the broker REST API.
The broker will also become a data provider, and the push service will be deprecated (ending ws.smartcitizen.me).

Custom Forwarding

As @oscgonfer described in the previous task, Custom Forwarding involves some kind of message transformation to match each platform's expected standards.

Potential Options

Multiple forwarding services data_received could remain an internal queue to which multiple forwarding services could subscribe. The Redis configuration for data_received to support various subscribers needs to be assessed. Services could be orchestrated as separate containers like push. That considers forwarding not only implies customizing the message but also the connection to each platform.
Built-in forwarding services: Assuming all forwarded services connect to our platform the same way, we could take the General forwarding approach to delivering messages through the mqtt service, yet we extend the Storer class with multiple methods to publish data to the broker based on the requirements of various data receivers. Further on, we could add an initializer class that parametrizes the broker using the REST API to set receivers AAA and/or Data Bridge options in the future.
Combined Both options, including General Forwarding, could be combined or co-exist with minor adjustments.

pral2a commented 3 months ago

After talking to @oscgonfer sharing some ideas on a future Forward Class

A Forward instance defines how data from each registered device can be forwarded to a specific platform after user consent.

An instance should perform the following functionalities:

Create a boolean property within each Device to enable users to allow data to be forwarded to the Forward platform
(Optionally) Run custom code whenever a Device is created or updated to communicate with third-party platforms, register the new Device, or obtain information.
(Optionally) Store custom information and other relevant metadata (last time forwarded) per Device (one-to-one)
(Optionally) Run custom code every time a Device publishes (see Storer) to create a custom payload containing the latest data in a form compliant with the receiving platform
(Optionally) Parametrize the MQTT broker with specific topics, such as authorization and authentication or broker bridges, that are particular to how the receiving platform connects.
Publish the custom or standard payload to the MQTT-specific topics created before every time data is received.
(Maybe) Allow custom code to forward the payload to the platform instead of the MQTT option. i.e., the receiving platform expects data over the GraphQL interface.

Notice: That implementation doesn't provide a mechanism to recover data that has not been forwarded in real-time. However, the broker via session can ensure data is delivered once a forwarded platform reconnects. Nevertheless, that makes me think again that looking at Kafka is an option.

oscgonfer commented 3 months ago

Some additional comments to complete with my notes:

Create a boolean property within each Device to enable users to allow data to be forwarded to the Forward platform.

This is the data_policy>allow_forwarding (bool) field at the top of this issue. User should make this choice, although in some cases, they may be forcibly opting in by participating in the project in question (i.e. CitiObs).

(Optionally) Run custom code whenever a Device is created or updated to communicate with third-party platforms, register the new Device, or obtain information. (Optionally) Run custom code every time a Device publishes (see Storer) to create a custom payload containing the latest data in a form compliant with the receiving platform

As said, both these would be very interesting if we implement them with some sort of schema that can refer to this code (short of a lazy callable) when needed. The code in these schemas can convert from our serialization of choice (what would be the equivalent to current data-received) into something else, both for creation of entities elsewhere, or data forwarding.

(Optionally) Store custom information and other relevant metadata (last time forwarded) per Device (one-to-one)

We can use (and already do) the postprocessing>forwarding_params field for that, BUT, I would suggest to rethink if those are columns in device, or separate tables (currently the latter) for performance. We can improve indexing on this too.

Notice: That implementation doesn't provide a mechanism to recover data that has not been forwarded in real-time. However, the broker via session can ensure data is delivered once a forwarded platform reconnects. Nevertheless, that makes me think again that looking at Kafka is an option.

This is key. However, in our kafka sprint we hit the paywalls quite quickly. Maybe we should now take it on again. If we do not go for something like kafka, I am not sure how much we should rely on the message broker for the persistence on the "way out" of our platform (which I agree is perfect for "the way in"). At the compromise of not having a "fully available" system, I would say that if data is not available "real-time", there is always a way to request it via REST API.

A diagram to show the updated design after our conversation (@pral2a please correct if something wrong):

imagen

Note: with all this, we would remove push service too (and the redis channels for data-received and token-received too (TBC)).

oscgonfer commented 2 months ago

include the auth credentials that we are discussing here

oscgonfer commented 1 month ago

Notes: Any user can have token/username secrets, but we only forward the devices that have forwarding authorized

oscgonfer commented 1 month ago

@pral2a @timcowlishaw conclusions from today's session:

Short term

[x] Kill ws Redis in the storer.rb
[x] Kill the push service and remove it from compose.yml

Next (1)

[x] Include fields in device for device_policyas described below:

device:
    data_policy:
        private: bool (False by default)
        precise_location: bool (False by default)
        allow_forwarding: bool (False by default)

[x] Keep in mind that we will not change the precision of the location now, but by default we should not store it from now on

Next (2)

[ ] Forwarding goes into a job (for now it still queries the DB for rendering the JSON payload)
[ ] Check what we need to do if we want to scale sidekiq (multiple machines/processes/...)
[ ] Check that the jobs are triggered on HTTP posts as well for the processing of data
[ ] Profiling again and compare

Next (3)

[ ] Check how we can avoid querying the database for each forwarding and follow a predefined schema
[ ] Check for ways to batch forward messages upon batch processing
[ ] Check if we can make the JSON serialisation more optimized to avoid sensing some fields in device that are not necessary AND potentially add custom formats???
[ ] Profiling again and compare

Next (4)

[ ] Decide if we want to move postgres to a job?
[ ] Profiling again

oscgonfer commented 2 weeks ago

More on location accuracy and number of digits: https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude