fablabbcn / smartcitizen-api

The Smart Citizen Engine
https://developer.smartcitizen.me
GNU Affero General Public License v3.0

Add `data_policy` field in `device` #307

Open oscgonfer opened 4 months ago

oscgonfer commented 4 months ago

Our database is licensed under https://opendatacommons.org/licenses/odbl/

However, the only way users can currently control their data is through the is_private field. We would like to improve that by adding some additional fields:

allow_forwarding

This is linked to https://github.com/fablabbcn/smartcitizen-api/issues/226. The idea behind it is that we are involved in various projects that require device data to be available on other servers (servers from other projects). Some of these projects have devices that Fab Lab Barcelona handles, but others have devices that anyone (i.e. any user) handles, and we should not decide on their behalf to actively forward their data. Here, I define "actively" as Fab Lab BCN sending device data directly elsewhere, as opposed to a 3rd party requesting data openly via the API.

For the particular case of "us forwarding data": allow_forwarding would potentially require us to list those 3rd parties and state what they would do with the data. To avoid issues, we could specify that we only actively forward data that is marked as forwardable to specific data aggregators.

precise_location

Regarding location, we currently keep the high resolution location that the user defines by selecting it on the map. We do not actively update that location, nor do we have any other means to determine it besides the user's selection on the map. It would be good policy to avoid issues by not storing high resolution location data at all and, by default, storing 3-4 decimal places only. This comes a bit late, though, and it may affect how people are using the location now. Although not ideal, we could have an opt-in field that keeps the high resolution location. For all existing devices, this could be opted in, and users could choose to lose that precision by opting out. For new devices, we could make it low-precision by default, and users could opt in.

The device serialisation could read:

device:
    data_policy:
        private: bool (False by default)
        precise_location: bool (False by default)
        allow_forwarding: bool (False by default)
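As a minimal sketch of the serialisation above, the three flags could be modelled as a small value object with every default set to False. The class name `DataPolicy`, the helper `serialise_device` and the `id` key are illustrative assumptions, not part of the existing API.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DataPolicy:
    """Hypothetical container for the three proposed flags (all False by default)."""
    private: bool = False
    precise_location: bool = False
    allow_forwarding: bool = False


def serialise_device(device_id: int, policy: DataPolicy) -> dict:
    """Nest the policy under device.data_policy, as in the proposal.

    The "id" key is an assumption added only to make the example readable.
    """
    return {"device": {"id": device_id, "data_policy": asdict(policy)}}


# A device whose owner has opted in to forwarding only.
example = serialise_device(1234, DataPolicy(allow_forwarding=True))
```

Keeping the defaults in one place means a device created without an explicit policy is neither precise nor forwarded.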

This is a very interesting issue, which could be discussed at length. I would really appreciate contributions and other thoughts, @pral2a and @timcowlishaw

oscgonfer commented 3 months ago

Re. the forwarding field, and on the actual implementation of what's needed.

The redis queue data-received (used for push notifications) should contain everything needed for making the forwarding available at platform level.

On a previous take, I was considering bridging the mqtt broker, but this approach is flawed, as we would be missing a lot of information about the device (even the id). We would then have to create another connection to the database, or keep all the information needed for the forwarding in a separate location (and keep it updated), which doesn't make any sense.

So I think it makes more sense to have a small service that subscribes to the redis queue and sends whatever has allow_forwarding = true to a queue of workers, which transform the payloads where necessary. Those workers can then forward that data to another queue (a results store), and a task can subscribe to that for the actual forwarding. The image below shows the new blocks in yellow.

[image]
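The flow described above (filter on allow_forwarding, transform in workers, hand over to a results queue) could be sketched as follows. This is only an in-process illustration using the standard library `queue` module in place of redis and celery; the message shape (`data_policy`, `target`, `device_id`) and the function names are assumptions.

```python
import json
from queue import Queue
from typing import Callable, Dict

# Hypothetical type: turns our payload into whatever the target platform expects.
Transformer = Callable[[dict], dict]


def passthrough(message: dict) -> dict:
    """Default transformer for targets that need no custom payload."""
    return message


def select_forwardable(incoming: Queue, to_workers: Queue) -> int:
    """Drain `incoming` and enqueue only messages whose device allows forwarding."""
    moved = 0
    while not incoming.empty():
        message = json.loads(incoming.get())
        policy = message.get("data_policy", {})
        if policy.get("allow_forwarding") is True:
            to_workers.put(message)
            moved += 1
    return moved


def run_workers(to_workers: Queue, results: Queue,
                transformers: Dict[str, Transformer]) -> None:
    """Transform each selected message and push it to the results store."""
    while not to_workers.empty():
        message = to_workers.get()
        transform = transformers.get(message.get("target"), passthrough)
        results.put(transform(message))
```

A separate task would then read from `results` and do the actual sending, so a slow or unavailable third party does not block the filtering step.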

In smartcitizen-flows I have a couple of very rough prototypes of how those conversions would work for the projects we currently have. There is also a prototype of how those workers can be implemented with celery, given that scdata is written entirely in python and the prototype builds on it. If we think this part through well, we could perhaps have those workers process incoming data too.

pral2a commented 3 months ago

Just adding to the conversation. Notice that the suggested approach favours existing solutions and tries to reduce complexity. There might be other, more standard solutions, yet they will imply more significant architectural changes. See #265

Ingestion Flows summary

[Image: SmartCitizen - Ingestion Flows (2)]

Storer Class functions

See source

Proposals

General Forwarding

The push service already provides a general forwarding method for cases where the receiving party doesn't expect a customized message. Replacing the push service, and even the data_received queue, with the mqtt service could improve reliability (trusted broker stability) and ease maintenance (less code to maintain).

Custom Forwarding

As @oscgonfer described in the previous task, Custom Forwarding involves some kind of message transformation to match each platform's expected standards.

Potential Options

pral2a commented 3 months ago

After talking to @oscgonfer, sharing some ideas on a future Forward class:

A Forward instance defines how data from each registered device can be forwarded to a specific platform after user consent.

An instance should perform the following functionalities:

Notice: This implementation doesn't provide a mechanism to recover data that has not been forwarded in real time. However, the broker can, via sessions, ensure data is delivered once a forwarded platform reconnects. Nevertheless, that makes me think again that looking at Kafka is an option.

oscgonfer commented 3 months ago

Some additional comments to complete with my notes:

Create a boolean property within each Device to enable users to allow data to be forwarded to the Forward platform.

This is the data_policy>allow_forwarding (bool) field at the top of this issue. Users should make this choice, although in some cases they may be forcibly opting in by participating in the project in question (e.g. CitiObs).

(Optionally) Run custom code whenever a Device is created or updated to communicate with third-party platforms, register the new Device, or obtain information.

(Optionally) Run custom code every time a Device publishes (see Storer) to create a custom payload containing the latest data in a form compliant with the receiving platform.

As said, both of these would be very interesting if we implement them with some sort of schema that can refer to this code (a sort of lazy callable) when needed. The code in these schemas can convert from our serialization of choice (the equivalent of the current data-received) into something else, both for creating entities elsewhere and for data forwarding.
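One way to read "a schema that refers to code as a lazy callable" is a mapping from platform to a dotted path that is only imported when a payload actually has to be built. The sketch below assumes that reading; `FORWARDING_SCHEMAS`, `resolve`, `build_payload` and the `"module:function"` convention are all hypothetical, and `json.dumps` stands in for a real per-platform converter.

```python
from importlib import import_module
from typing import Callable

# Hypothetical schema registry: each hook is a "module:function" string,
# so nothing is imported until the hook is actually needed.
FORWARDING_SCHEMAS = {
    "example_platform": {
        "on_publish": "json:dumps",  # placeholder for a custom converter
    },
}


def resolve(path: str) -> Callable:
    """Import the module lazily and return the named callable."""
    module_name, _, attr = path.partition(":")
    return getattr(import_module(module_name), attr)


def build_payload(platform: str, reading: dict):
    """Convert our own serialisation into the platform's expected payload."""
    schema = FORWARDING_SCHEMAS[platform]
    return resolve(schema["on_publish"])(reading)
```

An `on_create` hook for registering devices on the third-party platform could be added to the same schema entry in the same way.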

(Optionally) Store custom information and other relevant metadata (last time forwarded) per Device (one-to-one)

We can use (and already do) the postprocessing>forwarding_params field for that, BUT I would suggest rethinking whether those should be columns in device or separate tables (currently the latter) for performance. We can improve indexing on this too.

Notice: That implementation doesn't provide a mechanism to recover data that has not been forwarded in real-time. However, the broker via session can ensure data is delivered once a forwarded platform reconnects. Nevertheless, that makes me think again that looking at Kafka is an option.

This is key. However, in our Kafka sprint we hit the paywalls quite quickly. Maybe we should take it up again now. If we do not go for something like Kafka, I am not sure how much we should rely on the message broker for persistence on the "way out" of our platform (which I agree is perfect for the "way in"). At the cost of not having a "fully available" system, I would say that if data is not available in real time, there is always a way to request it via the REST API.


A diagram showing the updated design after our conversation (@pral2a please correct me if something is wrong):

[image]

Note: with all this, we would remove push service too (and the redis channels for data-received and token-received too (TBC)).

oscgonfer commented 2 months ago

include the auth credentials that we are discussing here

oscgonfer commented 1 month ago

Notes: Any user can have token/username secrets, but we only forward the devices that have forwarding authorized

oscgonfer commented 1 month ago

@pral2a @timcowlishaw conclusions from today's session:

Short term

Next

Next (1)

device:
    data_policy:
        private: bool (False by default)
        precise_location: bool (False by default)
        allow_forwarding: bool (False by default)

Next (2)

Next (3)

Next (4)

oscgonfer commented 2 weeks ago

More on location accuracy and number of digits: https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude
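As a rough aid for the "number of digits" question, the size of one step in the last decimal place can be estimated from the approximation that one degree of latitude is about 111 km; the east-west size shrinks with the cosine of the latitude. The constant and the function name below are assumptions made for illustration.

```python
import math

# Approximate length of one degree of latitude, in metres.
METRES_PER_DEGREE = 111_000


def resolution_m(decimal_places: int, latitude_deg: float = 0.0):
    """Return the (north-south, east-west) size in metres of one step
    in the last stored decimal place at the given latitude."""
    step_deg = 10 ** -decimal_places
    north_south = step_deg * METRES_PER_DEGREE
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west
```

With this approximation, 3 decimal places correspond to steps of about 111 m north-south, and 5 decimal places to about 1.1 m.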

timcowlishaw commented 2 weeks ago

Notes on "precise_location" policy:

New devices will be registered as "imprecise", existing ones will all be migrated to "precise", and a note will be put in the forum about the new option.

We truncate all existing latlongs to 5dp.

timcowlishaw commented 2 weeks ago

Rather than truncating from 5dp to 3dp for "imprecise location" devices, we should add Gaussian noise in the last 2 dp.

timcowlishaw commented 2 weeks ago

This happens on write, so that we don't have the precise locations either (even though we know which are precise and which are not).
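A minimal sketch of the on-write behaviour discussed in these notes, under a few assumptions: coordinates are rounded (rather than strictly truncated) to 5dp, the noise standard deviation `sigma` is a placeholder value chosen so that it mostly affects the 4th and 5th decimal places, and the function name is hypothetical.

```python
import random
from typing import Optional, Tuple


def location_to_store(lat: float, lon: float, precise: bool,
                      rng: Optional[random.Random] = None,
                      sigma: float = 0.0005) -> Tuple[float, float]:
    """Return the coordinates that would be written to the database.

    Precise devices keep 5 decimal places. Imprecise devices get Gaussian
    noise added before rounding, so the exact position is never stored.
    `sigma` is an assumed placeholder, not an agreed value.
    """
    rng = rng or random.Random()
    if not precise:
        lat += rng.gauss(0, sigma)
        lon += rng.gauss(0, sigma)
    return round(lat, 5), round(lon, 5)
```

Because the noise is applied before the value is persisted, re-reading the device never reveals the original selection, which matches the intent of doing this on write.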

timcowlishaw commented 2 weeks ago

the data_policy itself is only shown for the users themselves and admins
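The visibility rule could look roughly like the sketch below, which drops data_policy from the serialised device unless the viewer is the owner or an admin. The keys `owner_id`, `id` and `is_admin`, and the function name, are assumptions for illustration and do not reflect the actual serializers.

```python
from typing import Optional


def visible_device(device: dict, viewer: Optional[dict]) -> dict:
    """Return the device as the given viewer should see it.

    `viewer` is None for anonymous requests. Only the owner and admins
    get the data_policy block; everyone else sees the rest of the device.
    """
    is_owner = viewer is not None and viewer.get("id") == device.get("owner_id")
    is_admin = viewer is not None and bool(viewer.get("is_admin"))
    if is_owner or is_admin:
        return dict(device)
    return {key: value for key, value in device.items() if key != "data_policy"}
```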

timcowlishaw commented 2 weeks ago

Addressed in #331