e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

user requirements related to "self data" #364

Open PatGendre opened 5 years ago

PatGendre commented 5 years ago

For the record, I put here these requirements related to "self data". They can be linked to the current work on a new architecture with built-in encryption for addressing privacy. See https://github.com/njriasan/e-mission-docs/blob/master/docs/future_work/NewArchitecture.md and issue #330

shankari commented 5 years ago

Couple of quick comments for particular requirements

he/she should also be able to delete part of his/her data

@jf87 is also interested in this because of the GDPR requirements. The main challenge is reconciling this with the current assumption that all input data is read-only, so all results are reproducible for all time. The read-only assumption is actually fairly standard for analysis based on datasets.

@jf87 have you seen ML work that addresses relaxing this requirement? I will also do a quick search and see if I can find something. I think that there has been some work on detecting changes and recomputing only the related results. I have opened https://github.com/e-mission/e-mission-docs/issues/366 for the more detailed discussion.

shankari commented 5 years ago

there should be very clear user consent terms (I hope we can find some) but obviously this depends on each particular application

Yes, the expectation is that every project can have its own consent terms; the consent terms go into intro/consent.html. The standard e-mission consent terms can serve as an example (https://e-mission.eecs.berkeley.edu/consent); if you can indicate what is unclear about them, we can modify them and check a boilerplate consent into the docs that projects can re-use.

@ipsita0012 had some thoughts from their deployment.

shankari commented 5 years ago

going further on the same point, e-mission is a framework that can be reused to build several applications; I believe its design should be generic enough to accommodate a large variety of use cases, from pure self-data apps to apps authorising aggregate studies, including crowdsourcing and data sharing apps

This is definitely the goal, and the framework has been used for travel surveys, behavior change, and crowdsourcing (hopefully launching today). Do you have concrete suggestions on how to make it more generic? As the use cases submit their changes, the need for a plugin-based architecture for both the phone and the server has become increasingly clear. Is that what you had in mind?

shankari commented 5 years ago

there should be a clear separation between the individual data and the aggregate data; in my view, the aggregate data should be in a separate database, and provided to the app with a clearly separate API, even if the data is not encrypted and no technical barrier enforces the separation; in some ways, we can consider that the aggregate data functions are another module of the e-mission "suite", but it should be on a par with third-party modules, conceptually distinct from the "core e-mission" functionalities

This is already true at a conceptual level.

The recommended way to access e-mission data is through the timeseries interface emission.storage.timeseries.abstract_timeseries (https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/abstract_timeseries.py), NOT directly through the database[1].

The timeseries interface has two options - you can get data for an individual user (get_time_series(user_id)) or for the aggregate (get_aggregate_time_series()). Algorithms that work on aggregate data should use the aggregate timeseries, algorithms that work on a single user should use the regular time series.
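To make the separation concrete, here is an illustrative toy sketch (not the actual e-mission code) of the pattern described above: callers obtain either a per-user view or an aggregate view through the timeseries abstraction, and never touch the database directly. The `find_entries` method, the class layout, and the sample data are all hypothetical; only the two accessor names come from the interface described above.

```python
from typing import Optional

class TimeSeries:
    """Toy in-memory stand-in for the timeseries abstraction."""

    # Hypothetical sample entries; the real data lives in the database.
    _entries = [
        {"user_id": "u1", "key": "background/location", "speed": 5.0},
        {"user_id": "u2", "key": "background/location", "speed": 7.0},
    ]

    def __init__(self, user_id: Optional[str]):
        # None means "aggregate view across all users"
        self.user_id = user_id

    @staticmethod
    def get_time_series(user_id):
        # Single-user view, analogous to get_time_series(user_id)
        return TimeSeries(user_id)

    @staticmethod
    def get_aggregate_time_series():
        # Cross-user view, analogous to get_aggregate_time_series()
        return TimeSeries(None)

    def find_entries(self, key):
        # The same query code works for both views; only the scope differs
        return [e for e in self._entries
                if e["key"] == key
                and (self.user_id is None or e["user_id"] == self.user_id)]

user_ts = TimeSeries.get_time_series("u1")
agg_ts = TimeSeries.get_aggregate_time_series()
print(len(user_ts.find_entries("background/location")))  # 1
print(len(agg_ts.find_entries("background/location")))   # 2
```

Because both views expose the same query surface, an algorithm written against the abstraction can be pointed at either scope without code changes, which is exactly what makes the database an implementation detail.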

As long as all code follows these conceptual guidelines, the rest of it is implementation detail. Although I don't think that having a separate aggregate DB is flexible enough (which ranges would you use for the stored aggregations?), you could certainly cache some results in a separate aggregate database as long as everybody followed the abstractions. And of course, we could choose to switch to a dedicated timeseries database, or give people a choice of databases (e.g. use embedded SQLite for lighter-weight deployments) in the future.

The abstractions are what is important. The database is implementation. People should not box themselves into a corner by using the implementation.

[1] I have no idea why every project just wants to access the database directly instead of using the recommended methods in the Timeseries_sample; suggestions for clarifying this are welcome.

shankari commented 5 years ago

as expressed in this architecture page, I wish the user could refuse to have his/her data aggregated and analysed along with others'. The "core" functionality should be self mobility data and control over sharing it.

Yup, will be in the new architecture!

shankari commented 5 years ago

one criticism of the new architecture is that it is in a way "closed", because the application has to use a particular technology (whether it be Graphene or another). In the overview, it says "the server", as if there were only one server. I believe there are many use cases (including crowdsourcing as in OpenTraffic or Posmo) that do not require a sophisticated and secure encryption mechanism and that could possibly be developed without greatly impacting the current architecture. But as you said, this is research, and it is for the longer term, and it could potentially also meet the requirements I express.

The initial implementation of the architecture will use docker without graphene. But using docker without graphene has serious limitations if the cloud provider is compromised, so secure execution is really the long-term goal and solution.

I am sure it is possible to implement an ad-hoc crowdsourcing solution for certain specific kinds of analysis (e.g. only automobile speeds as in OpenTraffic), but that is not very interesting to me, because it does not fit my overall vision of longitudinal collection of end-to-end data across all modes. It seems like you would not even need to store data in that case. You could theoretically compute the average speed directly on the phone and send it to a server, but you would need to ensure that sequences of speeds from the same user cannot be correlated. There has been prior work on that; IIRC vPriv (https://people.eecs.berkeley.edu/~raluca/vpriv.pdf) uses that model. I am not aware of an open source implementation, though. I don't think any of the work out of Hari's lab at MIT is open source.
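The "compute on the phone, upload only the aggregate" idea mentioned above can be sketched in a few lines. This is a hypothetical illustration (the function name and payload shape are made up); note that it only shows the local reduction step, not the unlinkability guarantee, which as noted is the hard part.

```python
from statistics import mean

def summarize_trip(speeds_m_s):
    """Reduce a raw per-trip speed trace (m/s) to the single summary
    value that would leave the device; the raw trace is never uploaded."""
    return {"avg_speed_m_s": round(mean(speeds_m_s), 2)}

# Raw trace stays on the phone; only the summary dict is sent.
payload = summarize_trip([4.2, 5.1, 6.0, 5.5])
print(payload)  # {'avg_speed_m_s': 5.2}
```

Even with this reduction, a server that can link successive summaries to the same user can reconstruct a trajectory, which is why schemes like vPriv focus on breaking that correlation rather than on the aggregation itself.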

shankari commented 5 years ago

Also, wrt crowdsourcing, this article made a big splash when it came out https://www.technologyreview.com/s/523346/how-to-track-vehicles-using-speed-data-alone/

PatGendre commented 5 years ago

Thanks for your remarks!

the framework has been used for travel surveys, behavior change modification and crowdsourcing (hopefully launching today). Do you have concrete suggestions on how to make it more generic? As the use cases submit their changes, the need for a plugin-based architecture for both the phone and the server has become increasingly clear. Is that what you had in mind?

the aggregate data functions should be conceptually distinct from the "core e-mission" functionalities : This is already true at a conceptual level.

it is possible to implement an ad-hoc crowdsourcing solution for certain specific kinds of analysis (e.g. only automobile speeds as in OpenTraffic) but that is not very interesting to me, because it does not fit my overall vision of longitudinal collection of end to end data across all modes

PatGendre commented 5 years ago

There are also documents in French, but this document in English from the UK about personal data might interest you: https://www.digicatapult.org.uk/news-and-views/publication/pdr-report/