e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

user requirements related to "self data" #364

Open PatGendre opened 5 years ago

PatGendre commented 5 years ago

For the record, I put here these requirements related to "self data". They can be linked to the current work on a new architecture with built-in encryption for addressing privacy. See https://github.com/njriasan/e-mission-docs/blob/master/docs/future_work/NewArchitecture.md and issue #330

shankari commented 5 years ago

Couple of quick comments for particular requirements

he/she should also be able to delete part of his/her data

@jf87 is also interested in this because of the GDPR requirements. The main challenge is reconciling this with the current assumption that all input data is read-only, so all results are reproducible for all time. The read-only assumption is actually fairly standard for analysis based on datasets.

@jf87 have you seen ML work that addresses relaxing this requirement? I will also do a quick search and see if I can find something. I think that there has been some work on detecting changes and recomputing only the related results. I have opened https://github.com/e-mission/e-mission-docs/issues/366 for the more detailed discussion.

shankari commented 5 years ago

there should be very clear user consent terms (I hope we can find some) but obviously this depends on each particular application

Yes, the expectation is that every project can have its own consent terms; the consent terms go into intro/consent.html. The standard e-mission consent terms can serve as an example (https://e-mission.eecs.berkeley.edu/consent); if you can indicate what is unclear about them, we can modify them and check a boilerplate consent into the docs that projects can re-use.

@ipsita0012 had some thoughts from their deployment.

shankari commented 5 years ago

going further on the same point, e-mission is a framework that can be reused to build several applications; I believe its design should be generic enough to accommodate a large variety of use cases, from pure self-data apps to apps authorising aggregate studies, including crowdsourcing and data sharing apps

This is definitely the goal, and the framework has been used for travel surveys, behavior change, and crowdsourcing (hopefully launching today). Do you have concrete suggestions on how to make it more generic? As the use cases submit their changes, the need for a plugin-based architecture for both the phone and the server has become increasingly clear. Is that what you had in mind?

shankari commented 5 years ago

there should be a clear separation between the individual data and the aggregate data; in my view, the aggregate data should be in a separate database, and provided to the app with a clearly separate API, even if the data is not encrypted and no technical barrier enforces the separation; in some ways, we can consider that the aggregate data functions are another module of the e-mission "suite", but it should be on a par with third-party modules, conceptually distinct from the "core e-mission" functionalities

This is already true at a conceptual level.

The recommended way to access e-mission data is through the timeseries interface emission.storage.timeseries.abstract_timeseries (https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/abstract_timeseries.py), NOT directly through the database[1].

The timeseries interface has two options - you can get data for an individual user (get_time_series(user_id)) or for the aggregate (get_aggregate_time_series()). Algorithms that work on aggregate data should use the aggregate timeseries, algorithms that work on a single user should use the regular time series.
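To make the separation concrete, here is an illustrative toy sketch (not the actual e-mission code) of the pattern described above: callers obtain either a per-user view or an aggregate view through the timeseries abstraction, and never touch the database directly. The `find_entries` method, the class layout, and the sample data are all hypothetical; only the two accessor names come from the interface described above.

```python
from typing import Optional

class TimeSeries:
    """Toy in-memory stand-in for the timeseries abstraction."""

    # Hypothetical sample entries; the real data lives in the database.
    _entries = [
        {"user_id": "u1", "key": "background/location", "speed": 5.0},
        {"user_id": "u2", "key": "background/location", "speed": 7.0},
    ]

    def __init__(self, user_id: Optional[str]):
        # None means "aggregate view across all users"
        self.user_id = user_id

    @staticmethod
    def get_time_series(user_id):
        # Single-user view, analogous to get_time_series(user_id)
        return TimeSeries(user_id)

    @staticmethod
    def get_aggregate_time_series():
        # Cross-user view, analogous to get_aggregate_time_series()
        return TimeSeries(None)

    def find_entries(self, key):
        # The same query code works for both views; only the scope differs
        return [e for e in self._entries
                if e["key"] == key
                and (self.user_id is None or e["user_id"] == self.user_id)]

user_ts = TimeSeries.get_time_series("u1")
agg_ts = TimeSeries.get_aggregate_time_series()
print(len(user_ts.find_entries("background/location")))  # 1
print(len(agg_ts.find_entries("background/location")))   # 2
```

Because both views expose the same query surface, an algorithm written against the abstraction can be pointed at either scope without code changes, which is exactly what makes the database an implementation detail.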

As long as all code follows these conceptual guidelines, the rest of it is implementation detail. Although I don't think that having a separate aggregate DB is flexible enough (which ranges would you use for the stored aggregations?), you could certainly cache some results in a separate aggregate database as long as everybody followed the abstractions. And of course, we could choose to switch to a dedicated timeseries database, or give people a choice of databases (e.g. use embedded SQLite for lighter-weight deployments) in the future.

The abstractions are what is important. The database is implementation. People should not box themselves into a corner by using the implementation.

[1] I have no idea why every project just wants to access the database directly instead of using the recommended methods in the Timeseries_sample; suggestions for clarifying this are welcome.

shankari commented 5 years ago

as expressed in this architecture page, I wish the user could refuse to have his/her data aggregated and analysed along with others'. The "core" functionality should be self mobility data and control over sharing it.

Yup, will be in the new architecture!

shankari commented 5 years ago

one criticism of the new architecture is that it is in a way "closed", because the application has to use a particular technology (whether it be Graphene or another). In the overview, it says "the server", as if there were only one server. I believe there are many use cases (including crowdsourcing as in OpenTraffic or Posmo) that do not require a sophisticated and secure encryption mechanism and that could possibly be developed without greatly impacting the current architecture. But as you said, this is research, and it is for the longer term, and it could potentially also meet the requirements I express.

The initial implementation of the architecture will use docker without graphene. But using docker without graphene has serious limitations if the cloud provider is compromised, so secure execution is really the long-term goal and solution.

I am sure it is possible to implement an ad-hoc crowdsourcing solution for certain specific kinds of analysis (e.g. only automobile speeds as in OpenTraffic), but that is not very interesting to me, because it does not fit my overall vision of longitudinal collection of end-to-end data across all modes. It seems like you would not even need to store data in that case. You could theoretically compute the average speed directly on the phone and send it to a server, but you would need to ensure that sequences of speeds from the same user cannot be correlated. There has been prior work on that; IIRC vPriv (https://people.eecs.berkeley.edu/~raluca/vpriv.pdf) uses that model. I am not aware of an open source implementation, though. I don't think any of the work out of Hari's lab at MIT is open source.
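The "compute on the phone, upload only the aggregate" idea mentioned above can be sketched in a few lines. This is a hypothetical illustration (the function name and payload shape are made up); note that it only shows the local reduction step, not the unlinkability guarantee, which as noted is the hard part.

```python
from statistics import mean

def summarize_trip(speeds_m_s):
    """Reduce a raw per-trip speed trace (m/s) to the single summary
    value that would leave the device; the raw trace is never uploaded."""
    return {"avg_speed_m_s": round(mean(speeds_m_s), 2)}

# Raw trace stays on the phone; only the summary dict is sent.
payload = summarize_trip([4.2, 5.1, 6.0, 5.5])
print(payload)  # {'avg_speed_m_s': 5.2}
```

Even with this reduction, a server that can link successive summaries to the same user can reconstruct a trajectory, which is why schemes like vPriv focus on breaking that correlation rather than on the aggregation itself.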

shankari commented 5 years ago

Also, wrt crowdsourcing, this article made a big splash when it came out https://www.technologyreview.com/s/523346/how-to-track-vehicles-using-speed-data-alone/

PatGendre commented 5 years ago

Thanks for your remarks!

the framework has been used for travel surveys, behavior change modification and crowdsourcing (hopefully launching today). Do you have concrete suggestions on how to make it more generic? As the use cases submit their changes, the need for a plugin-based architecture for both the phone and the server has become increasingly clear. Is that what you had in mind?

the aggregate data functions should be conceptually distinct from the "core e-mission" functionalities : This is already true at a conceptual level.

it is possible to implement an ad-hoc crowdsourcing solution for certain specific kinds of analysis (e.g. only automobile speeds as in OpenTraffic) but that is not very interesting to me, because it does not fit my overall vision of longitudinal collection of end to end data across all modes

PatGendre commented 5 years ago

There are also documents in French, but this document in English from the UK about personal data might interest you: https://www.digicatapult.org.uk/news-and-views/publication/pdr-report/