CoVital-Project / pulse-ox-data-collection-web-service

HTTPS API for receiving pulse oximetry from mobile clients
https://covital.org
GNU General Public License v3.0
5 stars 4 forks source link

Discussion: MD-collected data vs community-collected data #68

Open YoniSchirris opened 4 years ago

YoniSchirris commented 4 years ago

An issue that popped up since our repo is public, that anyone with the APK can upload data to the DB while putting in fake ground-truth values.

Currently, we have no way of controlling for this.

How can we have a sense of "authentication" without actual authentication?

Ideas:

MalcolmMielle commented 4 years ago

I think it will be important later to be able to separate the data collected from the MD from other data if we end up releasing th ecollection app to more people. I'm not sure which strategy would be best for this.

haggy commented 4 years ago

I've been thinking about this for an hour or so and I have a couple of questions.

  1. How does the access pattern differ between MD and community sourced data? For example Are you going to want to do things like consume the two datasets completely independent of each other (to train independent models, for example)?
  2. What does the onboarding process for clinical professionals look like?

I think the harder question to answer is "How to we restrict app packages based on who's using them?". If we're on-boarding clinical professionals manually, we can just have an obfuscated link for them (a link with ID/Hash values that are hard or impossible to just guess) and a "regular" link for anyone else (testers, random users that stumble on it, etc). We would then only share the obfuscated link with the pros.

The actual layout of the data in S3 is dependent on the answer to question 1 but we have a lot of flexibility. Without knowing the answer to question 1 I see two possible ways to isolate the incoming data streams:

A multi-bucket strategy:

covital-videos-pro-staging/year=<year>/month=<month>/day=<day>/survey-id=<id>/...
covital-videos-patient-staging/year=<year>/month=<month>/day=<day>/survey-id=<id>/...

Or a single bucket strategy:

covital-videos-staging/pro/year=<year>/month=<month>/day=<day>/survey-id=<id>/...
covital-videos-staging/patient/year=<year>/month=<month>/day=<day>/survey-id=<id>/...

I personally think a single bucket strategy is going to reduce complexity and increase flexibility so again without knowing the answer to question 1 I'd suggest that approach.

MalcolmMielle commented 4 years ago
  1. We might want to train independent models. I think what we might end up doing, for example, is train on mixed dataset but then run the test only against the "pro" dataset to be sure we get our results on valid data.
  2. We will distribute the apk to the MD by making available through a download link (the apk will be on a google drive).

Can we not simply request a signedURL for PRO and a different signedURL for others? for example in the openapi, we could simply send a bool true/false depending on who is using it and then use the single bucket strategy?

haggy commented 4 years ago

Can we not simply request a signedURL for PRO and a different signedURL for others

Yes at the API level that's fine but you still need to know at the app level what flag(s) to send on the API call to let the backend know what kind of user is inputting the data. I can easily add a role property to the API that defines something like clinical and patient but you're going to need an APK per "role" so that the app the MDs install know to pass a role of clinical instead of patient. Does that make sense?

kaeawc commented 4 years ago

We can definitely create and distribute multiple APKs with different versions and signatures. That's straightforward.

Another option could be that MDs must authenticate and other users are not required to. From talking with CJ it sounded like we were going with Auth0 to handle authentication - at that point we can ask MDs to contact us for validation before they're approved.

One of the questions that was raised was about API secrets - we can hide credentials, secrets, etc in things like Vault while still being completely open source. That's just one option, and maybe should be a separate discussion from this thread.

haggy commented 4 years ago

@kaeawc Authn is part of the "phase 2" (user-facing app) initiatives. There is additional work involved to get the authn flow right between all the apps and backend so it was decided that the data collection app would be "auth-less". Speed to launch (for the collection app) is critical because product is worried that doctors could soon be too overwhelmed to collect data using the app (which is understandable).

I think that having multiple APK's, one specifically for clinical pros and another for everyone else is good enough for the phase 1 collection app. If that sounds good then I can add a new property to the signed-upload API that allows the app to direct the signed uploads appropriately. Here is what Im thinking:

{
  "role": "clinical | patient"
}

Thoughts?

kaeawc commented 4 years ago

Ahk, gotcha. In that case the variants idea seems like the best option.

MalcolmMielle commented 4 years ago

Yeah the auth version sounds great but the variant is much more straight forward so let's go with that.

If we see a real use in having built that dataset and that the MD are really keen to help maybe we can have a v2 where the MDs have to auth and verify with us so that they can participate in building the dataset. But that would be for later and if the project worked.

Question: is it possible to "kill" the signed url upload for the app later? Let's say we want to collect data only for a certain time using this app, can we make it so it won't upload anymore when we decide we have enough data?

haggy commented 4 years ago

@MalcolmMielle we can disable the API(s) in various ways so that's not a problem.

OK so to summarize:

  1. We will build/package multiple APKs and distribute them based on who is using them (one for clinical pros, another for regular users/patients)
  2. The signed upload API will include a new field called role that the app will set depending on which build it is. Valid values are clinical and patient.

Do we all agree on that method?

@YoniSchirris Does this answer your question?

MalcolmMielle commented 4 years ago

Yes that sounds good! Although maye the role values should be "clinical" and "community" to disctinguish the two cases ?

haggy commented 4 years ago

Sure works for me 😄

YoniSchirris commented 4 years ago

Just to restate Malcolm's comment: Patients won't use this app. This is all for the data collection application. However, the data collected can be done by professional MDs, or by community members :)

The above sounds good. Thanks @haggy

MalcolmMielle commented 4 years ago

@haggy Is the role in the API already :)? I've pulled the latest version but I may be missing it from the doc.

haggy commented 4 years ago

@MalcolmMielle Not yet. Also Im going to change the property name from role to source. It fits its purpose much better. role also could conflict with future authorization functionality.

Property name: source Values: community | clinical

I'm going to add that in tonight and I'll let you know when it's done.

haggy commented 4 years ago

image

New key schema ^^. Local tests are going well so Im going to get this pushed to the sandbox.

haggy commented 4 years ago

FYI @MalcolmMielle Im having some weird issues with the build pipeline. Working through those now.

MalcolmMielle commented 4 years ago

Ok, should I hold off updating the app yet? I'll wait until you tell me it's ok to update the upload func :). Thanks for the great job

haggy commented 4 years ago

Yea hold off for now. Waiting for GH to fix things on their side. Will post here when released.

I'm assuming the app is built against staging?

MalcolmMielle commented 4 years ago

I usually pull master, compile the dart library and push it to this repo to use in the app.

haggy commented 4 years ago

@MalcolmMielle GH has fixed the issues and CI is back online. Im testing on Sandbox and will deploy to staging soon.

haggy commented 4 years ago

@MalcolmMielle @YoniSchirris Released to staging! https://guarded-crag-28391.herokuapp.com/api-docs/#/default/batch-signed-upload-req

Please test and let me know if you run into any issues :D

MalcolmMielle commented 4 years ago

Will do either tonight or tomorrow! :D