[CEP] "Single-use" data links

dimagi / commcare-hq

CommCareHQ is the server backend for CommCare, the world's largest platform for designing, managing, and deploying robust, offline-first, mobile applications to frontline workers worldwide

https://www.dimagi.com/open-source/

BSD 3-Clause "New" or "Revised" License

[CEP] "Single-use" data links #29123

Closed czue closed 3 years ago

czue commented 3 years ago

Abstract

This CEP introduces a new mechanism for interfacing with CommCare data: a "single-use" link.

Single-use links expose an API to:

View data from a specific set of cases
(optionally) update data in those cases
(tbd) submit non-case form data?
(tbd) create new cases
(tbd) view other data

Unlike other data mechanisms, single-use links do not require logging into CommCare. Thus they will need to rely on a few alternate measures for privacy and security:

Fine-grained control over what they can do/access.
Setting explicit expiry/validity dates.
Obfuscated non-guessable URL access.
Expiry after being used once.

Motivation

These links will be used for consumer-facing applications and integrations where sign-in is not an option. A longer treatment of the motivation can be found in this document.

Note that this CEP represents a subset of the goals of that document. There will likely be other CEPs in the future to fully meet the document's designed use case.

Specification

These will be implemented as a new model, which might look something like this:

import uuid

from django.contrib.auth.models import User
from django.db import models
from django.utils.translation import gettext_lazy as _


class SingleUseLink(models.Model):
    link_id = models.UUIDField(unique=True, db_index=True, default=uuid.uuid4)
    domain = models.CharField(max_length=126, null=False, db_index=True)
    # auto_now_add sets the timestamp once, at creation; auto_now would reset it on every save
    created_on = models.DateTimeField(auto_now_add=True)
    expires_on = models.DateTimeField(null=True, blank=True)
    allows_submission = models.BooleanField(default=False, help_text=_('If the link allows data submission'))
    submitting_user = models.ForeignKey(
        User, null=True, blank=True, on_delete=models.SET_NULL,
        help_text=_('For links that allow data submission, the user to be used to submit data.'),
    )
    # "visited" is informational; "used" marks the link as no longer valid
    is_visited = models.BooleanField(default=False)
    visited_on = models.DateTimeField(null=True, blank=True)
    is_used = models.BooleanField(default=False)
    used_on = models.DateTimeField(null=True, blank=True)


class CaseReference(models.Model):
    link = models.ForeignKey(SingleUseLink, on_delete=models.CASCADE, related_name='case_data')
    case_id = models.UUIDField()
    # in the future could also attach metadata

    class Meta:
        unique_together = ('link', 'case_id')

The link data will be accessible via an API. E.g. something like this:

GET /a/domain/api/v0.5/single-use-data/<link_id>

{ "cases": [ {case json}, {case json}] }

And can modify cases in similar fashion. E.g.:

POST /a/domain/api/v0.5/single-use-data/<link_id>

{ "cases": { "<case id 1>": {"p1": "v1"}, "<case id 2>": {"p1": "v2"} } }

After being used to modify data once, the link will no longer be usable for data retrieval or submission.
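
A minimal sketch of how a submission might be checked against the cases attached to a link. The function name `validate_submission` and its plain-Python arguments are hypothetical, chosen so the example runs without Django; the CEP does not define this logic.

```python
def validate_submission(payload, allowed_case_ids, allows_submission, is_used):
    """Return the per-case updates if the request is acceptable, else raise ValueError.

    Hypothetical helper: `allowed_case_ids` stands in for the CaseReference rows
    attached to the link.
    """
    if is_used:
        raise ValueError("link has already been used")
    if not allows_submission:
        raise ValueError("link does not allow data submission")
    updates = payload.get("cases")
    if not isinstance(updates, dict) or not updates:
        raise ValueError("payload must contain a non-empty 'cases' object")
    # Reject any case that was not explicitly attached to the link.
    unknown = set(updates) - set(allowed_case_ids)
    if unknown:
        raise ValueError(f"cases not attached to this link: {sorted(unknown)}")
    return updates
```

Rejecting the whole request when any case ID is unknown, rather than silently dropping it, is one possible choice; it keeps partial updates from going unnoticed.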

Impact on users

No impact at the moment. This is to enable new future workflows.

Impact on hosting

No impact at the moment, and none in the future for hosts that choose not to use the feature.

Backwards compatibility

Fully backwards compatible.

Release Timeline

No concrete timeline yet.

Open questions and issues

The exact API details (and requirements around what data needs to be available and modifiable) are still being worked out. Any input welcome. One random thought is whether the link should be thought of as its own API, versus like a single-use access token into other APIs. I think the former adds less complexity, but the latter would certainly be more flexible long-term.
What obfuscation scheme should we use? Do we need more than random UUIDs for these?
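
For reference on the obfuscation question: a version-4 UUID carries 122 random bits, and CPython's `uuid.uuid4()` draws them from `os.urandom`. The sketch below compares it with a longer token from the standard `secrets` module. The helper name `new_link_token` is hypothetical, and whether the extra length is needed is exactly the open question.

```python
import secrets
import uuid


def new_link_token(nbytes=32):
    """Return a URL-safe random token carrying nbytes * 8 bits of randomness."""
    return secrets.token_urlsafe(nbytes)


uuid_token = str(uuid.uuid4())  # 122 random bits, 36 characters
long_token = new_link_token()   # 256 random bits, 43 URL-safe characters
```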
Are there any other security concerns that need to be included at this stage?
Is "single-use" the right name? Some workflows may want multiple-use, so maybe just "data link"?

czue commented 3 years ago

Pinging some potentially interested parties: @proteusvacuum @snopoke @orangejenny @millerdev @ctsims @calellowitz @dannyroberts

snopoke commented 3 years ago

Thanks for the writeup @czue. How does this relate to the Case/Patient users CEP for CommonPass? Does it supersede it, is it a subset, or is it just tangential?

A few other thoughts:

"data link" seems better than 'single use link'
if multi-use is a thing then the visited / used details should be a many-to-one
should we record any other data along with the dates? IP, geolocation, user agent?
do the links need to support geofencing, i.e. can't be used outside of the USA?
What's the difference between 'visited' and 'used'?
Some kind of 2fa would be nice - like a token associated with the link which can be shared independently. I'd discussed this with @kaapstorm for the CommonPass integration - using some data known to the user to authenticate them like ID or SSN.
Consider the Proposal for case create/update API when designing the case API (also https://docs.google.com/document/d/1upMFrcItnyxVkD2NoBzpO6pTJRMncYVRVQcrvF5Gfj0/edit#heading=h.u4lr2prqtnkd) - (I see this is already linked in the spec doc but including here anyway)

millerdev commented 3 years ago

Expiry after being used once.

Are these links something that would be used directly by humans? It seems like a usability problem if someone loaded the link by accident (or accidentally hit refresh once the page was loaded), and then became permanently locked out. Would it work to allow access for a limited amount of time after the link is first retrieved, possibly even until expires_on?

czue commented 3 years ago

Thanks for the feedback so far!

@snopoke

How does this relate to the Case/Patient users CEP for CommonPass? Does it supersede it, is it a subset, or is it just tangential?

I chatted with @proteusvacuum about this last week and we concluded that while the two features are attempting to solve very similar-sounding problems, they actually don't have too much in common from a technical workflow perspective. If we ever decided that these anonymous links should be formally tied to a user account, or required some kind of oauth flow, then it might make sense to share more code.

"data link" seems better than 'single use link' if multi-use is thing then the visited / used details should be a many-to-one

:+1:

should we record any other data along with the dates? IP, geolocation, user agent? do the links need to support geofencing, i.e. can't be used outside of the USA? Some kind of 2fa would be nice - like a token associated with the link which can be shared independently. I'd discussed this with @kaapstorm for the CommonPass integration - using some data known to the user to authenticate them like ID or SSN.

Hopefully we don't need any of these to start, but I'll check on requirements to be sure. Certainly they could be added later if this is used in a highly-sensitive manner.

What's the difference between 'visited' and 'used'?

Sorry, I should have documented that better. I was thinking that there would be a "use" API (implicit with data submission) that marks the link as "used up". I think this also addresses the point raised by @millerdev, yeah? So basically it's valid until explicitly marked "used" (or based on some not yet known rule). I just added "visited" since it also seemed useful, but it may be YAGNI.

millerdev commented 3 years ago

I think this also addresses the point raised by @millerdev, yeah?

Sorry, I didn't follow how that addresses my concern. Am I understanding correctly that visited is just an informational flag, not used to enforce anything, and used is a flag to indicate "this link has been used and is no longer valid"? Were you implying that the used flag is only set after data has been submitted (e.g., a POST request), and the link can be accessed an unlimited number of times (e.g., via GET request, assuming it has not expired) up until data has been submitted?

czue commented 3 years ago

Were you implying that the used flag is only set after data has been submitted (e.g., a POST request), and the link can be accessed an unlimited number of times (e.g., via GET request, assuming it has not expired) up until data has been submitted?

@millerdev yep, exactly. The link stays valid until data is submitted, the caller explicitly submits a "this link was used" POST request, or the link expires.
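
The lifecycle agreed on here can be sketched as a small state object. This is an illustration only: `LinkState` and its method names are hypothetical and stand in for the Django model, so that the example runs on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class LinkState:
    """Hypothetical stand-in for the fields of SingleUseLink that drive validity."""
    expires_on: Optional[datetime] = None
    allows_submission: bool = False
    is_used: bool = False
    used_on: Optional[datetime] = None

    def is_active(self, now):
        expired = self.expires_on is not None and now >= self.expires_on
        return not self.is_used and not expired

    def handle_get(self, now):
        # Reads never consume the link, so an accidental refresh is harmless.
        return self.is_active(now)

    def mark_used(self, now):
        # Explicit "this link was used" request.
        if not self.is_active(now):
            return False
        self.is_used = True
        self.used_on = now
        return True

    def handle_submit(self, now):
        # A successful data submission implicitly marks the link as used.
        if not self.allows_submission:
            return False
        return self.mark_used(now)
```

With this shape, a GET can be repeated any number of times before `expires_on`, while the first accepted submission (or explicit "used" call) ends the link's life.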

millerdev commented 3 years ago

In that case, yes that does address my concern.

ctsims commented 3 years ago

I'm still a bit confused about how the single link structure fits into what an actual page lifecycle would be, and I'm not sure I operationally agree that opening up public webforms in the same web-app as CommCare will make sense if we are defining the form HTML / JS / CSS independently (I don't think there's a way to let people safely provide dynamic web page content and host it). I think we should be anchoring on either "Public Web App forms on CCHQ" or "Custom web forms on a separate web server." But it sounds like those might be independent questions from the concept of single-use permissions, which could exist as a separate web page built against this API.

I agree with @snopoke that any sort of single-use mechanism should come with both an identifier and a second authenticating component. That could be an ID provided to a user over the phone or something internal to the person's data like a date of birth. Either of those would drastically reduce the impact of any failure in managing these leases and keep them from becoming a honeypot, and would be in line with the practices of other actors in the current market (test result delivery and other points of contact sharing both have a GUID+private factor).
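
A rough sketch of what checking such a second factor could look like, using only the standard library. The helper names, the normalization rule, and the iteration count are assumptions for illustration. A date of birth has little entropy, so a real deployment would presumably also need attempt limiting, which is not shown.

```python
import hashlib
import hmac


def _digest(value, salt):
    # Normalize so that stray whitespace or letter case does not cause a mismatch.
    normalized = value.strip().lower().encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", normalized, salt, 100_000)


def make_factor(value, salt):
    """Digest to store alongside the link instead of the raw value (hypothetical helper)."""
    return _digest(value, salt)


def check_factor(supplied, stored_digest, salt):
    """Compare in constant time to avoid leaking how much of the value matched."""
    return hmac.compare_digest(_digest(supplied, salt), stored_digest)
```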

Three other points of feedback

1) I don't think the current workflow example is quite achievable at the proposed level of complexity, but I might be missing something. Contact sharing workflows don't just require some amount of information (or the ability to update or create) from one case, they require creating multiple child / extension cases, and may require additional context (like fixtures) to get the context to do so. In other systems those choices and details aren't hard-coded, they are pulled from separate data sources (or from our APIs) with dynamic web requests based on user input or case data before being submitted.

2) I think it might make sense to "pre-prep" more of what data is accessible than simply IDs to prevent escalation of privilege from allowing people to read more data than anticipated. An example of that might be building the payload independently and only providing the payload with the link, rather than performing the request live and filtering. That might be overkill, but it will also enable us to audit exactly what data was made available, which seems important for a high risk mechanism.

If we needed to do any translation of what data should be available, to, say, support #1, prepackaging may also make it possible for an authenticated actor to perform a more complex preselection of data based on their privileges, rather than passing "terms" of filtering-for-acceptability downwards. An example of that might be if the external form system needs 10 discrete fields from the case but if it would be dangerous to provide access to the full case model. Pre-packaged data could be filtered from the full dataset when the one-time-link is created, whereas a dynamic request based on the link would need to encode the full context for the filtering.

3) I wonder whether it would be worth having both a model for the existence of a one-time link (which encodes the data and permissions it has access to), and a separate API / model for an acquired lease on that portable permissions set, splitting up the concerns involved. The link models in that world could be immutable (good for their security and auditability), and having a lease model that eventually ends up with context in a local cookie after being acquired would let someone fairly safely perform complex interactions against that encapsulated set of data and also allow us to store some amount of context to help with security and auditing.
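
Points 2 and 3 can be illustrated together: an immutable link that carries a pre-packaged, filtered payload, plus a separate short-lived lease acquired against it. Everything here (`prepackage`, `Link`, `Lease`, the 15-minute default) is a hypothetical sketch under those assumptions, not a proposed implementation.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from types import MappingProxyType
from typing import Mapping


def prepackage(case, fields):
    """Copy only the listed properties when the link is created (read-only view)."""
    return MappingProxyType({name: case[name] for name in fields if name in case})


@dataclass(frozen=True)
class Link:
    link_id: str
    payload: Mapping  # fixed at creation time, so it can be audited later


@dataclass(frozen=True)
class Lease:
    token: str
    link_id: str
    expires_on: datetime


def acquire_lease(link, now, ttl=timedelta(minutes=15)):
    # The lease, not the link, holds the mutable "in use" context.
    return Lease(token=secrets.token_urlsafe(32), link_id=link.link_id, expires_on=now + ttl)


def read_with_lease(link, lease, now):
    if lease.link_id != link.link_id or now >= lease.expires_on:
        raise PermissionError("lease is invalid or expired")
    return dict(link.payload)
```

Because the payload is filtered once by an authenticated actor, the anonymous side never sees properties outside the chosen list, at the cost of the data going stale if the case changes afterwards.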

czue commented 3 years ago

@ctsims this is great feedback, thanks for weighing in.

I'm still a bit confused about how the single link structure fits into what an actual page lifecycle would be, and I'm not sure I operationally agree that opening up public webforms in the same web-app as CommCare will make sense if we are defining the form HTML / JS / CSS independently (I don't think there's a way to let people safely provide dynamic web page content and host it). I think we should be anchoring on either "Public Web App forms on CCHQ" or "Custom web forms on a separate web server." But it sounds like those might be independent questions from the concept of single-use permissions, which could exist as a separate web page built against this API.

Yeah, this piece is still very much in flux which is why I haven't proposed anything yet. There's a small section discussing it in the spec which includes the web apps option as a promising answer.

My hope behind the single-use link idea is that it could be used by web apps, an external site, or maybe some new thing in HQ that we may or may not build.

I don't think the current workflow example is quite achievable at the proposed level of complexity, but I might be missing something. Contact sharing workflows don't just require some amount of information (or the ability to update or create) from one case, they require creating multiple child / extension cases, and may require additional context (like fixtures) to get the context to do so. In other systems those choices and details aren't hard-coded, they are pulled from separate data sources (or from our APIs) with dynamic web requests based on user input or case data before being submitted.

Ah, this is great to know. Are these use cases documented somewhere? I was sent user-manual style docs, but it wasn't clear what the technical dependencies were....

I think it might make sense to "pre-prep" more of what data is accessible than simply IDs to prevent escalation of privilege from allowing people to read more data than anticipated. An example of that might be building the payload independently and only providing the payload with the link, rather than performing the request live and filtering. That might be overkill, but it will also enable us to audit exactly what data was made available, which seems important for a high risk mechanism.

I did consider this option but then got hung up on what to do if some of the data has since changed in the case (I guess you'd invalidate the link and make a new one?). Or we could solve it by "encoding the full context for the filtering" as you suggested - which starts to sound a lot like GraphQL. If we have to encode complex submission rules anyways as per your point 1 then maybe introducing dynamic property filtering isn't that much additional work.

I wonder whether it would be worth having both a model for the existence of a one-time link (which encodes the data and permissions it has access to), and a separate API / model for an acquired lease on that portable permissions set, splitting up the concerns involved. The link models in that world could be immutable (good for their security and auditability), and having a lease model that eventually ends up with context in a local cookie after being acquired would let someone fairly safely perform complex interactions against that encapsulated set of data and also allow us to store some amount of context to help with security and auditing.

This sounds like an interesting idea. Is there an analogous standard or system that you know of that uses this "leasing" model that I could read more about?

In any event, I'll need to circle back with @jjackson on requirements, as this all sounds more complicated than I originally understood.

ctsims commented 3 years ago

Ah, this is great to know. Are these use cases documented somewhere? I was sent user-manual style docs, but it wasn't clear what the technical dependencies were....

I'm actually not 100% sure precisely what the final designs turned out to be, so they might have been more limited.

I believe that what is submitted by the known real teams is a single payload that contains N new cases (one per contact), each of which has an index pointing to the patient, and those cases need access to the correct owner_id, which they currently retrieve from a list stored in a fixture with a separate API request and then choose based on region.
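
To make the shape of that payload concrete, here is a hedged sketch. The field names (`case_type`, `owner_id`, `indices`, `relationship`) and the helper `build_contact_cases` are illustrative assumptions, not the actual CommCare case API format; `owners_by_region` stands in for the list the teams currently fetch from a fixture.

```python
def build_contact_cases(patient_case_id, contacts, owners_by_region):
    """Build one new case per contact, each indexed to the patient case."""
    cases = []
    for contact in contacts:
        cases.append({
            "case_type": "contact",
            # Owner is chosen by region from the fixture-like lookup table.
            "owner_id": owners_by_region[contact["region"]],
            "properties": {"name": contact["name"]},
            "indices": {
                "patient": {
                    "case_id": patient_case_id,
                    "case_type": "patient",
                    "relationship": "extension",
                },
            },
        })
    return {"cases": cases}
```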

I did consider this option but then got hung up on what to do if some of the data has since changed in the case (I guess you'd invalidate the link and make a new one?).

I agree that this is a limitation, although none of the current Single-Use integrations actually rely on data that really changes from my understanding. For contact sharing, the only fields that are shared are the Patient DoB (used to validate the Patient) and their ID (used to create the response payload).

In some ways the limitation of not dynamically determining what data is shared feels a bit like it could be a 'good' limitation to me. It enables the known and safe use cases like delivering a test result or allowing contact entry without extending an unknown surface area.

This sounds like an interesting idea. Is there an analogous standard or system that you know of that uses this "leasing" model that I could read more about?

I think this is just a bit like applying authentication through a session token scheme.

A lab test website I used functioned a bit like this, I think. They sent a link which I think is analogous to your single-use model. Once I clicked on it they asked me for a DoB, and then sent an OTP with a short timeout to my email to complete authentication and enable the session.
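
The OTP step in a flow like that one could look roughly as follows. The function names, the six-digit format, and the five-minute timeout are assumptions for illustration; delivery by email and attempt limiting are left out.

```python
import hmac
import secrets
from datetime import datetime, timedelta, timezone


def issue_otp(now, ttl=timedelta(minutes=5)):
    """Return a six-digit one-time code and the moment it stops being valid."""
    code = f"{secrets.randbelow(10**6):06d}"
    return code, now + ttl


def verify_otp(supplied, code, expires_on, now):
    """Accept the code only before expiry, comparing in constant time."""
    if now >= expires_on:
        return False
    return hmac.compare_digest(supplied.encode("utf-8"), code.encode("utf-8"))
```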

czue commented 3 years ago

fyi I'm reworking this so will close for now.