[Planning] Allow serving user generated content from a separate domain

hexylena commented 8 years ago

xref Trello

natefoo commented 8 years ago

I'm told we can do this. =D

hexylena commented 8 years ago

I'm told that we can do this too. The main point of discussion seems to be security.

Background

This blog covers most of the important points for us, so I'll transclude a portion of it here https://security.googleblog.com/2012/08/content-hosting-for-modern-web.html

In the end, we reacted to this raft of content hosting problems by placing some of the high-risk content in separate, isolated web origins—most commonly *.googleusercontent.com. There, the “sandboxed” files pose virtually no threat to the applications themselves, or to google.com authentication cookies. For public content, that’s all we need: we may use random or user-specific subdomains, depending on the degree of isolation required between unrelated documents, but otherwise the solution just works.

The situation gets more interesting for non-public documents, however. Copying users’ normal authentication cookies to the “sandbox” domain would defeat the purpose. The natural alternative is to move the secret token used to confer access rights from the Cookie header to a value embedded in the URL, and make the token unique to every document instead of keeping it global.

While this solution eliminates many of the significant design flaws associated with HTTP cookies, it trades one imperfect authentication mechanism for another. In particular, it’s important to note there are more ways to accidentally leak a capability-bearing URL than there are to accidentally leak cookies; the most notable risk is disclosure through the Referer header for any document format capable of including external subresources or of linking to external sites.

In our applications, we take a risk-based approach. Generally speaking, we tend to use three strategies:

In higher risk situations (e.g. documents with elevated risk of URL disclosure), we may couple the URL token scheme with short-lived, document-specific cookies issued for specific subdomains of googleusercontent.com. This mechanism, known within Google as FileComp, relies on a range of attack mitigation strategies that are too disruptive for Google applications at large, but work well in this highly constrained use case.

In cases where the risk of leaks is limited but responsive access controls are preferable (e.g., embedded images), we may issue URLs bound to a specific user, or ones that expire quickly.

In low-risk scenarios, where usability requirements necessitate a more balanced approach, we may opt for globally valid, longer-lived URLs.

We have a very similar situation to Google here. We should consider our risk model and probably match it to theirs (they do have some smart people I'm told).

Additionally, they raise a very good point: embedded images. This will be an interesting point for us. (Update: it is a very boring point. Yay!)

Comparisons

I, personally, would put everything any of us work on squarely in the "low-risk" class. Google uses this class for user-generated documents with a public share link. That link is a globally valid, longer-lived URL. I see Galaxy history elements as being very much analogous to that. We could class things in a higher risk category, but that would likely involve serious tradeoffs in usability.

Another comp is to GitHub and their private repos. They provide an auth token on all "raw" links in private repositories.

As mentioned in Google's comments, we cannot just use the cookies to auth the user. We could use the URL as authenticating information (the user knows the history ID and the dataset ID, which are two random tokens). However, this suffers from the undesirable property that access cannot be changed: once a history is public, anyone knowing that URL would have access to it.

Galaxy Implementation

Ideally, Galaxies that do not enable this feature will be unaffected.
Galaxies implementing this feature will distinguish between two types of histories: public histories, and non-public (but possibly shared) histories.
- Public histories (and their datasets, and images embedded in HTML datasets) need no special access procedures.
- Non-public histories (and ...) will need special access procedures.

Usability

Before we go into possible solutions, we know that we're balancing usability with security here (like always), so what usability do we support?

Currently	Proposed Changes	Improvement?
View link does not grant access if the dataset is not public	View link will include a token and confer access	+
View links are not usable in scripts	View links will be usable in scripts	+
UGC may not include arbitrary HTML, JS	UGC can include these things with decreased risk	+
A view link shared between two people with access will grant access	(For certain implementations) a view link shared between two people with access may not necessarily grant access	-

Honestly, UGC on a separate domain is starting to sound like a pretty darn good deal ;) This means that whatever implementation we choose, we're already gaining a number of wins over the status quo. That might induce us to consider higher security implementations, given that they will still be advantageous over what we're doing now, and we would not be losing much by choosing them.

Implementation comments

We will need IDs for a subset of the documents, but not necessarily all, nor even many. For histories like Dan's 10k element histories, it may be counterproductive to generate tokens for all of these datasets if he views <1% of them.
This is very analogous to the GitHub case, and we should take inspiration from them. I opened a README in a private repo and the following happened:
- I navigated to https://github.com/erasche/X/raw/master/README.md
- I received a 302 with Location: https://raw.githubusercontent.com/erasche/X/master/README.md?token=AA...
I believe we could do the exact same thing. Hook into the view dataset route and:
- if the dataset is private, redirect with a token
- otherwise redirect to a token-less URL.
- This would play nicely with the embedded images in HTML datasets issue.
Tokens are for read-only access, so don't necessarily demand some of the stricter security policies.
We could have a single token per file, just adding another column to the dataset database. Super easy.
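
To make the redirect idea concrete, here is a minimal sketch of the decision the view route would make. Everything named here is an assumption for illustration only: `UGC_HOST`, the URL layout, the `Dataset` class, and the `token` attribute (standing in for the extra database column) are hypothetical and are not Galaxy's actual model or routes.

```python
import secrets
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

# Hypothetical separate domain for user generated content.
UGC_HOST = "https://ugc.example.org"


@dataclass
class Dataset:
    """Stand-in for a dataset row; not Galaxy's real model."""
    id: str
    public: bool
    token: Optional[str] = None  # the proposed extra column, NULL until needed


def view_redirect_url(dataset: Dataset) -> str:
    """Return the Location for the 302 issued by the view route."""
    url = f"{UGC_HOST}/datasets/{dataset.id}/display"
    if dataset.public:
        # Public datasets need no special access procedure.
        return url
    if dataset.token is None:
        # Generate lazily, so a 10k element history only gets tokens
        # for the datasets that are actually viewed.
        dataset.token = secrets.token_urlsafe(32)
    return url + "?" + urlencode({"token": dataset.token})
```

Because the token is only created on first view, datasets that are never opened never get one.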

The following comment was made by @natefoo in IRC.

I think the separating domains problem is stuck on whether we're comfortable with a permanent(ish) key to access private data should be part of every view/download link

which is a very valid thing to consider.

For GitHub, the token gets you permanent access to that version of that file.

A token for a specific file at a specific commit will always work.
A token for a file on the tip of a branch will work as long as that file is not changed.
Changing a file forces a new token to be generated.
The old token will return the old version of the file, while the new token returns the new version of the file. It is unknown how long this behaviour persists for.
After a short while (5 min or so) the cache seems to have expired. Both tokens are now valid for the same file.

That is a less appealing model given that our datasets never change, so we should explore more of the options and, more importantly, our threat model.

Threats

The most plausible threat is that a user will accidentally publish a secret token somewhere and then be unable to revoke that.

Permanent Tokens

Easy. So easy.

It is interesting to note that GitHub does not consider this a significant enough threat to defend against it. Do we need to re-evaluate our threat model?

Further, we could use "permanent tokens" but allow manually resetting them in the case of security breaches. A "reset token" button could be added to the pencil icon menu (do these things have names?) for individual datasets, and we could have a history-level reset that functions similarly.

This would seemingly have a nice balance between usability (tokens don't randomly stop working, they stop working at defined events) and security (tokens can stop working at user-defined events). Additionally, the history/dataset level resets would be relatively simple to implement: just NULLify any token associated with a dataset.
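
A rough sketch of what the dataset-level and history-level resets could look like, assuming one nullable token per dataset. The `Dataset` and `History` classes and all function names are hypothetical stand-ins, not Galaxy code; a real version would be a database update rather than a loop over objects.

```python
import hmac
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    """Hypothetical dataset with a single permanent (but resettable) token."""
    id: str
    token: Optional[str] = None


@dataclass
class History:
    """Hypothetical history holding its datasets."""
    id: str
    datasets: List[Dataset] = field(default_factory=list)


def reset_dataset_token(dataset: Dataset) -> None:
    """NULLify the token; a new one would be generated on the next view."""
    dataset.token = None


def reset_history_tokens(history: History) -> int:
    """Reset every token in the history and return how many were cleared."""
    cleared = 0
    for dataset in history.datasets:
        if dataset.token is not None:
            reset_dataset_token(dataset)
            cleared += 1
    return cleared


def token_grants_access(dataset: Dataset, presented: str) -> bool:
    """A NULL token never matches; otherwise compare in constant time."""
    if dataset.token is None:
        return False
    return hmac.compare_digest(dataset.token.encode(), presented.encode())
```

After a reset, any previously leaked URL stops working because the stored token it is compared against is gone.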

But what about collaborators? Anyone the user has shared this history with will need to have an access token to access the dataset. Either we use the owner's access token, or we have per-user access tokens.

If we use the owner's access token, let us assume one of the collaborators is evil, and publishes all of the access tokens to their friends. The friends, without proper accounts or authentication, could access the datasets. The owner must then find out about this and revoke the tokens in order to prevent such an attack.

If we use per-user access tokens, our database model becomes more complex, and additionally this attack is still possible. This is a mess. Let's move on.

Non-permanent Tokens

Here we have to define events during which tokens are reset, or choose to reset on every access. Let's assume that we reset on access.

We do not encourage sharing links to view the dataset, instead we encourage sharing histories.
We will sacrifice some of the usability wins we got with permanent tokens (could use URL in a script without bioblend), but I think this will actually be pretty equivalent in terms of UX.
No additional training needs to occur, no one needs to understand the realities of security breaches or accidentally shared links, because they are valid for a single use only.

I am told that Galaxy has code for single-use tokens in the codebase already, so this may not be prohibitive to implement.
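
For illustration, reset-on-access could work roughly like the sketch below. This is not the existing single-use token code in Galaxy (I have not looked at it); `SingleUseTokens` and its in-memory dict are assumptions made up for the example, and a real implementation would need persistent storage.

```python
import secrets
from typing import Dict


class SingleUseTokens:
    """Hypothetical in-memory store of tokens that are valid exactly once."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}  # token -> dataset id

    def issue(self, dataset_id: str) -> str:
        """Called by the view route just before redirecting to the UGC domain."""
        token = secrets.token_urlsafe(32)
        self._pending[token] = dataset_id
        return token

    def redeem(self, token: str, dataset_id: str) -> bool:
        """Called on the UGC domain; the token is consumed on any attempt."""
        # pop() removes the token even when the dataset id does not match,
        # so a token can never be presented a second time.
        return self._pending.pop(token, None) == dataset_id
```

A URL copied out of the address bar after the redirect is already spent, which is why an accidentally published link carries no risk under this scheme.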

Implementation Conclusions

If we re-examine @natefoo's statement there's an interesting clause:

I think the separating domains problem is stuck on whether we're comfortable with a permanent(ish) key to access private data should be part of every view/download link

If we follow GitHub's implementation, this is not the case. Due to the redirection, the user doesn't see the token unless they specifically request to view a file. We can remove that clause, and consider ourselves safe to use a permanent token for URL access. However, we can go further and apply single-use tokens.

Noting this clause, I probably could have started my discussion here, but y'all get my entire thought process instead :wink:

Author's Conclusions

Attribute	Permanent	Single Use	Winner
View links can be used in scripts	Possible	Impossible	Neither, people should use bioblend
Resetting Tokens	User education must be done, menu items added to dataset + history	Nothing required	Single Use
Risk from publishing a URL	Significant. Unless the owner knows that this URL has been leaked, and knows how to reset it, and cares, that dataset is essentially permanently public.	Non-existent	Single Use

Originally I was very much in favour of permanent access tokens, but after writing this, I'm strongly in favour of single-use tokens.

natefoo commented 8 years ago

Thank you for this analysis, it is amaaaaaaaazing, and thanks for leaving your thought process. I'm sure people will have questions about whatever we implement, and they can be referred to this.

Like you, I started out thinking permanent revocable tokens were the way to go, but I think you've swung me to single-use tokens (the redirect helps a lot).

hexylena commented 8 years ago

awwwwwww, thanks <3

yeah, I think the redirect solves nearly all of the potential issues with adding this feature by making it completely transparent to end users and generally a zero impact change if they're doing everything how they're supposed to.
