galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

[Planning] Allow serving user generated content from a separate domain #1932

Open hexylena opened 8 years ago

hexylena commented 8 years ago

xref Trello

natefoo commented 8 years ago

I'm told we can do this. =D

hexylena commented 8 years ago

I'm told that we can do this too. The main point of discussion seems to be security.

Background

This blog covers most of the important points for us, so I'll transclude a portion of it here https://security.googleblog.com/2012/08/content-hosting-for-modern-web.html

In the end, we reacted to this raft of content hosting problems by placing some of the high-risk content in separate, isolated web origins—most commonly *.googleusercontent.com. There, the “sandboxed” files pose virtually no threat to the applications themselves, or to google.com authentication cookies. For public content, that’s all we need: we may use random or user-specific subdomains, depending on the degree of isolation required between unrelated documents, but otherwise the solution just works.

The situation gets more interesting for non-public documents, however. Copying users’ normal authentication cookies to the “sandbox” domain would defeat the purpose. The natural alternative is to move the secret token used to confer access rights from the Cookie header to a value embedded in the URL, and make the token unique to every document instead of keeping it global.

While this solution eliminates many of the significant design flaws associated with HTTP cookies, it trades one imperfect authentication mechanism for another. In particular, it’s important to note there are more ways to accidentally leak a capability-bearing URL than there are to accidentally leak cookies; the most notable risk is disclosure through the Referer header for any document format capable of including external subresources or of linking to external sites.

In our applications, we take a risk-based approach. Generally speaking, we tend to use three strategies:

  • In higher risk situations (e.g. documents with elevated risk of URL disclosure), we may couple the URL token scheme with short-lived, document-specific cookies issued for specific subdomains of googleusercontent.com. This mechanism, known within Google as FileComp, relies on a range of attack mitigation strategies that are too disruptive for Google applications at large, but work well in this highly constrained use case.
  • In cases where the risk of leaks is limited but responsive access controls are preferable (e.g., embedded images), we may issue URLs bound to a specific user, or ones that expire quickly.
  • In low-risk scenarios, where usability requirements necessitate a more balanced approach, we may opt for globally valid, longer-lived URLs.

We have a very similar situation to Google here. We should consider our risk model and probably match it to theirs (they do have some smart people, I'm told).
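To make the middle strategy from the quoted list concrete (URLs bound to a specific user, or ones that expire quickly), here is a minimal sketch using an HMAC signature. Everything in it is an assumption for illustration: the secret, the function names, and the `dataset_id:user_id:expires_at` message layout are hypothetical and are not taken from Galaxy or from Google's implementation.

```python
import hashlib
import hmac
import time

# Hypothetical server-side secret; a real deployment would load this from config.
SECRET = b"replace-me-with-a-real-secret"


def _sign(dataset_id, user_id, expires_at):
    # Assumes the IDs never contain ":" so the message is unambiguous.
    message = f"{dataset_id}:{user_id}:{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def make_link(dataset_id, user_id, ttl=60, now=None):
    """Return (expires_at, signature) to embed in a view URL."""
    now = time.time() if now is None else now
    expires_at = int(now) + ttl
    return expires_at, _sign(dataset_id, user_id, expires_at)


def verify(dataset_id, user_id, expires_at, signature, now=None):
    """Accept only an unexpired link whose signature matches this user and dataset."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    expected = _sign(dataset_id, user_id, expires_at)
    # Compare as bytes in constant time.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

A link produced this way stops working on its own once `expires_at` passes, and it does not validate for a different user, at the cost of not being usable as a long-lived share link.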

Additionally they raise a very good point, embedded images. This will be an interesting point for us. (Update: it is a very boring point. Yay!)

Comparisons

I, personally, would put everything any of us work on squarely in the "low-risk" class. Google uses this class for user-generated documents with a public share link. That link is a globally valid, longer-lived URL. I see Galaxy history elements as being very much analogous to that. We could class things in a higher risk category, but that would likely involve serious tradeoffs in usability.

Another comparison is GitHub and their private repos. They provide an auth token on all "raw" links in private repositories.

As mentioned in Google's comments, we cannot just use the cookies to auth the user. We could use the URL as authenticating information (the user knows the history ID and the dataset ID, which are two random tokens). However, this suffers from the undesirable property that access cannot be changed: once a history is public, anyone knowing that URL would have access to it.

Galaxy Implementation

Usability

Before we go into possible solutions, we know that we're balancing usability with security here (like always), so what usability do we support?

| Currently | Proposed Changes | Improvement? |
| --- | --- | --- |
| View link does not grant access if the dataset is not public | View link will include a token and confer access | + |
| View links are not usable in scripts | View links will be usable in scripts | + |
| UGC may not include arbitrary HTML, JS | UGC can include these things with decreased risk | + |
| A view link shared between two people with access will grant access | (For certain implementations) a view link shared between two people with access may not necessarily grant access | - |

Honestly, UGC on a separate domain is starting to sound like a pretty darn good deal ;) This means that whatever implementation we choose, we're already gaining a number of wins over the status quo. That might induce us to consider higher security implementations, given that they will still be advantageous over what we're doing now, and we would not be losing much by choosing them.

Implementation comments

The following comment was made by @natefoo in IRC.

I think the separating domains problem is stuck on whether we're comfortable with a permanent(ish) key to access private data should be part of every view/download link

which is a very valid thing to consider.

For GitHub, the token gets you permanent access to that version of that file.

That is a less appealing model given that our datasets never change, so we should explore more of the options and, more importantly, our threat model.

Threats

The most plausible threat is that a user will accidentally publish a secret token somewhere and then be unable to revoke that.

Permanent Tokens

Easy. So easy.

It is interesting to note that GitHub does not consider this a significant enough threat to defend against it. Do we need to re-evaluate our threat model?

Further, we could use "permanent tokens" but allow manually resetting them in the case of security breaches. A "reset token" button could be added to the pencil icon menu (do these things have names?) for individual datasets, and we could have a history-level reset that functions similarly.

This would seemingly have a nice balance between usability (tokens don't randomly stop working, they stop working at defined events) and security (token can stop working at user defined events). Additionally the history/dataset level resets would be relatively simple to implement, just NULLify any token associated with a dataset.
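A rough sketch of the "permanent but resettable" idea described above, with the dataset-level and history-level resets implemented by nulling the stored token. This is an in-memory stand-in, not Galaxy's data model: the class name, method names, and storage layout are all hypothetical.

```python
import hmac
import secrets


class PermanentTokenStore:
    """Hypothetical in-memory stand-in for a per-dataset token column."""

    def __init__(self):
        # dataset_id -> (history_id, token or None)
        self._rows = {}

    def add_dataset(self, history_id, dataset_id):
        self._rows[dataset_id] = (history_id, None)

    def token_for(self, dataset_id):
        """Return the current token, minting one lazily if it was reset."""
        history_id, token = self._rows[dataset_id]
        if token is None:
            token = secrets.token_urlsafe(32)
            self._rows[dataset_id] = (history_id, token)
        return token

    def check(self, dataset_id, presented):
        """True only if the presented token matches the stored, non-null one."""
        row = self._rows.get(dataset_id)
        if row is None or row[1] is None:
            return False
        return hmac.compare_digest(row[1].encode(), presented.encode())

    def reset_dataset(self, dataset_id):
        """The 'reset token' button: null the token so old links stop working."""
        history_id, _ = self._rows[dataset_id]
        self._rows[dataset_id] = (history_id, None)

    def reset_history(self, history_id):
        """History-level reset: null the token of every dataset in the history."""
        for dataset_id, (hid, _) in list(self._rows.items()):
            if hid == history_id:
                self._rows[dataset_id] = (hid, None)
```

In this sketch a token keeps working until someone calls one of the reset methods, which matches the "stops working at defined events" property, and a fresh token is only minted the next time a link is generated.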

But what about collaborators? Anyone the user has shared this history with will need to have an access token to access the dataset. Either we use the owner's access token, or we have per-user access tokens.

If we use the owner's access token, let us assume one of the collaborators is evil, and publishes all of the access tokens to their friends. The friends, without proper accounts or authentication, could access the datasets. The owner must then find out about this and revoke the tokens in order to prevent such an attack.

If we use per-user access tokens, our database model becomes more complex, and additionally this attack is still possible. This is a mess. Let's move on.

Non-permanent Tokens

Here we have to define events during which tokens are reset, or choose to reset on every access. Let's assume that we reset on access.

I am told that Galaxy has code for single-use tokens in the codebase already, so this may not be prohibitive to implement.
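For illustration only, a reset-on-access token can be as small as the sketch below. It is not the single-use token code that is said to exist in Galaxy already; the class and method names are hypothetical and the storage is a plain dictionary.

```python
import secrets


class SingleUseTokens:
    """Hypothetical in-memory store where each token is valid for one access."""

    def __init__(self):
        # token -> dataset_id it was issued for
        self._pending = {}

    def issue(self, dataset_id):
        token = secrets.token_urlsafe(32)
        self._pending[token] = dataset_id
        return token

    def redeem(self, token, dataset_id):
        """Consume the token; succeed only on its first use for the right dataset."""
        # pop() removes the token even when the dataset does not match,
        # so a failed attempt also burns it.
        issued_for = self._pending.pop(token, None)
        return issued_for is not None and issued_for == dataset_id
```

With this shape, a URL that gets pasted somewhere public after it has been used once no longer grants anything, which is the property the threat model above cares about.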

Implementation Conclusions

If we re-examine @natefoo's statement there's an interesting clause:

I think the separating domains problem is stuck on whether we're comfortable with a permanent(ish) key to access private data should be part of every view/download link

If we follow GitHub's implementation, this is not the case. Due to the redirection, the user doesn't see the token unless they specifically request to view a file. We can remove that clause, and consider ourselves safe to use a permanent token for URL access. However, we can go further and apply single-use tokens.
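The redirect flow combined with single-use tokens might look roughly like this. It is a sketch under stated assumptions, not Galaxy's or GitHub's actual code: the UGC origin, the URL layout, the `can_access` callback, and both view functions are hypothetical, and the two "domains" share one in-process dictionary for simplicity.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical separate origin for user generated content.
UGC_ORIGIN = "https://usercontent.galaxy.example"

# token -> dataset_id; stands in for whatever storage both domains can reach.
_pending = {}


def main_domain_view(user, dataset_id, can_access):
    """Runs on the main domain, where the session cookie identifies `user`.

    Returns (status, headers). On success the user is redirected to the UGC
    domain with a freshly minted single-use token in the URL.
    """
    if not can_access(user, dataset_id):
        return 403, {}
    token = secrets.token_urlsafe(32)
    _pending[token] = dataset_id
    query = urlencode({"token": token})
    return 302, {"Location": f"{UGC_ORIGIN}/datasets/{dataset_id}?{query}"}


def ugc_domain_view(url):
    """Runs on the cookie-less UGC domain; the token is the only credential."""
    parts = urlsplit(url)
    dataset_id = parts.path.rsplit("/", 1)[-1]
    token = parse_qs(parts.query).get("token", [""])[0]
    # The token is consumed on first use, so replaying the URL fails.
    if _pending.pop(token, None) != dataset_id:
        return 403
    return 200  # a real handler would stream the dataset contents here
```

The link the user sees and shares is the main-domain one, which still requires a normal login; the token-bearing URL only appears after the redirect and is dead once the content has been fetched.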

Noting this clause, I probably could have started my discussion here, but y'all get my entire thought process instead :wink:

Author's Conclusions

| Attribute | Permanent | Single Use | Winner |
| --- | --- | --- | --- |
| View links can be used in scripts | Possible | Impossible | Neither, people should use bioblend |
| Resetting Tokens | User education must be done, menu items added to dataset + history | Nothing required | Single Use |
| Risk from publishing a URL | Significant. Unless the owner knows that this URL has been leaked, and knows how to reset it, and cares, that dataset is essentially permanently public. | Non-existent | Single Use |

Originally I was very much in favour of permanent access tokens, but after writing this, I'm strongly in favour of single-use tokens.

natefoo commented 8 years ago

Thank you for this analysis, it is amaaaaaaaazing, and thanks for leaving your thought process, I'm sure whatever we implement, people will have questions about, and they can be referred to this.

Like you, I started out thinking permanent revokeable tokens were the way to go, but I think you've swung me to single-use tokens (the redirect helps a lot).

hexylena commented 8 years ago

awwwwwww, thanks <3

yeah, I think the redirect solves nearly all of the potential issues with adding this feature by making it completely transparent to end users and generally a zero impact change if they're doing everything how they're supposed to.