Download - Githubissues

alexmv commented 2 years ago

Description

If go-camo is served from its own hostname, one cannot use <a href="..." download> to proxy download links through Camo, since the download attribute is limited to same-origin links.

Extend the URL schema to allow for a /download trailer, which, if set, override Content-Disposition to attachment, as well as providing a hint at the filename, either from the remote Content-Disposition header, or from the original URL.

Checklist

[x] Code compiles correctly
[x] Created tests (if appropriate)
[x] All tests passing
[x] Extended the README / documentation (if necessary)

alexmv commented 2 years ago

I have seen the newly-added tests flake in CI, in a manner which suggests that wikipedia (used in the tests) was not sending a filename= in the Content-Disposition it returned. But it has never failed in local tests that way, though.

I'm surprised that this testsuite uses real live websites as the test endpoints, but it does seem unfortunately not straightforward to stub those out.

alexmv commented 2 years ago

Yeah, this test failure is what referred to above: https://github.com/cactus/go-camo/runs/5346725222?check_suite_focus=true

Locally:

ilwrath ~/src/go-camo (download>) $ make test
Running tests...
?       github.com/cactus/go-camo/cmd/go-camo   [no test files]
?       github.com/cactus/go-camo/cmd/url-tool  [no test files]
ok      github.com/cactus/go-camo/pkg/camo  3.373s
ok      github.com/cactus/go-camo/pkg/camo/encoding 0.254s
ok      github.com/cactus/go-camo/pkg/htrie 0.163s
ok      github.com/cactus/go-camo/pkg/router    4.359s

dropwhile commented 2 years ago

I'll have to think about this one a bit.

With the addition of a /download url component:

It would be slightly more effort to parse the url (not a big deal)
The /download component is not part of the url signature, which means a user could append that url component after url signature generation. I believe most browsers would not prompt to download for a <img src=".../download"> construction, but I'm not sure how implementation specific this is. (possibly a big deal, maybe not)
Added complexity (uncertain)

Since you mentioned a workaround (same-origin), can you describe a specific use-case where image content would require an html5 download attribute? (vs just rendering an image and supporting right click downloading)

alexmv commented 2 years ago

I'll have to think about this one a bit.

That's totally reasonable.

With the addition of a /download url component:

It would be slightly more effort to parse the url (not a big deal)

I think the added complexity here isn't large -- but some minor duplication (checking the number of path components) between pkg.camo.ServeHTTP and pkg.router.ServeHTTP can probably be cleaned up. Happy to do some of that if you'd like!

The /download component is not part of the url signature, which means a user could append that url component after url signature generation.

Yeah -- putting it into the signature gets complicated, since it needs to be done in a way that is backwards-compatible to existing signatures. Since the current content of the HMAC is just the URL, without any surrounding metadata, there isn't an easy way to encode the "download" flag. The only reasonable option would be to prepend something for download URLs, which gets kludgy, since it needs to be something which can't occur at the start of a valid URL (e.g. a null, probably?).

At the same time, it doesn't seem to me like being able to mark a file as content-disposition: attachment is actually exploitable in any way. We're already shipping the client the image content -- it's not like they can't save the image unless we mark it as an attachment (in fact, the vulnerabilities have historically been the other way around, when browsers inline content which they should instead download; see below). Letting clients decide they want an easy way to save the bytes seems like convenience and non-exploitable to me.

I believe most browsers would not prompt to download for a <img src=".../download"> construction, but I'm not sure how implementation specific this is. (possibly a big deal, maybe not)

That'd have to be an intentional choice on the part of whoever was constructing the img tag. But, regardless, Chrome, Firefox, and Safari (the browsers I have on hand) happily display the image inline with no "save as..." popup.

Added complexity (uncertain)

I think this is mostly entangled with the first point, about URL processing.

Since you mentioned a workaround (same-origin), can you describe a specific use-case where image content would require an html5 download attribute? (vs just rendering an image and supporting right click downloading)

Sure, happy to explain more!

We're using go-camo for privacy reasons in Zulip, when displaying images inline for image URLs that are pasted in chat. When you click on the image, we also open it in a lightbox. This lightbox has a "download" link, for convenience. For deployments where camo is deploys behind the same hostname as he rest of Zulip (the most common configuration) we can use <a href="/path-to-camo/..." download> for that link.

However, in large deployments (e.g. our cloud hosted environment) each chat organization is on its own hostname, and we deploy one go-camo instance on its own hostname. The download attribute only works on same-origin URLs, which means that we cannot use <a href="https://go-camo-host/..." download> in those links -- clicking that "download" link will open the image, not download it.

We thus need a mechanism to set the content-disposition. We can either do so inside go-camo, or via a rewrite layer which sits in front of go-camo. Since go-camo has the full response header and URL information for the remote host, it seems the more natural place to put it.

We'd prefer to not put the go-camo host on the same hostname (via path rewriting), as that potentially opens up XSS vulnerabilities if browsers decide to do something wild like run JS embedded in SVG images. Keeping the go-camo content (which is fundamentally untrusted) segregated onto its own domain helps us keep a safety buffer, by using the lack of same-origin as protection. Which is great, until it isn't. :)

Hope that clarifies things!

dropwhile commented 2 years ago

Thanks for the comprehensive response -- this really helps me understand the use-case.

alexmv commented 2 years ago

Any further thoughts on if you'd object to this feature or not?

Adding some debugging, AFAICT the response from wikimedia sometimes is lacking a Content-Disposition, but that's the only header which is changed (?!). I suspect some odd caching layer.

I'll find a new set of images to hit in tests.

dropwhile commented 2 years ago

@alexmv Sorry, work has been very busy recently.

I have thought about this a bit in the meantime though.

Some mental notes follows

In my experience most sites do not supply a content-disposition on standard get requests, so I think parsing the upstream response is going to be futile most of the time, and to reduce potential for issues with trying to parse it, best just avoided entirely.

Then the question is what filename to use?

Some urls don't even contain a useful image name in the request path either, for example giphy urls. https://media1.giphy.com/media/JQQwgVUMDIyAM/giphy.gif
It would be easiest to just use a default filename like "image.{standard-mime-suffix}", but that isn't very friendly.
Another option would be to use the last path component of the camo url, with the underlying mime suffix. For example, with a camo image url of https://img.example.org/D23vHLFHsOhPOcvdxeoQyAJTpvM/aHR0cDovL2dvbGFuZy5vcmcvZG9jL2dvcGhlci9mcm9udHBhZ2UucG5n the filename would be aHR0cDovL2dvbGFuZy5vcmcvZG9jL2dvcGhlci9mcm9udHBhZ2UucG5n.png. Not sure this is any better than the previous option.
I suppose if a query parameter was used instead of a path element, the filename could be supplied as part of the query param: <a href="...?download=somename.gif" download>. The downside is that this exposes the filename to user modification.

Not sure any of these solutions are especially great.

I think the best/safest solution is to support a query param (instead of a path component) that is enabled/disabled via a cli flag, and checks for a url with a query argument like ?download. When detected, sets content-disposition: attachment;, but does not set a filename. Let the browser figure out a filename based on the last portion of the camo url and the content-type.

alexmv commented 2 years ago

@alexmv Sorry, work has been very busy recently.

No worries!

In my experience most sites do not supply a content-disposition on standard get requests, so I think parsing the upstream response is going to be futile most of the time, and to reduce potential for issues with trying to parse it, best just avoided entirely.

Hm. What sorts of issues are you envisioning having? There's a clear spec -- and in the case where it's malformed, we can ignore it. But it seems wasteful to throw away the information if we might get it.

Then the question is what filename to use?

Some urls don't even contain a useful image name in the request path either, for example giphy urls. https://media1.giphy.com/media/JQQwgVUMDIyAM/giphy.gif

This seems like the most straight-forward to me, and the most transparent form. It's the algorithm that most browsers use if you download an image without a filename in the Content-Disposition, so it stands to reason that if we're trying to pretend that Camo doesn't exist, that's what it should suggest the browser use. Browsers already deal with giphy (2).gif if they're downloading directly from the server -- and it doesn't make sense to me to throw out all attempts at parsing the path because some of them are too generic.

I think the best/safest solution is to support a query param (instead of a path component) that is enabled/disabled via a cli flag, and checks for a url with a query argument like ?download. When detected, sets content-disposition: attachment;, but does not set a filename. Let the browser figure out a filename based on the last portion of the camo url and the content-type.

Query parameter seems like it probably makes more sense than adding another path component, sure.

But I think the browser's behaviour of saving 68747470733a2f2f6578616d706c652e636f6d2f6361742e676966.gif instead of cat.gif isn't great, and throws away information which the browser would have if they were making the request to the server directly.

dropwhile commented 2 years ago

I have an experimental branch that builds atop your changes: https://github.com/cactus/go-camo/compare/pr-57 I'll play around with it a little bit and see how I feel about it.

dropwhile commented 11 months ago

@alexmv I found myself with some extra time recently, and I have revisited this issue/PR, but in a slightly different format/style.

In the extra-headers brach, there is a new go-camo flag --enable-extra-headers.

When this option is supplied, extra headers may be included with the url payload as part of url generation. These extra headers will then be returned as part of the response.

These extra headers are part of an additional encoded url path component, also signed under the hmac signature. The underlying data is a json dict/map of string values.

Example standard url format:
/D23vHLFHsOhPOcvdxeoQyAJTpvM/aHR0cDovL2dvbGFuZy5vcmcvZG9jL2dvcGhlci9mcm9udHBhZ2UucG5n
Example of extra-headers component url format:
/sYpOByX0byESnG24QMUX5Ac0IXM/aHR0cDovL2dvbGFuZy5vcmcvZG9jL2dvcGhlci9mcm9udHBhZ2UucG5n/eyJjb250ZW50LWRpc3Bvc2l0aW9uIjoiYXR0YWNobWVudDsgZmlsZW5hbWU9XCJpbWFnZS5wbmdcIiJ9
The above extra-headers json component:
{
    "content-disposition": "attachment; filename=\"image.png\""
}
An possible use-case is setting content-disposition headers on specific image urls.

Here is an example, using url-tool to generate the urls.

# standard go-camo url
go-camo% ./build/bin/url-tool -k test encode -p http://example.com "http://golang.org/doc/gopher/frontpage.png"
http://example.com/D23vHLFHsOhPOcvdxeoQyAJTpvM/aHR0cDovL2dvbGFuZy5vcmcvZG9jL2dvcGhlci9mcm9udHBhZ2UucG5n

# extra-headers go-camo url
go-camo% ./build/bin/url-tool -k test encode -p http://example.com -H "content-disposition: attachment; filename=\"image.png\"" "http://golang.org/doc/gopher/frontpage.png"   
http://example.com/sYpOByX0byESnG24QMUX5Ac0IXM/aHR0cDovL2dvbGFuZy5vcmcvZG9jL2dvcGhlci9mcm9udHBhZ2UucG5n/eyJjb250ZW50LWRpc3Bvc2l0aW9uIjoiYXR0YWNobWVudDsgZmlsZW5hbWU9XCJpbWFnZS5wbmdcIiJ9

Note the hmac signature is different even thought the two origin image urls are the same. Since the "extra-headers" portion (encode json) is included under the hmac signature, this prevents anyone not knowing the hmac from modifying the response headers for clients.

Now using an example html like this:

<html>
<body>
<a href="http://127.0.0.1:8080/sYpOByX0byESnG24QMUX5Ac0IXM/aHR0cDovL2dvbGFuZy5vcmcvZG9jL2dvcGhlci9mcm9udHBhZ2UucG5n/eyJjb250ZW50LWRpc3Bvc2l0aW9uIjoiYXR0YWNobWVudDsgZmlsZW5hbWU9XCJpbWFnZS5wbmdcIiJ9" download>
  <img src="http://127.0.0.1:8080/D23vHLFHsOhPOcvdxeoQyAJTpvM/aHR0cDovL2dvbGFuZy5vcmcvZG9jL2dvcGhlci9mcm9udHBhZ2UucG5n">
</a>
</body>
</html>

I get a rendered proxied image, and when I click on it, it downloads with the specified filename (image.png in this case).

Whatever is generating the go-camo image urls (eg. some backend) can look at the original end user supplied url path filename, and specify that as the download on-click filename, if desired.

Note: A go-camo instance running without the --enable-extra-headers flag would reject (404) these new longer format urls. A go-camo instance running with the --enable-extra-headers flag would still accept urls of the current format (eg. those lacking this new component) as well as the new longer format.

Here is an example of generation of the extra-headers format in python, as another example:
https://github.com/cactus/go-camo/blob/618f375832ab6f1f766bbc07c836ac8f781b4324/examples/python3-base64.py#L19-L51

This still needs a bit more testing, but I would be fine merging the extra-headers branch if you would still find such a feature useful.

dropwhile commented 9 months ago

closing PR

cactus / go-camo

Download #57

Description

Checklist