Content sniffing implementation details

Rob--W commented 7 years ago

Last month I spent two weeks on implementing content sniffing, which was behaviorally identical to Firefox's implementation. Unfortunately, I lost the laptop before I pushed the changes, so I will document what's necessary in case anyone (maybe me?) is interested in implementing a content sniffer.

The full implementation (code and comments) consisted of about 3 - 5k lines of JS code (unit tests were written but not included in this count).

The implementation details are as follows (this is a brain dump from my recollection):

The new webRequest.filterResponseData API can be used to inspect and modify the response body. This filter is activated after the webRequest.onHeadersReceived event stage, for http(s) only. There are several bugs; see the list that I appended to the bug that introduced this new webRequest method: https://bugzilla.mozilla.org/show_bug.cgi?id=1255894#a48785057_447061
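A minimal sketch of inspecting the start of a response with a StreamFilter, assuming the 512-byte sniff window mentioned elsewhere in this issue. The names `bufferForSniffing`, `onSniff` and `SNIFF_LIMIT` are made up for illustration; only `filterResponseData`, `ondata`, `onstop`, `write` and `close` are part of the real API.

```javascript
// Assumption: 512 bytes is the amount of data a sniffer wants to see.
const SNIFF_LIMIT = 512;

/**
 * Buffers the start of a response and hands it to `onSniff` once
 * SNIFF_LIMIT bytes have arrived, or when the stream ends early.
 * `filter` is a StreamFilter from browser.webRequest.filterResponseData().
 * `onSniff` is a hypothetical callback that receives a Uint8Array.
 */
function bufferForSniffing(filter, onSniff) {
  const chunks = [];
  let buffered = 0;
  let sniffed = false;

  function flush() {
    sniffed = true;
    const head = new Uint8Array(buffered);
    let offset = 0;
    for (const chunk of chunks) {
      head.set(chunk, offset);
      offset += chunk.length;
    }
    onSniff(head.subarray(0, SNIFF_LIMIT));
    // Pass everything that was buffered through unmodified.
    filter.write(head);
  }

  filter.ondata = (event) => {
    if (sniffed) {
      filter.write(event.data);
      return;
    }
    chunks.push(new Uint8Array(event.data));
    buffered += event.data.byteLength;
    if (buffered >= SNIFF_LIMIT) flush();
  };

  filter.onstop = () => {
    if (!sniffed) flush();
    filter.close();
  };
}

// Registration inside an extension (needs the "webRequest" and
// "webRequestBlocking" permissions). Skipped outside a browser.
if (typeof browser !== "undefined") {
  browser.webRequest.onHeadersReceived.addListener(
    (details) => {
      const filter = browser.webRequest.filterResponseData(details.requestId);
      bufferForSniffing(filter, (head) => console.log("sniffing", head.length, "bytes"));
    },
    { urls: ["*://*/*"], types: ["main_frame", "sub_frame"] },
    ["blocking"]
  );
}
```

The sketch only observes the data. A real sniffer would decide inside `onSniff` whether to rewrite the buffered bytes before they are written back.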
Content sniffing happens in two stages (much more details below):
- At first entries in the NS_CONTENT_SNIFFER_CATEGORY (aka "net-content-sniffers") category are used to estimate the MIME type.
- If unknown, then basically the logic of nsUnknownDecoder::DetermineContentType is used (which includes entries from the NS_DATA_SNIFFER_CATEGORY (aka "content-sniffing-services") category).
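The two stages can be pictured as two ordered lists of sniffers, where the second list is only consulted if the first yields nothing. This is a rough sketch, not Firefox's code: `firstMatch`, `sniffInTwoStages` and `sniffPng` are hypothetical names, and the PNG check is only there to have something to run.

```javascript
const UNKNOWN_TYPE = "application/x-unknown-content-type";

// Runs sniffers in order; the first non-empty answer wins.
function firstMatch(sniffers, bytes) {
  for (const sniff of sniffers) {
    const mime = sniff(bytes);
    if (mime) return mime;
  }
  return "";
}

// Stage one stands in for the "net-content-sniffers" category,
// stage two for the nsUnknownDecoder-style fallback logic.
function sniffInTwoStages(bytes, stageOne, stageTwo) {
  return firstMatch(stageOne, bytes) || firstMatch(stageTwo, bytes) || UNKNOWN_TYPE;
}

// Hypothetical sniffer for illustration: checks the PNG signature.
function sniffPng(bytes) {
  const signature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
  return signature.every((b, i) => bytes[i] === b) ? "image/png" : "";
}
```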
The extension can force a specific content type after the onHeadersReceived stage by using webRequest.filterResponseData to change the response body. For some types, prepending magic bytes can be done in a transparent way (e.g. HTML and plain text). For others, the response can be replaced with an HTML document that embeds a full-page iframe requesting the original URL (with a cache buster). The extension can then intercept this request and pipe the original response to this new request. The reason for using an iframe is to ensure that the original response stream is not aborted. If the original response is not important, redirecting would work too.
Basically, Firefox follows the following logic to determine what to do with a given response body:
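A minimal sketch of the "prepend magic bytes" approach for a StreamFilter. The choice of prefixes is an assumption on my part (a UTF-8 BOM for plain text, a doctype for HTML), and `prependToResponse` is a hypothetical helper name.

```javascript
// Assumed prefixes, not taken from the original implementation.
const UTF8_BOM = new Uint8Array([0xef, 0xbb, 0xbf]);
const HTML_PREFIX = new TextEncoder().encode("<!DOCTYPE html>");

/**
 * Writes `prefix` once before the first byte of the response body,
 * then passes the rest of the body through unchanged.
 * `filter` is a StreamFilter from browser.webRequest.filterResponseData().
 */
function prependToResponse(filter, prefix) {
  let prefixWritten = false;
  const writePrefix = () => {
    if (!prefixWritten) {
      prefixWritten = true;
      filter.write(prefix);
    }
  };

  filter.ondata = (event) => {
    writePrefix();
    filter.write(event.data);
  };

  filter.onstop = () => {
    writePrefix(); // also covers an empty response body
    filter.close();
  };
}
```

Usage would be `prependToResponse(filter, UTF8_BOM)` to nudge the sniffer towards text, or `prependToResponse(filter, HTML_PREFIX)` for HTML.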
- Extract the MIME type from the Content-Type header.
- Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/netwerk/base/nsURLHelper.cpp#978-1030
- If the MIME is not set or an empty string, treat it as "application/x-unknown-content-type" and continue at the next bullet point.
- If the MIME is supported by Firefox, display inline and don't sniff (follow the logic at nsDocumentOpenInfo::DispatchContent as I mentioned at https://github.com/Rob--W/open-in-browser/issues/1#issuecomment-331710653)
- Exception: for the text/plain, application/octet-stream and application/x-unknown-content-type MIME types, Firefox MAY activate content sniffing, and open a download dialog even if the content would otherwise be displayed inline (text/plain), or display the content inline even though the content usually triggers a download dialog (application/octet-stream).
- If the MIME is not recognized by Firefox, open a download dialog.
- If the MIME is application/octet-stream or application/x-unknown-content-type, perform media sniffing:
- Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/toolkit/components/mediasniffer/nsMediaSniffer.cpp#141-210
- Note: If a document was sniffed as media, Firefox will immediately switch to a document, and the webRequest.filterResponseData method can NOT be used to modify the response stream. To replace the document, you must run a content script in this new media document.
- If the MIME is text/html, application/octet-stream or containing "xml", then the feed sniffer is activated.
- Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/browser/components/feeds/nsFeedSniffer.cpp#206-336
- Note: I did not implement this because of the rare conditions, and the fact that the type was already inline (I only need to implement content sniffing if the type is potentially going to display a download dialog, since Open in Browser is only relevant for that situation).
- If the Content-Type is a case-sensitive match for text/plain, text/plain; charset=ISO-8859-1, text/plain; charset=iso-8859-1 or text/plain; charset=UTF-8, AND the Content-Encoding response header is NOT set, then the sniffer will either force a download dialog or display inline:
- Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#895-943
- Basically, if the response starts with a Unicode BOM, or the first 512 bytes (or fewer if the response ends early) consist only of text characters: treat as text. Otherwise treat as application/octet-stream, i.e. a download dialog.
  - Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#666-714
- If the MIME is "application/x-unknown-content-type" (or empty, as mentioned before), sniff magic bytes.
- Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#434-530
- Basically, the MIME is found in the following order:
  1. Look at magic bytes.
  2. Call the sniffers in the NS_DATA_SNIFFER_CATEGORY (aka "content-sniffing-services") category
    - Media sniffer - https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/toolkit/components/mediasniffer/nsMediaSniffer.cpp#141-210 (complicated - magic bytes and structure parsing)
    - Image sniffer - https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/image/imgLoader.cpp#2646-2701 (simple - magic bytes only)
  3. Try HTML sniffing.
  4. Try sniffing from the URL.
  5. Fall back to the same method as text/plain sniffing (which would result in text/plain or application/octet-stream).
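The text/plain branch described above (which is also the final fallback for unknown types) could be sketched roughly as follows. The definition of a "text character" in `isTextByte` is my assumption of what the check looks like, and all function names are hypothetical.

```javascript
// Content-Type header values for which text/plain is still sniffed.
// The comparison is case-sensitive, as described in the list above.
const SNIFFABLE_TEXT_PLAIN = new Set([
  "text/plain",
  "text/plain; charset=ISO-8859-1",
  "text/plain; charset=iso-8859-1",
  "text/plain; charset=UTF-8",
]);

// True if the text-or-binary check should run for this response.
function shouldSniffTextPlain(contentTypeHeader, contentEncodingHeader) {
  return SNIFFABLE_TEXT_PLAIN.has(contentTypeHeader) && !contentEncodingHeader;
}

function startsWithBom(bytes) {
  const [a, b, c] = bytes;
  return (
    (a === 0xfe && b === 0xff) || // UTF-16 BE
    (a === 0xff && b === 0xfe) || // UTF-16 LE
    (a === 0xef && b === 0xbb && c === 0xbf) // UTF-8
  );
}

// Assumption: a "text character" is any byte above 31, plus
// tab through carriage return (9-13) and escape (27).
function isTextByte(byte) {
  return byte > 31 || (byte >= 9 && byte <= 13) || byte === 27;
}

// Looks at no more than the first 512 bytes of the body.
function sniffTextOrBinary(bytes) {
  const head = bytes.subarray(0, 512);
  if (startsWithBom(head) || head.every(isTextByte)) return "text/plain";
  return "application/octet-stream";
}
```

For example, `sniffTextOrBinary(new TextEncoder().encode("hello\n"))` yields "text/plain", while a body starting with a NUL byte (and no BOM) yields "application/octet-stream".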

Other notes relevant for the implementation:

Content sniffing relies on up to 512 bytes of data, but the media sniffer may try to use more if available.
At least for text and HTML, Firefox will only display the response after 512 bytes of data have been written (or 1024, I don't remember).
For images and media, Firefox will switch to a special image/media document upon detecting the type (typically via magic bytes; for media sniffer more than magic bytes).
There is a draft for a specification at https://mimesniff.spec.whatwg.org/. This specification is close to Firefox's content sniffing. It does not mention media sniffing for application/octet-stream, nor does it mention the special application/x-unknown-content-type (this MIME is an artefact of Firefox's implementation; internally it represents the default value for a MIME type in an HTTP channel).
Character encoding should be respected/supported. For text/plain the UTF-8 and UTF-16 BOM can be used. For text/html, the content can be transcoded via the TextDecoder/TextEncoder APIs (except for UTF-16, which should not be used for HTML anyway).
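A small sketch of the TextDecoder/TextEncoder transcoding mentioned above, assuming the goal is to turn a body in some source charset into UTF-8 chunk by chunk. `createTranscoder` is a hypothetical helper. The `stream: true` option keeps multi-byte sequences that are split across chunks intact.

```javascript
/**
 * Returns an object that converts chunks from `sourceCharset` to UTF-8.
 * TextEncoder can only produce UTF-8, so that is the only target.
 */
function createTranscoder(sourceCharset) {
  const decoder = new TextDecoder(sourceCharset);
  const encoder = new TextEncoder();
  return {
    // Call for every chunk of the response body, in order.
    transcode(chunk) {
      return encoder.encode(decoder.decode(chunk, { stream: true }));
    },
    // Call once at the end to flush any bytes the decoder held back.
    finish() {
      return encoder.encode(decoder.decode());
    },
  };
}
```

In a StreamFilter, `transcode(event.data)` would be written in `ondata` and `finish()` in `onstop`, e.g. with `createTranscoder("windows-1252")` for a legacy Western European page.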

Bugs in the webRequest.filterResponseData API that I haven't reported upstream (yet?):

- If the Content-Type is application/x-unknown-content-type and the response is content-encoded, then the filtered response must also be encoded using the same type (e.g. gzipped). For other types, e.g. text/html, the encoding is transparent, i.e. the value of the Content-Encoding header does not matter. The easiest way around this is to remove the Accept-Encoding request header or the Content-Encoding response header (or set it to "identity"). The more difficult way to get around this is to implement gzipping (and possibly other (obscure) encoding schemes such as deflate/brotli).
- If a StreamFilter is closed, Firefox will always commit a navigation to a new document, even if no data was written to that StreamFilter, and even if the tab/frame has navigated to a different page. The only work-around that I could think of is to keep the StreamFilter open forever (yuck).
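The easier workaround for the first bug, removing the Accept-Encoding request header so that the server has no reason to compress the response, might look like the sketch below. `withoutHeader` is a hypothetical helper, and the URL and type filters are only an example.

```javascript
// Returns a copy of a webRequest header list without the named header.
// Header names are compared case-insensitively.
function withoutHeader(headers, name) {
  const lower = name.toLowerCase();
  return headers.filter((header) => header.name.toLowerCase() !== lower);
}

// Registration inside an extension (needs the "webRequest" and
// "webRequestBlocking" permissions). Skipped outside a browser.
if (typeof browser !== "undefined") {
  browser.webRequest.onBeforeSendHeaders.addListener(
    (details) => ({
      requestHeaders: withoutHeader(details.requestHeaders, "Accept-Encoding"),
    }),
    { urls: ["*://*/*"], types: ["main_frame", "sub_frame"] },
    ["blocking", "requestHeaders"]
  );
}
```

The trade-off is that every matching response is then transferred uncompressed, so a narrower filter than the one shown here is probably desirable.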

def00111 commented 6 years ago

Can you look here? https://bugzilla.mozilla.org/show_bug.cgi?id=1287264

Rob--W commented 6 years ago

@def00111 I looked (and I filed a new feature request at https://bugzilla.mozilla.org/show_bug.cgi?id=1425479). Why did you want me to look at that bug?

def00111 commented 6 years ago

Why did you want me to look at that bug?

I just want to have you look at this bug :)

Maybe, we can also expose nsIChannel.contentDispositionFilename [1]?

[1] https://dxr.mozilla.org/mozilla-central/rev/2386800ec051598ff4dd42da1118abcf05299fc1/netwerk/base/nsIChannel.idl#327

def00111 commented 6 years ago

I also have another idea. Can we add the download [1] attribute value to webRequest.onBeforeRequest details [2]? To get the filename from download attribute? Like with Content-Disposition header in webRequest.onHeadersReceived [3]?

Look here please: https://github.com/def00111/always-preview/blob/master/content.js

[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a#attr-download [2] https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/webRequest/onBeforeRequest#details [3] https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/webRequest/onHeadersReceived

Rob--W commented 6 years ago

Maybe, we can also expose nsIChannel.contentDispositionFilename [1]?

This extension is a very specialized use case. While having such a property would make the life of me as an extension developer easier, I don't think that that convenience outperforms the maintenance cost of exposing the info through the webRequest extension API. Especially since it can fully be implemented in JavaScript with minimal performance impact - https://github.com/Rob--W/open-in-browser/blob/05b80a3ce151737cfc7735eb1a714dfa84f3e3a5/extension/content-disposition.js

Can we add the download [1] attribute value to webRequest.onBeforeRequest details [2]?

This, on the other hand, could be a good reason to support the API enhancement. But...: <a download> does not work for cross-origin resources, only same-origin resources. Furthermore, <a download> is more commonly used for JS-generated content (blob:/data:-URLs), which is not intercepted by my extension. So the value of an accessor for the value of <a download> is limited.

In the case of <a download> to a same-origin resource without a Content-Disposition response header (which I presume is rare), users can just open the link in a new tab to get the dialog if they want to view it inline (or trigger an Open in Browser dialog). In the worst case (e.g. if the link is not visible), they can use the extension menu in the Tools menu to force the dialog to appear anyway.

I appreciate your comments, but I'd like to keep the comments here on-topic. If you have more to say (unrelated to content sniffing), please open a new issue or continue via e-mail.

def00111 commented 6 years ago

is more commonly used for JS-generated content (blob:/data:-URLs), which is not intercepted by my extension. So the value of an accessor for the value of <a download> is limited.

This page: https://atpscan.global.hornetsecurity.com/safe_download.php?uri=aHR0cHM6Ly93d3cuc3dwLWJlcmxpbi5vcmcvZmlsZWFkbWluL2NvbnRlbnRzL3Byb2R1Y3RzL2FrdHVlbGwvMjAxNUEwM193Z24ucGRm&cd=MjAxNWEwM193Z24ucGRm&type=dat

def00111 commented 6 years ago

Can I use content-disposition.js [1] in my add-on?

[1] https://github.com/Rob--W/open-in-browser/blob/05b80a3ce151737cfc7735eb1a714dfa84f3e3a5/extension/content-disposition.js

def00111 commented 6 years ago

Is this the same as what Firefox does?

Rob--W commented 6 years ago

Can I use content-disposition.js [1] in my add-on?

Yes. When you add a commit in your repo, do link back to the original source in the commit description. Then in the future it will be easier for others to check whether the implementation is still up-to-date.

Is this the same as what Firefox does?

Yes, except for a few cases of malformed response headers (I don't think that you will ever find these in the wild). See the commit description and unit tests from https://github.com/Rob--W/open-in-browser/commit/6f3bbb8bbfc1e3e943200fffdb68d35075e82ddd

Rob--W / open-in-browser

Content sniffing implementation details #5