codeforscience / webdata

Discussion on improving numeric computing/data science on the web (JavaScript, HTML5)

Browser file and network I/O #5

Open max-mapper opened 8 years ago

max-mapper commented 8 years ago

Node.js showed us that JavaScript can be a great choice for file + network programming if it is backed by non-blocking I/O under the hood. The way Node.js does this is by interfacing with the OS in C++ code and exposing a JavaScript API on top. When Node was created it had the benefit of starting from scratch, and it focused on a small set of core I/O interfaces from the beginning: the filesystem, TCP sockets, HTTP, and so on.

Each of these has a JavaScript API, e.g. var net = require('net'), but when you actually create a TCP socket with net it uses the C++ machinery under the hood in Node to create the TCP socket in your OS, and any data that goes in and out of the socket is relayed between your code and the OS as Node Buffer objects (which exist purely to hold binary data efficiently, something JS couldn't do when Node was created).
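For example, a minimal sketch of what that looks like from the JS side (the host and port here are placeholders):

```js
// A TCP client using node's net module. Data arrives from the OS as Buffer
// objects, one chunk at a time, via the non-blocking event loop.
var net = require('net')

var socket = net.connect(8124, 'example.com')

socket.on('data', function (chunk) {
  // `chunk` is a Buffer holding raw bytes handed up from the OS
  console.log('received', chunk.length, 'bytes')
})

socket.write('hello\n') // strings and Buffers both get relayed down to the OS
```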

Under the hood the C++ I/O interfaces were written in a non-blocking way, which made it possible to build a streaming JavaScript interface on top of them. Streaming support means you can process files in real time, and process large files without crashing the process (since you aren't trying to read too much data into memory at once). Both of these qualities make Node great for processing data.

I/O is important for a variety of data processing and data science use cases. Reading large files, downloading large files, uploading large files, writing large files, etc.

I/O in browser JavaScript is much more limited when it comes to working with network and file data. I'll list all I/O options in browsers today and describe their weaknesses:

HTTP (XHR)

XHR is the main HTTP client built in to browsers. Similar to the net module in node, XHR (specifically XHR version 2), is implemented in native (C++) code in the browser and exposed as a JavaScript API.

Major flaw: Does not support streaming data.

Say you want to download a 20GB file and write a grep function that counts how many times a certain word occurs in the file. In Node this would be a 5 line program. XHR, however, buffers the entire response body into a single buffer. This means your browser will eventually slow to a halt or crash as the response buffer exceeds the available RAM. Also, accessing the data from XHR as it arrives only works if the data is text, so pseudo-streaming binary data is totally impossible.
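For comparison, roughly what the Node version looks like (a sketch: the URL and search word are made up, and matches that straddle chunk boundaries are ignored to keep it short):

```js
// Stream the response and count matches without holding the whole file in memory
var http = require('http')
var count = 0

http.get('http://example.com/big-file.txt', function (res) {
  res.setEncoding('utf8')
  res.on('data', function (chunk) {
    count += (chunk.match(/someword/g) || []).length
  })
  res.on('end', function () { console.log(count) })
})
```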

This is a subtle but important difference from the way node works, which is to split the response into many smaller buffers. Once your code has processed a buffer it can be garbage collected. However, there is no way to garbage collect processed buffers with XHR.

The only workaround is to use HTTP Range headers and make multiple HTTP requests for different byte ranges of the file. However, this incurs a significant performance penalty, especially over TLS connections (which need something like a 5 RTT handshake), and it requires the server to support Range requests, which is nowhere near universal.
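For the record, that workaround looks something like this (a sketch; the URL is a placeholder and error handling is minimal):

```js
// Fetch one byte range at a time with XHR -- only works if the server honors Range
function getRange (url, start, end, cb) {
  var xhr = new XMLHttpRequest()
  xhr.open('GET', url)
  xhr.responseType = 'arraybuffer'
  xhr.setRequestHeader('Range', 'bytes=' + start + '-' + end)
  xhr.onload = function () { cb(null, new Uint8Array(xhr.response)) }
  xhr.onerror = function () { cb(new Error('range request failed')) }
  xhr.send()
}

// e.g. read the first megabyte of a file
getRange('/big-file.bin', 0, 1024 * 1024 - 1, function (err, bytes) {
  if (err) throw err
  console.log('got', bytes.length, 'bytes')
})
```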

HTTP (Fetch)

Fetch is a replacement for XHR/XHR2. As of Chrome 43 it supports streaming responses! https://googlechrome.github.io/samples/fetch-api/fetch-response-stream.html. It doesn't work in Firefox yet, but support is planned.

There is a 1-2 year old WHATWG initiative called Streams: https://streams.spec.whatwg.org/. It is trying to come up with a JS streaming API that can be shared across I/O interfaces in the browser. To quote this blog post, "The Streams API is important because it allows large resources to be processed in a memory efficient way." Fetch is (I think) the first thing to use it.
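Here's a rough sketch of what a streaming fetch looks like in Chrome 43+ (the URL is a placeholder):

```js
// Read the response body chunk by chunk instead of buffering the whole thing
fetch('/big-file.txt').then(function (response) {
  var reader = response.body.getReader()
  var total = 0
  return reader.read().then(function process (result) {
    if (result.done) return total
    total += result.value.length // result.value is a Uint8Array
    return reader.read().then(process)
  })
}).then(function (total) {
  console.log('streamed', total, 'bytes')
})
```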

Issue: not widely implemented/spec isn't finished

I think Fetch and Streams are The Future(tm), but they are still being evaluated and designed. There is an opportunity for us to share data science/data processing use cases with the spec designers to make sure they end up making our lives more awesome.

WebSockets

WebSockets are a message-based protocol on top of TCP, and are supported in most browsers today. There are issues deploying WebSockets on networks with proxies that don't understand how to route WebSocket traffic, but those issues go away if you use TLS (wss:// instead of ws://), since the encrypted headers keep the proxies from reading (and mangling) the traffic, so they just pass it through.

Two nice things about WebSockets

You can read/write individual buffers out of them

This is in contrast to XHR, which only lets you read or write a single buffer per request. With a WebSocket, you open a connection and can read or write as many individual buffers as you want over the lifetime of the socket. This is much better for streaming large files into or out of the browser, since the programmer decides how big or small each buffer should be.

You can transfer binary data over them

Whereas XHR is very limited when it comes to binary data, WebSockets can run in 'arraybuffer' mode, which lets you read and write binary buffers without resorting to hacks like Base64 encoding, which is very inefficient, especially for large files.
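A minimal sketch of both points (the URL is a placeholder):

```js
var ws = new WebSocket('wss://example.com/data')
ws.binaryType = 'arraybuffer' // receive binary messages as ArrayBuffers

ws.onmessage = function (event) {
  // each message arrives as its own buffer -- no Base64, no single giant response
  var bytes = new Uint8Array(event.data)
  console.log('received', bytes.length, 'bytes')
}

ws.onopen = function () {
  // write as many individual buffers as you like over the socket's lifetime
  ws.send(new Uint8Array([1, 2, 3]).buffer)
}
```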

Issue: no backpressure

TCP has a built-in mechanism for knowing when the other side of the connection is clogged, so that the writer can slow down. This is very important for memory efficiency in real-world networking: without backpressure, a single user on a slow connection can force a server to buffer lots of data in RAM while waiting for that user to download it.

Unfortunately the WebSocket API does not expose the backpressure signals from TCP :( There is an opportunity here to request that the WebSocket API be improved. We actually got the standards bodies to fix WebRTC DataChannels for the same issue: https://github.com/feross/simple-peer/issues/39
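The closest thing to a workaround today is polling ws.bufferedAmount, which only tells you how much data is queued locally. A sketch (the high-water mark and retry interval are arbitrary, and getNextChunk is a hypothetical function returning the next buffer or null):

```js
var HIGH_WATER_MARK = 1024 * 1024 // 1MB of locally queued data

function writeChunks (ws, getNextChunk) {
  var chunk
  while ((chunk = getNextChunk())) {
    ws.send(chunk)
    if (ws.bufferedAmount > HIGH_WATER_MARK) {
      // too much queued in the browser: back off and try again shortly
      setTimeout(function () { writeChunks(ws, getNextChunk) }, 100)
      return
    }
  }
}
```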

WebRTC DataChannels

DataChannels are somewhat similar to WebSockets (their API design was almost copy-pasted), but instead of sitting on top of a client-server TCP connection they sit on top of WebRTC PeerConnections, which can be direct connections between two browsers or between a browser and a server process.

They are in nearly every browser today, the notable exception being Safari.

The two main differences between DataChannels and WebSockets are:

DataChannels are always encrypted

WebSockets use TLS and the HTTPS certificate system for encryption, whereas DataChannels have built-in peer-to-peer encryption that is always on.

DataChannels have a reliable and unreliable mode

The networking machinery underneath WebRTC is too complicated to detail here, but whereas WebSockets only run on top of TCP (a reliable transport), each DataChannel can be configured as reliable or unreliable when it is created.
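The mode is chosen per channel at creation time. A sketch (channel labels are made up and all the signaling is omitted):

```js
var pc = new RTCPeerConnection()

// default: ordered + reliable, roughly TCP-like semantics
var reliable = pc.createDataChannel('file-transfer')

// unordered + no retransmits, roughly UDP-like semantics
var lossy = pc.createDataChannel('telemetry', {
  ordered: false,
  maxRetransmits: 0
})
```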

Issue: connection overhead

Making a WebRTC connection can be relatively slow, with a ~5 RTT handshake plus ICE/STUN negotiation. However, Google is apparently working on a version of DataChannels powered by their new QUIC protocol, which boasts 0 RTT handshakes in the best case.

Issue: no backpressure

DataChannels copied WebSockets and inherited the lack of backpressure. But we got in contact with Chrome and Firefox and got them to fix the spec! https://github.com/feross/simple-peer/issues/39

File System

Around 5 years ago there was a File System W3C specification, but as of 2014 it is abandoned and there doesn't seem to be any replacement in the works.

IMO the only good part of the File System API is FileReader, and it happens to be one of the only parts that was actually ever widely implemented.

Good: reading files

FileReader lets the user select a file (or multiple files) and gives browser JS random access to read the file's data, in whatever chunk size the programmer specifies. This is really nice! I wrote a module that wraps this in a stream: https://github.com/maxogden/filereader-stream
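A sketch of the chunked-read pattern (assumes a hypothetical <input type="file" id="picker"> on the page; the chunk size is arbitrary):

```js
var CHUNK_SIZE = 1024 * 1024 // 1MB per read

document.getElementById('picker').addEventListener('change', function (e) {
  var file = e.target.files[0]
  var offset = 0

  function readNext () {
    if (offset >= file.size) return console.log('done')
    var reader = new FileReader()
    reader.onload = function () {
      var chunk = new Uint8Array(reader.result)
      // ...process chunk here...
      offset += chunk.byteLength
      readNext()
    }
    // random access: slice any byte range out of the File (a Blob) and read it
    reader.readAsArrayBuffer(file.slice(offset, offset + CHUNK_SIZE))
  }
  readNext()
})
```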

Bad: writing files

However, say you want to write a file to the user's hard drive. You can ask the user to choose where they want to save the file, but the File System saveAs() method only lets you write a single buffer to the file, and there is no way to append to a file! This means you can only write files small enough to fit in a single ArrayBuffer in RAM.
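For contrast, the single-shot write that is possible today boils down to building the entire file in memory and handing it to the user as a download. A sketch using the plain Blob + anchor trick (not saveAs()):

```js
function downloadWholeFile (arrayBuffer, filename) {
  var blob = new Blob([arrayBuffer])
  var a = document.createElement('a')
  a.href = URL.createObjectURL(blob)
  a.download = filename // e.g. 'results.csv'
  a.click()
  // (a real app would revoke the object URL once the download has started)
}
```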

The lack of a streaming file write interface in the browser (and missing from W3C specs) is a huge issue, and needs some community championing.

IndexedDB

IndexedDB is a non-blocking key/value database available to browser JS, backed by LevelDB in Chrome and SQLite in Firefox. It's supported in most browsers.

The API is really complicated, and there are bugs and inconsistencies between implementations in different browsers, but it is a real database that you can actually use to store key/value data. It doesn't work well for storing large data like files (for the same reasons you wouldn't store a filesystem in a MySQL table, but would use blobs instead), but it works relatively well for mutable data that you want random access to. It's also non-blocking, which means reading and writing won't freeze the browser UI.
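A minimal sketch of using it as a plain key/value store (database and store names are made up):

```js
var open = indexedDB.open('mydb', 1)

open.onupgradeneeded = function () {
  open.result.createObjectStore('store')
}

open.onsuccess = function () {
  var db = open.result
  var tx = db.transaction('store', 'readwrite')
  tx.objectStore('store').put({ hello: 'world' }, 'some-key')

  tx.oncomplete = function () {
    var get = db.transaction('store').objectStore('store').get('some-key')
    get.onsuccess = function () { console.log(get.result) } // { hello: 'world' }
  }
}
```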

I don't have any major criticisms of IndexedDB, other than that it seems to be difficult to implement because it is a very complicated spec. If it were less overengineered, I feel the inconsistencies between implementations would be easier to fix.

PHEW!

OK, that was a tour of the state of I/O in the browser as it relates to data processing/data science use cases.

If you have feedback, or ideas on how to improve this stuff, or questions, leave a comment below.

inexorabletash commented 8 years ago

Nit: FileReader is part of FileAPI - the spec that defines Blobs and Files, not a FileSystem spec per se. (At least, that's the state now. It may have migrated between documents at some point in the past.)

I would expect real Stream support to be bolted on to Blob/File in the nearish future. (Thanks for your wrapper!)

There is some (slow) progress being made on the FileSystem API front. The highest-priority scenario, directory upload, is being worked on in a Directory Upload proposal, which builds on FileAPI to add directories (enumeration, paths, ...). Once that's sorted out, a revised FileSystem API becomes a much smaller step.

(Do note that we don't have consensus on those proposals yet, let alone implementations.)

As I said, this is a nitpick; for anyone not groveling over standards/implementations, I agree that at a high level you can contrast "HTTP" vs. "Files" vs. "IndexedDB".

inexorabletash commented 8 years ago

Another nit, re: Indexed DB: "If it were less overengineered, I feel the inconsistencies between implementations would be easier to fix".

While I can agree that IDB may have been overdesigned (e.g. requiring buy-in to the monotonically increasing version/upgrade scheme), the two main sources of incompatibilities in practice are:

Neither of these seems to be caused by overengineering of the design. But then again, I suppose both would be avoided if the API had been as simple as a basic async key/value store, with everything else done in user space.

tbuchok commented 8 years ago

I owe myself a bit more time investigating, but an interesting addition may be Media Source Extensions.

From a high level, what is encouraging is the ability to push arbitrary bytes into a stream - building an HLS or DASH player in pure JS being a canonical example of what this would enable.

If anyone's done some work in this area and can provide real info beyond my naive attempt, it may add some value to this thread.

max-mapper commented 8 years ago

@inexorabletash excellent points, thanks. I've updated my summary in a few places.

Re: a basic async key/value store, I agree a simpler thing would have been nice and would still have enabled many use cases. For example, in Node there is a very active LevelDB ecosystem that has taken the LevelDB primitive (binary key/value storage with lexicographic indexes and forward/reverse range queries) and built a ton of powerful abstractions on top. I have a wrapper called level-js that takes IDB and wraps a LevelDB API on top of it (kind of ironic, as it's LevelDB implemented on top of LevelDB in Chrome). I haven't personally seen many applications that use e.g. the transactions in IDB, but then again I haven't been looking for them.

schmod commented 8 years ago

Worth noting: PhantomJS doesn't have any support for IndexedDB prior to version 2 (which still isn't in widespread use yet for an odd assortment of reasons), while support in iOS is too minimal/buggy to use seriously*.

*IndexedDB can be polyfilled on iOS if you can use WebSQL as your backing store, but in my experience this always ends up being really buggy in practice.

Storing blobs in IDB works reasonably well, and there aren't necessarily better alternatives for that purpose (that I know of and/or are widely supported). Recent versions of Chrome/FF can store blobs directly in IndexedDB, so there is presumably some sort of optimization being applied to this process. If you want to store something that doesn't fit in LocalStorage, or persist data in a web worker/service worker, IndexedDB is pretty much the only option.

max-mapper commented 8 years ago

I've been benchmarking IndexedDB the last couple of days and have some notes:

max-mapper commented 8 years ago

Update: since I found the 40x improvement mentioned above by accident, I started looking into why the difference in performance exists.

https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Basic_Concepts_Behind_IndexedDB#Database says "A database connection can have several active transactions associated with it at a time, so long as the writing transactions do not have overlapping scopes." and later "You can still start several transactions with the same scope at the same time, but they just queue up and execute one after another."

I'm guessing the approach that was faster either used unique scopes (I'm not sure what the scope rules are, to be honest) and was therefore able to have parallel writers, OR it was using non-unique scopes but for some reason the queue mentioned above somehow optimizes writes. Need to do further investigation.

Edit: after reading http://w3c.github.io/IndexedDB/ it seems that 'scope' just means each objectStore only allows 1 write process at a time.

max-mapper commented 8 years ago

Another update: I decided to compare Chrome (47) vs FF (42).

- FF serial: 379 KB/s
- FF parallel: 2230 KB/s
- Chrome serial: 58 KB/s
- Chrome parallel: 1161 KB/s

The benchmarks are linked above. The serial test waits for onsuccess before queuing the next write; the parallel test doesn't wait for onsuccess.
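In sketch form, the difference between the two tests is roughly this (store name is made up):

```js
// serial: issue the next put only after the previous one's onsuccess fires
function writeSerial (db, values) {
  var store = db.transaction('store', 'readwrite').objectStore('store')
  function putNext (i) {
    if (i >= values.length) return
    store.put(values[i], i).onsuccess = function () { putNext(i + 1) }
  }
  putNext(0)
}

// parallel: queue all the puts up front and let them complete on their own
function writeParallel (db, values) {
  var store = db.transaction('store', 'readwrite').objectStore('store')
  for (var i = 0; i < values.length; i++) store.put(values[i], i)
}
```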

Very interesting results. Perhaps there is a difference in the way onsuccess works for readwrite transactions between Chrome and FF, e.g. maybe one is doing a disk flush and one isn't?

Edit: Upon further inspection I might be wrong about the difference between the benchmarks above -- will try to make a better isolated standalone benchmark so I can nail down the variables.

inexorabletash commented 8 years ago

Chrome flushes, FF does not.

We (Chrome) have "consider not flushing by default, allow flushing option" on our to-do list, but it requires some additional investigation.

binarymax commented 8 years ago

After EdgeConf in June, I wrote up my thoughts on browser file I/O here: http://max.io/articles/the-state-of-state-in-the-browser/ (the post gives lots of background and the juice is at the bottom in the 'Proposal' section).

max-mapper commented 8 years ago

@binarymax nice summary of this whole webdata repo:

[screenshot]