max-mapper opened this issue 8 years ago
Nit: `FileReader` is part of FileAPI - the spec that defines `Blob`s and `File`s - not a FileSystem spec per se. (At least, that's the state now. It may have migrated between documents at some point in the past.)
I would expect real `Stream` support to be bolted on to `Blob`/`File` in the nearish future. (Thanks for your wrapper!)
There is some (slow) progress being made on the FileSystem API front - the highest-priority scenario, directory upload, is being worked on in a Directory Upload proposal. This builds on FileAPI to add directories (enumeration, paths, ...). Once that's sorted out, a revised FileSystem API becomes a much smaller step.
(Do note that we don't have consensus on those proposals yet, let alone implementations.)
As I said, this is a nitpick; to anyone not groveling over standards/implementations, I agree that at a high level you can contrast "HTTP" vs. "Files" vs. "IndexedDB".
Another nit, re: IndexedDB: "If it was less overengineered I feel the inconsistencies between implementations would be easier to fix"
While I can agree that IDB may have been overdesigned (e.g. requiring buy-in to the monotonically increasing version/upgrade scheme), the two main sources of incompatibilities in practice are:
Neither of these seems to be caused by overengineering of the design. But then again, I suppose both would be avoided if the API had been as simple as a basic async key/value store, with everything else done in user space.
I owe myself a bit more time investigating, but an interesting addition may be Media Source Extensions.
From a high level, what's encouraging is the ability to push arbitrary bytes into a stream - building an HLS or DASH player in pure JS is a canonical example of what this would enable.
If anyone's done work in this area and can provide real info beyond my naive attempt, it may add some value to this thread.
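To make that concrete, here's a minimal sketch of pushing bytes into a `<video>` element via MSE - the segment URL and codec string here are hypothetical examples:

```js
// Minimal MSE sketch: push arbitrary bytes into a <video> element.
var video = document.querySelector('video')
var mediaSource = new MediaSource()
video.src = URL.createObjectURL(mediaSource)

mediaSource.addEventListener('sourceopen', function () {
  var sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"')
  fetch('/segments/init.mp4')                          // fetch an initialization segment
    .then(function (res) { return res.arrayBuffer() })
    .then(function (buf) { sb.appendBuffer(buf) })     // arbitrary bytes go into the pipeline
})
```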
@inexorabletash excellent points, thanks. I've updated my summary in a few places.
Re: basic async key/value store, I agree a simpler thing would have been nice and would have still enabled many use cases. For example in Node there is a very active LevelDB ecosystem that has taken the LevelDB primitive (binary key/value storage with lexicographic indexes and forward/reverse range queries) and has built a ton of powerful abstractions on top. I have a wrapper called level-js that takes IDB and wraps a LevelDB API on top (kind of ironic as it's LevelDB implemented on top of LevelDB in Chrome). I haven't personally seen many applications that use e.g. the transactions in IDB, but then again I haven't been looking for them.
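For anyone unfamiliar with that primitive, a rough sketch of the levelup-style API that level-js exposes (the database name, keys, and values here are made up):

```js
// LevelDB-style key/value store in the browser, backed by IndexedDB.
var levelup = require('levelup')
var leveljs = require('level-js')
var db = levelup('example-db', { db: leveljs })

db.put('user!001', 'max', function (err) {
  if (err) throw err
  // lexicographic range query: every key with the 'user!' prefix
  // (using '~' as a high sentinel for printable keys)
  db.createReadStream({ gte: 'user!', lt: 'user!~' })
    .on('data', function (entry) { console.log(entry.key, '=', entry.value) })
})
```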
Worth noting: PhantomJS doesn't have any support for IndexedDB prior to version 2 (which still isn't in widespread use yet for an odd assortment of reasons), while support in iOS is too minimal/buggy to use seriously*.
*IndexedDB can be polyfilled on iOS if you can use WebSQL as your backing store, but in my experience this always ends up being really buggy in practice.
Storing blobs in IDB works reasonably well, and there aren't necessarily better alternatives for that purpose (that I know of and/or are widely supported). Recent versions of Chrome/FF can store blobs directly in IndexedDB, so there is presumably some sort of optimization being applied to this process. If you want to store something that doesn't fit in LocalStorage, or persist data in a web worker/service worker, IndexedDB is pretty much the only option.
I've been benchmarking IndexedDB the last couple of days and have some notes:
- a `readwriteflush` transaction mode that lets you guarantee durability of data.
- an `isDatabaseBusy` method or something similar. This would allow applications to get essentially backpressure from LevelDB as they write data. However, this would be less important than getting a proper streaming file write API, as people then wouldn't have to store files in a database.

update: since I found the 40x improvement mentioned above by accident, I started looking into why the difference in performance exists.
https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Basic_Concepts_Behind_IndexedDB#Database says "A database connection can have several active transaction associated with it at a time, so long as the writing transactions do not have overlapping scopes. " and then later "You can still start several transactions with the same scope at the same time, but they just queue up and execute one after another.".
I'm guessing the approach that was faster ended up either using unique scopes (I'm not sure what the scope rules are, to be honest) and thus was able to have parallel writers, OR it was using non-unique scopes but for some reason the queue mentioned above somehow optimizes writes. Need to do further investigation.
edit - after reading http://w3c.github.io/IndexedDB/ it seems that 'scope' just means that each objectStore is only allowed 1 write process at a time
another update: decided to compare Chrome (47) vs FF (42)
ff serial - 379KB/s
ff parallel - 2230KB/s
chrome serial - 58KB/s
chrome parallel - 1161KB/s
The benchmarks are linked above. The serial test waits for `onsuccess` before queuing the next write; the parallel test doesn't wait for `onsuccess`.
Very interesting results. Perhaps there is a difference in the way `onsuccess` works for `readwrite` transactions between Chrome and FF, e.g. maybe one is doing a disk flush and one isn't?
edit: Upon further inspection I might be wrong about the difference between the benchmarks above -- will try to make a better isolated standalone benchmark so I can nail down the variables.
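For concreteness, here's roughly what the two strategies being compared look like - a sketch assuming an already-open `IDBDatabase` named `db` with an object store called `chunks`:

```js
// Serial: wait for each write's onsuccess before queuing the next one.
function writeSerial (db, chunks, done) {
  var i = 0
  function next () {
    if (i === chunks.length) return done()
    var store = db.transaction('chunks', 'readwrite').objectStore('chunks')
    store.put(chunks[i], i).onsuccess = function () { i++; next() }
  }
  next()
}

// Parallel: queue every write immediately and count completions.
function writeParallel (db, chunks, done) {
  var pending = chunks.length
  chunks.forEach(function (chunk, i) {
    var tx = db.transaction('chunks', 'readwrite')
    tx.objectStore('chunks').put(chunk, i)
    tx.oncomplete = function () { if (--pending === 0) done() }
  })
}
```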
Chrome flushes, FF does not.
We (Chrome) have "consider not flushing by default, allow flushing option" on our to-do list, but it requires some additional investigation.
After EdgeConf in June, I wrote up my thoughts on browser file i/o here: http://max.io/articles/the-state-of-state-in-the-browser/ (the post gives lots of background and the juice is at the bottom in the 'Proposal' section)
@binarymax nice summary of this whole webdata repo:
Node.js showed us that JavaScript can be a great choice for file + network programming if backed by non-blocking I/O under the hood. The way node.js does this is by interfacing with the OS in C++ code and exposing a JavaScript API. When Node was created it had the benefit of starting from scratch, and focused on these I/O interfaces in the beginning: the file system (`fs`), TCP sockets (`net`), and HTTP (`http`).

Each of these has a JavaScript API, e.g. `var net = require('net')`, but when you actually create a TCP socket with `net` it uses the C++ machinery under the hood in Node to create the TCP socket in your OS, and then any data that goes in and out of the socket is relayed from you to the OS and back as node `Buffer` objects (which just exist to efficiently hold binary data, something that JS couldn't do when node was created).

Under the hood, the C++ I/O interfaces were written in a non-blocking way such that a streaming JavaScript interface could be written on top of them. Streaming support just means that you can process files in real time, and process large files without crashing the process (as you aren't trying to read too much data into memory at once). Both of these qualities make node great for processing data.
I/O is important for a variety of data processing and data science use cases. Reading large files, downloading large files, uploading large files, writing large files, etc.
I/O in browser JavaScript is much more limited when it comes to working with network and file data. I'll list all I/O options in browsers today and describe their weaknesses:
HTTP (XHR)
XHR is the main HTTP client built into browsers. Similar to the `net` module in node, XHR (specifically XHR version 2) is implemented in native (C++) code in the browser and exposed as a JavaScript API.

Major flaw: Does not support streaming data.
Say you want to download a 20GB file and write a grep function that counts how many times a certain word occurs in the file. In node this would be a 5 line program (sketched at the end of this section). XHR however buffers the entire response body into a single buffer. This means your browser will eventually slow to a halt or crash as your response buffer exceeds the available amount of RAM. Also, accessing the data from XHR as it arrives only works if the XHR data is text, so pseudo-streaming binary data is totally impossible.
This is a subtle but important difference from the way node works, which is to split the response into many smaller buffers. Once your code has processed a buffer it can be garbage collected. However, there is no way to garbage collect processed buffers with XHR.
The only workaround is to use HTTP Range headers and make multiple HTTP requests for different byte ranges of the file. However, this incurs a significant performance penalty, especially for TLS connections (which have something like a 5RTT handshake), and requires the server to support Range headers (support is nowhere near universal).
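For comparison, a sketch of the node version of the grep example above (the file path and search word are made up, and it glosses over matches that span chunk boundaries):

```js
// Stream a huge file and count occurrences of a word without ever
// holding more than one chunk in memory at a time.
var fs = require('fs')
var count = 0

fs.createReadStream('/data/big-20gb-file.txt', { encoding: 'utf8' })
  .on('data', function (chunk) {
    count += chunk.split('needle').length - 1 // note: misses matches split across chunks
  })
  .on('end', function () {
    console.log('occurrences:', count)
  })
```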
HTTP (Fetch)
Fetch is a replacement for XHR/XHR2. In Chrome as of v43 it supports streaming responses! https://googlechrome.github.io/samples/fetch-api/fetch-response-stream.html. It doesn't work in Firefox yet, but is planned.
There is a 1-2 year old WHATWG initiative called Streams: https://streams.spec.whatwg.org/. It is trying to come up with a JS streaming API that can be shared across I/O interfaces in the browser. To quote this blog post: "The Streams API is important because it allows large resources to be processed in a memory efficient way." Fetch is the first (I think) thing to use it.
Issue: not widely implemented/spec isn't finished
I think Fetch and Streams are The Future(tm), but they are still being evaluated and designed. There is an opportunity for us to share data science/data processing use cases to the spec designers to make sure they end up making our lives more awesome.
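To show the shape of the API, a sketch of streaming a Fetch response along the lines of the Chrome sample linked above (the URL is hypothetical):

```js
// Read a response body incrementally; each chunk can be processed
// and garbage collected instead of buffering the whole body.
fetch('/big-file').then(function (res) {
  var reader = res.body.getReader()
  var total = 0
  function pump () {
    return reader.read().then(function (result) {
      if (result.done) return console.log('total bytes:', total)
      total += result.value.byteLength // result.value is a Uint8Array chunk
      return pump()
    })
  }
  return pump()
})
```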
WebSockets
WebSockets are a message-based protocol on top of TCP, and are supported in most browsers today. There are issues deploying websockets on networks with proxies that don't understand how to route websocket traffic, but those issues go away if you use TLS (`wss://` instead of `ws://`), as it causes the headers to get encrypted and the proxies pass the messages through (it prevents them from reading the headers and getting confused).

Two nice things about WebSockets:
You can read/write individual buffers out of them
This is in contrast to XHR, which only lets you read or write a single buffer per request. With a WebSocket, you open a connection and can write or read as many individual buffers as you want over the lifetime of the socket. This is much better for streaming large files either into or out of the browser, as the programmer can decide how the data gets sliced up into buffers.
You can transfer binary data over them
Whereas XHR is very limited in regards to binary data, WebSockets can run in 'arraybuffer' mode, which lets you read and write binary buffers without having to resort to hacks like Base64 encoding, which is very inefficient, especially for large files.
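A sketch of that pattern, assuming `file` is a `File` taken from an `<input>` and the endpoint is hypothetical:

```js
// Send a file as a series of binary messages, one 64KB slice at a time.
var ws = new WebSocket('wss://example.com/upload')
ws.binaryType = 'arraybuffer' // receive binary frames as ArrayBuffers

ws.onopen = function () {
  var chunkSize = 64 * 1024
  for (var offset = 0; offset < file.size; offset += chunkSize) {
    ws.send(file.slice(offset, offset + chunkSize)) // each slice is its own message
  }
}
```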
Issue: no backpressure
TCP has a mechanism built in for knowing when the other side of the connection is clogged, so that the person writing data can slow down. This is very important for memory efficiency in real world networking, as a single user on a slow connection without backpressure could cause a server to have to buffer lots of data in RAM waiting for the slow user to download it.
Unfortunately the WebSocket API does not expose the backpressure signals from TCP :( There is an opportunity here to request that the WebSocket API be improved. We actually got the standards bodies to fix WebRTC DataChannels for the same issue: https://github.com/feross/simple-peer/issues/39
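One partial, send-side mitigation that does exist today is polling `bufferedAmount`, though that's a workaround rather than a real backpressure signal. A sketch (the threshold and retry interval are arbitrary):

```js
// Pause writing while too many bytes are queued in the socket's buffer.
var HIGH_WATER_MARK = 1024 * 1024 // 1MB, arbitrary

function writeWithBackoff (ws, chunk) {
  if (ws.bufferedAmount > HIGH_WATER_MARK) {
    setTimeout(function () { writeWithBackoff(ws, chunk) }, 100) // back off and retry
  } else {
    ws.send(chunk)
  }
}
```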
WebRTC DataChannels
DataChannels are somewhat similar to WebSockets (in fact their design was almost copy-pasted), but instead of sitting on top of client-server TCP they sit on top of WebRTC PeerConnections, which can be direct connections between two browsers or between a browser and a server process.
They are in nearly every browser today, the notable exception being Safari.
The two main differences between DataChannels and WebSockets are:
DataChannels are always encrypted
WebSockets use TLS and the HTTPS certificate system to do encryption, but DataChannels have built-in P2P encryption (DTLS) that is turned on by default.
DataChannels have a reliable and unreliable mode
The networking machinery underneath WebRTC is too complicated to detail here, but whereas WebSockets only run on top of TCP (a reliable transport), DataChannels can be either reliable or unreliable, configured per channel.
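For illustration, a sketch of how reliability is chosen per channel at creation time (the labels and option values are examples):

```js
// Create an unreliable, unordered channel (UDP-like semantics) alongside
// the default reliable, ordered one.
var pc = new RTCPeerConnection()
var reliable = pc.createDataChannel('file-transfer') // reliable + ordered by default
var lossy = pc.createDataChannel('telemetry', {
  ordered: false,    // messages may arrive out of order
  maxRetransmits: 0  // lost messages are never retransmitted
})
```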
Issue: connection overhead
Making a WebRTC connection can be relatively slow, with a ~5RTT handshake and ICE/STUN negotiation. However, Google is apparently working on a version of DataChannels powered by their new QUIC protocol which boasts 0RTT handshakes in the best case.
Issue: no backpressure
DataChannels copied WebSockets and inherited the lack of backpressure. But we got in contact with Chrome and Firefox and got them to fix the spec! https://github.com/feross/simple-peer/issues/39
File System
Around 5 years ago there was a File System W3C specification, but as of 2014 it is abandoned and there doesn't seem to be any replacement in the works.
IMO the only good part of the File System API is `FileReader`, and it happens to be one of the only parts that was actually ever widely implemented.

Good: reading files
FileReader lets the user select a file (or multiple files), and then browser JS gets random access to read data in the file, in the chunk size the programmer specifies. This is really nice! I wrote a module that wraps this in a stream: https://github.com/maxogden/filereader-stream
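The core of that pattern looks something like this - a sketch with an arbitrary chunk size and callback shape:

```js
// Read one slice of a user-selected File as an ArrayBuffer.
function readChunk (file, offset, size, cb) {
  var reader = new FileReader()
  reader.onload = function () { cb(reader.result) } // result is an ArrayBuffer
  reader.readAsArrayBuffer(file.slice(offset, offset + size))
}

var input = document.querySelector('input[type=file]')
input.onchange = function () {
  var file = input.files[0]
  readChunk(file, 0, 64 * 1024, function (buf) {
    console.log('first chunk:', buf.byteLength, 'bytes')
  })
}
```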
Bad: writing files
However, say you want to write a file to the user's hard drive. You can ask the user to choose where they want to save the file, but the File System `saveAs()` method only lets you write a single buffer to the file, and there is no way to append to a file! This means you can only write files as large as will fit in an `ArrayBuffer` in RAM (whose length is limited to a 32-bit integer).

The lack of a streaming file write interface in the browser (and its absence from the W3C specs) is a huge issue, and needs some community championing.
IndexedDB
IndexedDB is a non-blocking key/value database available to browser JS, backed by LevelDB in Chrome and SQLite in Firefox. It's supported in most browsers.
The API is really complicated, and there are bugs between implementations in different browsers, but it is an actual database that you can actually use to store key/value data. It doesn't work for storing large data like files (for the same reasons you wouldn't store a filesystem in a MySQL table, but would use blobs instead), but works relatively well for mutable data that you want random access to. It's also non-blocking which means reading/writing to it won't cause the browser UI to freeze up.
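For reference, basic key/value usage looks like this (a minimal sketch; the database and store names are made up):

```js
// Open a database, then do one put and one get.
var open = indexedDB.open('example-db', 1)
open.onupgradeneeded = function () {
  open.result.createObjectStore('kv')
}
open.onsuccess = function () {
  var db = open.result
  var tx = db.transaction('kv', 'readwrite')
  tx.objectStore('kv').put({ hello: 'world' }, 'greeting')
  tx.oncomplete = function () {
    var req = db.transaction('kv').objectStore('kv').get('greeting')
    req.onsuccess = function () { console.log(req.result) } // { hello: 'world' }
  }
}
```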
I don't have any major criticisms of IndexedDB, other than it seems to be difficult to implement as it is a very complicated spec. If it was less overengineered I feel the inconsistencies between implementations would be easier to fix.
PHEW!
Ok that was a tour of the state of I/O in the browser as it relates to data processing/data science users.
If you have feedback, or ideas on how to improve this stuff, or questions, leave a comment below.