Closed pzwang closed 7 years ago
First a definition: this is all about improving the support for typed homogeneous array data. Let's call this a typed-array, which means e.g. Float32Array in JS, and a Numpy array in Python.
We need a format:
I can see three options now:
A: use JSON
We can write a Python and JS "extension" to store typed-arrays as a dict, e.g.: `{"__array__": "LONG_STRING_OF_BASE64", "size": 512, "dtype": "float32"}`. The `__array__` key is a special marker that can be recognised by a decoder, so the field can be decoded as a typed array. The advantage is that we still use JSON, which is what people are familiar with. The disadvantage is that it's not very efficient, since the arrays need to be encoded/decoded in base64, which takes time and space.
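A minimal sketch of what option A could look like on the Python side (the key names follow the example above; the JS side would mirror this with `atob` and a `Float32Array` view over the decoded bytes):

```python
import base64
import json

import numpy as np

def encode_array(arr):
    # Store the raw bytes as base64 inside a dict with the special marker key.
    return {"__array__": base64.b64encode(arr.tobytes()).decode("ascii"),
            "size": arr.size,
            "dtype": str(arr.dtype)}

def decode_hook(d):
    # json.loads object_hook that turns marked dicts back into numpy arrays.
    if "__array__" in d:
        raw = base64.b64decode(d["__array__"])
        return np.frombuffer(raw, dtype=d["dtype"])
    return d

# round trip
arr = np.arange(4, dtype=np.float32)
text = json.dumps({"xs": encode_array(arr)})
out = json.loads(text, object_hook=decode_hook)["xs"]
```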
B: use UBJSON We can (mis)use UBJSON's feature of efficiently storing homogeneous arrays to store typed-arrays. The advantage is that it's binary and fast, using an existing protocol (though we do have to modify the parsers a bit). The disadvantage is that we're using UBJSON in a way it's not really intended for. It would also be limited to 1D arrays.
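For reference, option B would lean on UBJSON's optimized (strongly-typed) container. A rough sketch of what the encoder would emit, based on my reading of the spec (`d` is the float32 type marker, `l` an int32 count marker, and all UBJSON numbers are big-endian):

```python
import struct

import numpy as np

def ubjson_float32_array(arr):
    # UBJSON optimized container: '[' '$' <type> '#' <count-type> <count> <payload>
    assert arr.dtype == np.float32 and arr.ndim == 1
    header = b"[$d#l" + struct.pack(">i", arr.size)
    return header + arr.astype(">f4").tobytes()  # payload is big-endian
```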
C: roll our own format We could specify a new format with full support for typed arrays (and giving up JSON compatibility). If we derive it from UBJSON (which has a very smart and simple spec IMO) it would not be a huge task. The advantage is that we get exactly what we need. The disadvantage is that we would add yet another competing format.
details below
The problem with most of these is that we need something more than json. This makes these formats not very suitable, except perhaps ubjson, for which we could hijack the support for storing homogeneous arrays more efficiently as a means to store typed-arrays.
We could "extend" json by storing typed-arrays as base64 encoded strings. If we have a special encoder/decoder on both the JS and Python side, numpy arrays could transparently become typed JS arrays. See e.g. http://stackoverflow.com/a/24375113/2271927 and EJSON.
Since we're interested in storing data, we quickly end up in the more scientific formats, which are generally rather complex. Should we consider coming up with something ourselves?
cc @bryevdv we already had an issue for this. Will start looking into this next week.
I looked into compression. Not necessary per se, but it would be nice to reduce the size of transfers and exported documents. Unfortunately, you'd need a third-party library, and these are all pretty big.
Examples for compression schemes that are also built into Python:
Other examples (would need something on the Python side):
Ok, I think I've done enough googling and reading through format specs for now. I need some help moving further. In my first post of this issue I put together an overview and I propose three options. cc @bryevdv @pzwang
@almarklein An idea that has been bandied about was to remove actual data from ColumnDataSource, and instead have data sources be lightweight objects that are configured with a "remote" actual data store. This could be a URL to a REST endpoint, or a Blaze server. Or, it could be a reference to a "local remote" data store that lives in the browser but is separate from all the other Bokeh models. I think if we did this kind of separation of the actual data payload from the lightweight data source model, it would allow all the normal Bokeh objects to remain simple plain JSON, and then just the data columns could be transmitted separately in an enhanced JSON, or non-JSON format. I like the idea of making all the Bokeh models "lightweight" on its own, but if it would also help preserve the simple JSON representation for the majority of things, that would be another big point in favor for me. Thoughts @bokeh/core
Other comments: I'm OK with 1d only, if we want to do something simple like storing a simple shape in some conventional way, that would be ok too. Almost all the use-cases in bokeh are around tabular columns so that is what we should optimize for.
If we separate the data, as you suggest, there is no longer a need for a structured data format, so we could probably do with something simpler in that case. I'm interested in hearing more about this ...
@almarklein we should probably have a call sometime soon. My current plan is to implement a binary protocol over web sockets. Doing this, it seems possible to send NumPy/Pandas data directly into a JS ArrayBuffer, which can have a typed array view on it without any copying. I'd like to get more input from people (and possibly help as well).
To add a little more: I intend to make the wire-protocol an implementation detail, so that we can change things later if we need to. For instance, msgpack has had some integration into blaze-server, so it might make sense to look at. But for now I am just going to do arr.tobytes() along with a header that has type/shape info, which works perfectly fine.
If we're storing/sending the data separately, we only need to store a blob, a shape and a type, so a simple dedicated format is fine IMO. Msgpack seems overkill unless we want to send the data along with all the model stuff that we now send via JSON.
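For illustration, a "blob + shape + type" container really can be that small. This is a hypothetical layout (magic, dtype code, ndim, shape, then the raw bytes), not a proposal for the actual wire format:

```python
import struct

import numpy as np

# Hypothetical dtype codes for this sketch.
DTYPES = {0: "float32", 1: "float64", 2: "int32"}
CODES = {v: k for k, v in DTYPES.items()}

def pack(arr):
    # Header: 4-byte magic, 1-byte dtype code, 1-byte ndim, then ndim uint32 dims.
    header = struct.pack("<4sBB", b"ARR0", CODES[str(arr.dtype)], arr.ndim)
    header += struct.pack("<%dI" % arr.ndim, *arr.shape)
    return header + arr.tobytes()

def unpack(blob):
    magic, code, ndim = struct.unpack_from("<4sBB", blob, 0)
    assert magic == b"ARR0"
    shape = struct.unpack_from("<%dI" % ndim, blob, 6)
    data = np.frombuffer(blob, dtype=DTYPES[code], offset=6 + 4 * ndim)
    return data.reshape(shape)
```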
I assume we can use the same format to store data in static HTML (but base64 encoded)?
Interesting stuff... @bryevdv, do you have a branch where you prototyped this?
@bryevdv @almarklein I'm working out exactly the same data exchange problem for interoperability between numpy and weblas. I came to the same conclusion that you guys did: bytes with type and shape should be sufficient.
There aren't a ton of ways to do that, but I'd like to be compatible with you guys from the start. Do you have code or a simple spec you can share?
I haven't worked anything specific out yet, past just the "proof of concept", which was nothing more than sending arr.tobytes() over a web socket as a binary message. But I'm certainly open to any discussion.
@damianavila I found it, it's not much:
```python
from __future__ import print_function

from flask import Flask, render_template
from tornado.wsgi import WSGIContainer
from tornado.web import Application, FallbackHandler
from tornado.websocket import WebSocketHandler
from tornado.ioloop import IOLoop
import numpy as np

arr = np.arange(10, dtype=np.float32)
arr_bytes = arr.tobytes()

meta = {
    "size": len(arr_bytes),
    "shape": list(arr.shape),
    "type": "float32",
}

class WebSocket(WebSocketHandler):

    def open(self):
        print("Socket opened.")

    def on_message(self, message):
        # Reply with a sentinel, then the metadata, then the raw bytes.
        self.write_message("\0")
        self.write_message(meta)
        self.write_message(arr_bytes, binary=True)

    def on_close(self):
        print("Socket closed.")

app = Flask('flasknado')

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == "__main__":
    container = WSGIContainer(app)
    server = Application([
        (r'/array/', WebSocket),
        (r'.*', FallbackHandler, dict(fallback=container)),
    ])
    server.listen(8080)
    IOLoop.instance().start()
```
and then something like this:
```javascript
/* Client-side component for the Flasknado! demo application. */
var socket = null;
var state = 0;
var header = null;
var array = null;

$(document).ready(function() {
    socket = new WebSocket("ws://" + document.domain + ":8080/array/");
    socket.binaryType = 'arraybuffer';

    socket.onopen = function() {
        socket.send("Joined");
    };

    // Small state machine: sentinel, then header, then binary payload.
    socket.onmessage = function(message) {
        if (state == 0 && message.data == "\0") {
            state = 1;
        }
        else if (state == 1) {
            header = message.data;
            state = 2;
        }
        else if (state == 2) {
            array = new Float32Array(message.data);
            state = 0;
            debugger;
        }
    };
});

function submit() {
    var text = $("input#message").val();
    socket.send(text);
    $("input#message").val('');
}
```
Looks like datashader could use this as well, if it would reduce JSON comm overhead: https://github.com/bokeh/datashader/issues/49#issuecomment-181450958 @brendancol @philippjfr
@bryevdv thanks! that's about where I am too (though I'm serializing to disk and serving with http-server). Working on something simple (based on npy) to augment this. Will keep you guys in the loop, if you're interested.
here's my client side code
```javascript
var xhr = new XMLHttpRequest();
var data = null;
xhr.open("GET", "arr.buf", true);
xhr.responseType = "arraybuffer";

xhr.onload = function (e) {
    var arrayBuffer = xhr.response;  // Note: not xhr.responseText
    if (arrayBuffer) {
        data = new Float32Array(arrayBuffer);
    }
};

xhr.send(null);
```
and here's the snippet for serializing
```python
# given array 'a'
f = open('./arr.buf', 'wb')
f.write(a.astype(np.float32).tostring())
f.close()
```
@datnamer in my tests, serializing (float32) to disk as bytes (instead of JSON) reduces the size to 1/5. Very significant for me.
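The ~5x figure is easy to sanity-check: a float32 is 4 bytes raw, while its JSON text form is typically 15-20 characters. A quick measurement, assuming random floats in [0, 1):

```python
import json

import numpy as np

arr = np.random.uniform(0, 1, 10000).astype(np.float32)
raw_size = len(arr.tobytes())              # 4 bytes per value
json_size = len(json.dumps(arr.tolist()))  # ~15-20 chars per value
print(json_size / raw_size)                # roughly 4-5x on data like this
```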
Size is an issue (though there are probably cases where the size actually increases, e.g. an array of small ints), but being able to skip the encoding entirely and get the data into a typed array view directly is another huge benefit. I should also clarify, we do have a higher level protocol for Bokeh that allows for multipart messages. My intent was to send each buffer as a separate message part to avoid unnecessary copying. It's this "just for the array" part of the protocol that has not been fleshed out. Any input is certainly very welcome.
Thanks. I'm just beginning to feel my way around the space. It's great to hear how other smart people have solved the problem.
I like your point about just getting the data into a typed array as quickly as possible. I was also considering a separate descriptor file as an option for just that reason. Another option I like a lot (for the Ajax/HTTP case) is custom headers. Maybe using a custom mime type and an extra header field for shape. It would be great if this could be made to play well with caching, so that reshaping the same data didn't trigger a new download.
Love to hear thoughts on these ideas.
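Since the thread is already using Flask, the custom-header idea might look something like this (the mime type and the `X-Array-*` header names are made up for illustration):

```python
from flask import Flask, Response
import numpy as np

app = Flask(__name__)
arr = np.arange(12, dtype=np.float32).reshape(3, 4)

@app.route("/array")
def array_view():
    # Body is the raw little-endian bytes; dtype and shape travel as headers,
    # so the same cached payload could be reshaped without a new download.
    resp = Response(arr.tobytes(), mimetype="application/x-typed-array")
    resp.headers["X-Array-Dtype"] = str(arr.dtype)
    resp.headers["X-Array-Shape"] = ",".join(str(n) for n in arr.shape)
    return resp
```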
Would be really great to hear about progress in this space. Am trying to prepare some stuff for publication and Bokeh is going to play a part in that. However, have noticed how big these HTML files are and performance is something we are seeking to improve.
Using something like bson, which can be serialized easily between pure Python or the Python Mongo API and JavaScript, as well as go straight into MongoDB, seems pretty nice all around given what you want to achieve. If the format is too flexible, I suppose one can restrict oneself to the relevant subset that will work. Though maybe there are other constraints that I'm unaware of.
While compression is definitely a laudable goal, my recommendation would be to think about it after choosing a binary format that works. Compression is always a game of trade-offs and what one person is willing to give up another might not. So perhaps having a simple plugin interface for different compression options would be valuable to avoid being too attached to a particular one. Though I would note one general constraint that seems to be important to Bokeh (being an interactive data visualization program) is speed. If too much time is spent doing decompression, it can hurt user experience.
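A plugin interface for compression needn't be complicated: something like a name-to-codec registry, with the chosen name carried in the message header, would keep the format from being married to any one scheme. A sketch using only stdlib codecs (the framing here is invented for illustration):

```python
import bz2
import lzma
import zlib

# Registry mapping a codec name to (compress, decompress) callables.
CODECS = {
    "none": (lambda b: b, lambda b: b),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2":  (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def pack(payload, codec="zlib"):
    # Prefix the body with the codec name so the receiver can pick the decoder.
    compress, _ = CODECS[codec]
    return codec.encode() + b"\0" + compress(payload)

def unpack(blob):
    name, _, body = blob.partition(b"\0")
    _, decompress = CODECS[name.decode()]
    return decompress(body)
```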
Also Numscrypt may be worth looking at as far as having an array in JavaScript/TypeScript.
This is a little orthogonal to the serialization issue. However, scijs provides support for ndarrays in JavaScript, along with a host of functions to work with and compute things from them. Probably worth a look at least.
Also numjs, which builds on scijs, may give a more NumPy-like feeling when working in JavaScript.
some additional info in this experimental PR: https://github.com/bokeh/bokeh/pull/5429
Basically, using a simple base64 encode seems to give a ~3x improvement over non-websocket type renders, and a 14x speedup over push-notebook. So I think we will just start with a base64 approach; the trick is making it work completely over all the different possible ways to embed and transmit things. I think there will need to be some comprehensive work starting from the lowest-level encoders, and also some consolidation of how push_notebook works.
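For the browser side of the base64 approach, decoding a column back into a typed array without parsing any numbers as text is straightforward (a sketch; Bokeh's actual encoders may differ):

```javascript
function b64ToFloat32Array(b64) {
    // atob gives a binary string; copy its char codes into an ArrayBuffer,
    // then view that buffer as float32 without any further copying.
    var raw = atob(b64);
    var buf = new ArrayBuffer(raw.length);
    var bytes = new Uint8Array(buf);
    for (var i = 0; i < raw.length; i++) {
        bytes[i] = raw.charCodeAt(i);
    }
    return new Float32Array(buf);
}
```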
There are still possibilities of exploring other encodings, or multi-part messages in the context of the server. But the work in #5544 provides a clear improvement, and also sets a foundation. Future work should have new issues.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/DataView