ellisonbg opened this issue 7 years ago
Thank you for outlining this issue.
I wonder if Apache Arrow for JS would be a good intermediary between the data formats and the providers. Hopefully this would reduce the memory footprint of large data in the browser, make computation on it more performant, and reuse the work they have already done to load data.
It is unclear to me whether that package works in the browser or whether it requires some Node libraries. I will look into that more. Assuming it is possible to get it working in the browser, one possible data pipeline could be:
I am not sure where this backend code would live.
Our use case reminds me of the Plasma store:
This blog post presents Plasma, an in-memory object store that is being developed as part of Apache Arrow. Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries. In light of the trend toward larger and larger multicore machines, Plasma enables critical performance optimizations in the big data regime. [...] Expensive serialization and deserialization as well as data copying are a common performance bottleneck in distributed computing.
I don't think it would make sense to use Plasma in the frontend, but its existence suggests to me that Apache Arrow works well as an in-memory representation of immutable data that needs to be accessed by multiple different providers at once.
Eventually yes, this is very much the direction we are thinking. Our plan is to create a uniform and consistent set of tabular data APIs in JupyterLab. We have a start of that in phosphor here:
https://github.com/phosphorjs/phosphor/blob/master/packages/datagrid/src/datamodel.ts
But there are a number of additional things we need to add before we can really start to depend on it for things like visualization:
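For readers who haven't looked at the linked file, here is a simplified, self-contained sketch of the kind of tabular API the phosphor `DataModel` defines (the method names mirror the source, but the concrete `JSONDataModel` class below is hypothetical):

```typescript
// Simplified regions, modeled on the phosphor datagrid's cell regions.
type RowRegion = "body" | "column-header";
type ColumnRegion = "body" | "row-header";

// Abstract tabular model: consumers ask for counts and individual cells,
// so the backing store (JSON, Arrow, etc.) is an implementation detail.
abstract class TableModel {
  abstract rowCount(region: RowRegion): number;
  abstract columnCount(region: ColumnRegion): number;
  abstract data(row: number, column: number): unknown;
}

// Hypothetical concrete model backed by an array of JSON records.
class JSONDataModel extends TableModel {
  constructor(private records: Record<string, unknown>[],
              private fields: string[]) {
    super();
  }
  rowCount(region: RowRegion): number {
    return region === "body" ? this.records.length : 1;
  }
  columnCount(region: ColumnRegion): number {
    return region === "body" ? this.fields.length : 1;
  }
  data(row: number, column: number): unknown {
    return this.records[row][this.fields[column]];
  }
}
```

The point of the abstraction is that an Arrow-backed model could later implement the same interface without any consumer changing.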
The pieces are definitely starting to fall into place for this, and I think we can start to build with this goal in mind, even if we can't start to use Arrow immediately. A good starting point would be to begin using the JSON table schema as the in-memory format for now. A couple of reasons for that initially:
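For concreteness, a JSON table schema payload (a `schema` with a `fields` array of `{name, type}` descriptors, alongside the row data) looks roughly like the following sketch; the column names are invented for illustration:

```typescript
// A minimal schema-plus-data payload in the JSON table schema style.
// The columns and values here are invented for illustration.
const payload = {
  schema: {
    fields: [
      { name: "country", type: "string" },
      { name: "year", type: "integer" },
      { name: "gdp", type: "number" },
    ],
    primaryKey: ["country", "year"],
  },
  data: [
    { country: "US", year: 2017, gdp: 19.5 },
    { country: "DE", year: 2017, gdp: 3.7 },
  ],
};

// A provider can discover column names and types from the schema alone,
// without scanning the rows.
const columns = payload.schema.fields.map(f => f.name);
```

This keeps the column metadata explicit and self-describing, which is exactly what a grid or visualization provider needs up front.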
But, this does clarify that there are actually two questions here:
We just opened a number of issues to lay out the different possible ways Voyager can be opened. I propose we break up "opening voyager with some data" into two different parts:
This intermediate data format could be any of (from most to least specified):
InlineData (in vega lite).

Parsing the data before passing it into Voyager gives us flexibility in the future to potentially cache or deduplicate the data in memory.
Actually, we will instead try to pass inputs in as close to their input formats as possible and let Voyager parse them properly. Voyager supports the Data interface in vega lite as well as vega-lite specs. We won't worry about importing Voyager redux state for now, since the format of that is still in flux.
This is an issue to begin discussing and thinking about what data sources we would want to support. There are maybe three related questions:
For the data formats, here is a start:
For providers of those formats:
For the UX: