What data format/providers to support and how?

ellisonbg commented 7 years ago

This is an issue to begin discussing and thinking about what data sources we would want to support. There are maybe three related questions:

What data formats?
What different parts of JupyterLab (file system, notebooks) would be able to "provide" data in those formats?
What should the UX be?

For the data formats, here is a start:

CSV
JSON Table Schema (https://specs.frictionlessdata.io/table-schema/)
Vega-Lite (with embedded data, open using the Vega-Lite JSON to populate the UI)

For providers of those formats:

Filesystem (all the above)
Notebooks (MIME types for the above)
Datagrid (It already uses the JSON table schema internally)

For the UX:

Filesystem:
- Use JupyterLab's "Open With..." to enable Voyager to open these file types (already implemented)
Notebook:
- Some sort of UI that monitors notebooks in JupyterLab, watching for outputs with the needed MIME types.
- A context menu on a cell with such an output that offers "Visualize in Voyager."
Datagrid:
- Either monitor or context menu.
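
To make the "monitor notebooks for outputs with the needed MIME types" idea concrete, here is a minimal sketch. The MIME type strings, the simplified `Cell`/`CellOutput` shapes, and the helper name `findVisualizableOutputs` are assumptions for illustration only; they are not taken from JupyterLab's actual API.

```typescript
// MIME types assumed for this sketch; the thread does not name them.
const SUPPORTED_MIME_TYPES = [
  "text/csv",
  "application/vnd.dataresource+json", // JSON Table Schema (assumed MIME type)
  "application/vnd.vegalite.v2+json", // Vega-Lite spec (version suffix is a guess)
] as const;

// Simplified stand-in for a notebook cell output: a bundle keyed by MIME type.
interface CellOutput {
  data?: { [mimeType: string]: unknown };
}

interface Cell {
  outputs?: CellOutput[];
}

interface Match {
  cellIndex: number;
  outputIndex: number;
  mimeType: string;
}

// Hypothetical helper: list every output that carries a supported MIME type.
function findVisualizableOutputs(cells: Cell[]): Match[] {
  const matches: Match[] = [];
  cells.forEach((cell, cellIndex) => {
    (cell.outputs ?? []).forEach((output, outputIndex) => {
      const bundle = output.data ?? {};
      // Take the first supported type present in this output's bundle.
      const mimeType = SUPPORTED_MIME_TYPES.find((m) => m in bundle);
      if (mimeType !== undefined) {
        matches.push({ cellIndex, outputIndex, mimeType });
      }
    });
  });
  return matches;
}
```

Either UX option (a monitor panel or a per-cell context menu) could be driven by a scan like this; only the trigger differs.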

saulshanabrook commented 7 years ago

Thank you for outlining this issue.

I wonder if Apache Arrow for JS would be a good intermediary between the data formats and the providers. Hopefully this would reduce the memory footprint of large data in the browser, make computation on it more performant, and reuse the work they have already done to load data.

It is unclear to me if that package works in the browser or if it requires some Node libraries. I will look into that more. Assuming it is possible to get it working in the browser, one possible data pipeline could be:

1. You open one of the files or view a Pandas dataframe in the UI.
2. If it is already loaded into memory in the browser, skip to step 7.
3. On the server, the file is loaded into memory in Arrow (if it isn't already backed by Arrow).
4. Serialize that into the Arrow binary format.
5. Send that serialized file to the browser.
6. Parse the serialized file with Apache Arrow in the browser and store it globally.
7. Pass a reference to the Apache Arrow file to whatever provider needs it.
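
The caching part of that pipeline (the check in step 2 and the shared reference in step 7) can be sketched as follows. This is only an illustration: `TableStore`, `ParsedTable`, and the two callbacks are hypothetical names, the callbacks stand in for the server round trip and the Arrow parsing, and a real version would be asynchronous.

```typescript
// Stand-in for a parsed table; in the proposal this would be an Arrow table.
interface ParsedTable {
  byteLength: number;
}

// Hypothetical global store holding one parsed table per data source key.
class TableStore {
  private tables = new Map<string, ParsedTable>();

  // `fetchSerialized` stands in for steps 3-5 (load on server, serialize, send).
  // `parse` stands in for step 6 (parsing the bytes in the browser).
  get(
    key: string,
    fetchSerialized: (key: string) => Uint8Array,
    parse: (bytes: Uint8Array) => ParsedTable
  ): ParsedTable {
    const cached = this.tables.get(key);
    if (cached !== undefined) {
      return cached; // step 2: already in memory, skip straight to step 7
    }
    const table = parse(fetchSerialized(key));
    this.tables.set(key, table);
    return table; // step 7: every caller gets a reference to the same object
  }
}
```

Because callers share one reference per key, the same data is not parsed or stored twice when several providers ask for it.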

I am not sure where this backend code would live.

Our use case reminds me of the Plasma store:

This blog post presents Plasma, an in-memory object store that is being developed as part of Apache Arrow. Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries. In light of the trend toward larger and larger multicore machines, Plasma enables critical performance optimizations in the big data regime. [...] Expensive serialization and deserialization as well as data copying are a common performance bottleneck in distributed computing.

I don't think it would make sense to use Plasma in the frontend, but its existence suggests to me that Apache Arrow works well as an in-memory representation of immutable data that needs to be accessed by multiple different providers at once.

ellisonbg commented 7 years ago

Eventually yes, this is very much the direction we are thinking. Our plan is to create a uniform and consistent set of tabular data APIs in JupyterLab. We have a start of that in phosphor here:

https://github.com/phosphorjs/phosphor/blob/master/packages/datagrid/src/datamodel.ts

But there are a number of additional things we need to add before we can really start to depend on it for things like visualization:

Single table SQL operations (sorting, filtering rows, column selection, groupby)
More infrastructure on the server so we can build a set of standard protocols for shipping Arrow over the network and plugging into the frontend data APIs.
The ability for this to work on larger than memory data tables. The core APIs already support this, but we will need to abstract all the single table data transformations over the network as well.
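
As a rough illustration of the single-table operations listed above (sorting, filtering rows, column selection, groupby), here is a small in-memory sketch over an array of row objects. All names are hypothetical and this is not the phosphor data model API; a real version would also need to work over the network for larger-than-memory tables.

```typescript
type Value = string | number | boolean | null;
type Row = { [column: string]: Value };

// Order numbers numerically, everything else by string form; nulls go last.
function compareValues(x: Value, y: Value): number {
  if (x === y) return 0;
  if (x === null) return 1;
  if (y === null) return -1;
  if (typeof x === "number" && typeof y === "number") return x - y;
  return String(x).localeCompare(String(y));
}

// Filtering rows.
function filterRows(rows: Row[], predicate: (row: Row) => boolean): Row[] {
  return rows.filter(predicate);
}

// Sorting by one column, ascending; returns a new array.
function sortRows(rows: Row[], column: string): Row[] {
  return [...rows].sort((a, b) =>
    compareValues(a[column] ?? null, b[column] ?? null)
  );
}

// Column selection; missing columns become null.
function selectColumns(rows: Row[], columns: string[]): Row[] {
  return rows.map((row) => {
    const out: Row = {};
    for (const column of columns) {
      out[column] = row[column] ?? null;
    }
    return out;
  });
}

// A minimal groupby: count rows per distinct value of one column.
function groupCount(rows: Row[], column: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = String(row[column] ?? null);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```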

The pieces are definitely starting to fall into place for this, and I think we can start to build with this goal in mind, even if we can't start to use Arrow immediately. A good starting point would be to begin using the JSON table schema as the in-memory format for now. A few reasons for that initially:

The datagrid already supports it
Pandas supports it
It is almost trivial to convert the data to what Voyager/Vega-Lite is expecting.
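
A minimal sketch of that conversion, assuming the input has the `{ schema: { fields }, data }` shape that Pandas produces with `orient="table"` and that the target is a Vega-Lite inline data object of the form `{ values: [...] }`. The interface and function names below are made up for the example.

```typescript
interface TableSchemaField {
  name: string;
  type?: string;
}

// Assumed input shape: a schema describing the fields plus an array of rows.
interface TableSchemaData {
  schema: { fields: TableSchemaField[] };
  data: Array<{ [field: string]: unknown }>;
}

// Vega-Lite inline data: rows as an array of objects under `values`.
interface InlineData {
  values: Array<{ [field: string]: unknown }>;
}

// Hypothetical converter: keep only the fields the schema declares.
function tableSchemaToInlineData(table: TableSchemaData): InlineData {
  const names = table.schema.fields.map((field) => field.name);
  const values = table.data.map((row) => {
    const out: { [field: string]: unknown } = {};
    for (const name of names) {
      out[name] = row[name] ?? null; // fields missing from a row become null
    }
    return out;
  });
  return { values };
}
```

The rows are already an array of objects, so the conversion is mostly a matter of dropping the schema wrapper.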

But, this does clarify that there are actually two questions here:

What in-memory (in JupyterLab) format to use?
What input (from file system or notebook output) formats to support?

saulshanabrook commented 7 years ago

We just opened a number of issues to lay out the different possible ways Voyager can be opened. I propose we break up "opening Voyager with some data" into two different parts:

Choose data in JupyterLab and convert it to an intermediate data format.
Pass the intermediate data format into Voyager to start it.

This intermediate data format could be any of (from most to least specified):

Full Redux state of Voyager
Vega-Lite specification
Tabular inline data: an array of objects, where each object is a row in the table (corresponds to InlineData in Vega-Lite).

Parsing the data before passing it into Voyager gives us flexibility in the future to potentially cache or deduplicate the data in memory.
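
One way to picture the three candidate intermediate formats is as a tagged union. This is only a sketch: the `kind` tags, the type names, and the `rowsOf` helper are invented for illustration, and the Redux state is left opaque.

```typescript
type TableRow = { [field: string]: unknown };

// The three candidate formats, from most to least specified.
type VoyagerInput =
  | { kind: "redux-state"; state: unknown }
  | {
      kind: "vega-lite-spec";
      spec: { data?: { values?: TableRow[] }; [key: string]: unknown };
    }
  | { kind: "inline-data"; values: TableRow[] };

// Hypothetical helper: pull the tabular rows out of an input when they are
// directly available, for example to cache or deduplicate them.
function rowsOf(input: VoyagerInput): TableRow[] | undefined {
  switch (input.kind) {
    case "inline-data":
      return input.values;
    case "vega-lite-spec":
      return input.spec.data?.values; // only specs with embedded data have rows
    case "redux-state":
      return undefined; // treated as opaque in this sketch
  }
}
```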

saulshanabrook commented 7 years ago

Actually, we will instead try to pass inputs in as close to their input formats as possible and let Voyager parse them properly. Voyager supports the Data interface in Vega-Lite as well as Vega-Lite specs. We won't worry about importing Voyager Redux state for now, since the format of that is still in flux.

altair-viz / jupyterlab_voyager

What data format/providers to support and how? #6