altair-viz / jupyterlab_voyager

JupyterLab extension to visualize data with Voyager
BSD 3-Clause "New" or "Revised" License
What data format/providers to support and how? #6

Open ellisonbg opened 7 years ago

ellisonbg commented 7 years ago

This is an issue to begin discussing and thinking about what data sources we would want to support. There are maybe three related questions:

For the data formats, here is a start:

For providers of those formats:

For the UX:

saulshanabrook commented 7 years ago

Thank you for outlining this issue.

I wonder if Apache Arrow for JS would be a good intermediary between the data formats and the providers. Hopefully this would reduce the memory footprint of large data in the browser, make computation on it more performant, and reuse the work they have already done to load data.

It is unclear to me if that package works in the browser or if it requires some Node libraries. I will look into those more. Assuming it is possible to get it working in the browser, one possible data pipeline could be:

  1. You open one of the files or view a Pandas dataframe in the UI.
  2. If it is already loaded into memory in the browser, skip to step 7.
  3. On the server, the file is loaded into memory in Arrow (if it isn't already backed by Arrow).
  4. Serialize that into the Arrow binary format.
  5. Send that serialized file to the browser.
  6. Parse the serialized file with Apache Arrow in the browser, store it globally.
  7. Pass a reference to the Apache Arrow file to whatever provider needs it.
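The caching part of this pipeline (steps 2, 6, and 7) can be sketched roughly as below. This is only an illustration under assumptions: `Table`, `Loader`, and `TableStore` are hypothetical names, `Table` stands in for whatever the Arrow JS parser would return, and the loader is synchronous to keep the example short, whereas a real one would fetch from the server asynchronously.

```typescript
// Hypothetical stand-in for a parsed Arrow table held in browser memory.
type Table = { columns: string[]; rows: unknown[][] };

// A loader covers steps 3-6: the server serializes the data, the browser
// receives and parses it. Here it is just a function returning a Table.
type Loader = (id: string) => Table;

// Hypothetical global store of parsed tables, keyed by a dataset id.
class TableStore {
  private tables = new Map<string, Table>();

  // Step 2: if the table is already in memory, skip straight to step 7.
  // Otherwise run the loader once (steps 3-6) and remember the result.
  get(id: string, load: Loader): Table {
    const cached = this.tables.get(id);
    if (cached !== undefined) {
      return cached;
    }
    const table = load(id);
    this.tables.set(id, table);
    return table;
  }
}

// Usage: two providers asking for the same dataset share one parsed copy.
const loadedIds: string[] = [];
const fakeLoader: Loader = (id) => {
  loadedIds.push(id); // record each trip to the "server"
  return { columns: ["x"], rows: [[1], [2]] }; // made-up data
};

const store = new TableStore();
const forVoyager = store.get("example.csv", fakeLoader);
const forGrid = store.get("example.csv", fakeLoader);
```

Both consumers receive a reference to the same object (step 7), and the loader runs only once for a given id.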

I am not sure where this backend code would live.


Our use reminds me of the Plasma store:

This blog post presents Plasma, an in-memory object store that is being developed as part of Apache Arrow. Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries. In light of the trend toward larger and larger multicore machines, Plasma enables critical performance optimizations in the big data regime. [...] Expensive serialization and deserialization as well as data copying are a common performance bottleneck in distributed computing.

I don't think it would make sense to use Plasma in the frontend, but its existence suggests to me that Apache Arrow works well as an in-memory representation of immutable data that needs to be accessed by multiple different providers at once.

ellisonbg commented 7 years ago

Eventually yes, this is very much the direction we are thinking. Our plan is to create a uniform and consistent set of tabular data APIs in JupyterLab. We have a start of that in phosphor here:

https://github.com/phosphorjs/phosphor/blob/master/packages/datagrid/src/datamodel.ts

But there are a number of additional things we need to add before we can really start to depend on it for things like visualization:

The pieces are definitely starting to fall into place for this, and I think we can start to build with this goal in mind, even if we can't start to use Arrow immediately. A good starting point would be to begin using the JSON table schema as the in-memory format for now. A couple of reasons for that initially:
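For reference, a rough sketch of what data in that shape could look like in the browser. It assumes the Table Schema layout of a `schema` with `fields` plus a `data` array of row objects, which is the layout pandas emits with `to_json(orient="table")`. The interface names, the `column` helper, and the sample rows are made up for illustration.

```typescript
// A subset of the field types defined by the Table Schema spec.
type FieldType = "string" | "number" | "integer" | "boolean" | "datetime";

interface Field {
  name: string;
  type: FieldType;
}

// Hypothetical name for the in-memory shape: a schema plus row objects.
interface TableSchemaData {
  schema: { fields: Field[]; primaryKey?: string[] };
  data: Array<Record<string, unknown>>;
}

// Made-up sample rows, only to show the layout.
const example: TableSchemaData = {
  schema: {
    fields: [
      { name: "index", type: "integer" },
      { name: "label", type: "string" },
      { name: "value", type: "number" },
    ],
    primaryKey: ["index"],
  },
  data: [
    { index: 0, label: "a", value: 1.5 },
    { index: 1, label: "b", value: 2.5 },
  ],
};

// Read one column out by name, as a chart or grid consumer might.
function column(table: TableSchemaData, name: string): unknown[] {
  if (!table.schema.fields.some((f) => f.name === name)) {
    throw new Error(`unknown field: ${name}`);
  }
  return table.data.map((row) => row[name]);
}

const values = column(example, "value");
```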

But, this does clarify that there are actually two questions here:

saulshanabrook commented 7 years ago

We just opened a number of issues to lay out the different possible ways Voyager can be opened. I propose we break up "opening voyager with some data" into two different parts:

This intermediate data format could be any of (from most to least specified):

Parsing the data before passing it into Voyager gives us flexibility in the future to potentially cache or deduplicate the data in memory.

saulshanabrook commented 7 years ago

Actually, instead we will try to pass inputs in as close to their original formats as possible and let Voyager parse them properly. Voyager supports the Data interface in Vega-Lite as well as Vega-Lite specs. We won't worry about importing Voyager redux state for now, since the format of that is still in flux.
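A minimal sketch of what "as close to the input format as possible" could mean in practice, assuming the inline (`values`) and URL (`url` plus optional `format`) forms of Vega-Lite data. The `Input` type and `toVegaLiteData` function are hypothetical, and the type declarations are small local copies rather than the real typings from the vega-lite package.

```typescript
// Minimal local copies of two Vega-Lite data shapes.
type FormatType = "json" | "csv" | "tsv";
type InlineData = { values: Array<Record<string, unknown>> };
type UrlData = { url: string; format?: { type: FormatType } };
type VLData = InlineData | UrlData;

// Hypothetical description of what the extension might be handed.
type Input =
  | { kind: "rows"; rows: Array<Record<string, unknown>> }
  | { kind: "file"; path: string };

const KNOWN_FORMATS = new Map<string, FormatType>([
  ["json", "json"],
  ["csv", "csv"],
  ["tsv", "tsv"],
]);

// Wrap the input without parsing it, so the consumer does the parsing.
function toVegaLiteData(input: Input): VLData {
  if (input.kind === "rows") {
    return { values: input.rows };
  }
  const dot = input.path.lastIndexOf(".");
  const ext = dot >= 0 ? input.path.slice(dot + 1).toLowerCase() : "";
  const type = KNOWN_FORMATS.get(ext);
  if (type !== undefined) {
    return { url: input.path, format: { type } };
  }
  // Unknown extension: pass the URL through and let the consumer decide.
  return { url: input.path };
}

const fromFile = toVegaLiteData({ kind: "file", path: "data/example.CSV" });
const fromRows = toVegaLiteData({ kind: "rows", rows: [{ x: 1 }] });
```

Nothing is parsed here, so any caching or deduplication would have to happen elsewhere.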