Quansight / intake-omnisci

Intake-OmniSci: OmniSci Plugin for Intake
Apache License 2.0

Design issue for further development #7

Open ian-r-rose opened 5 years ago

ian-r-rose commented 5 years ago

Based on a conversation with @niviksha and @dharhas on Slack.

The current state

The main focus of this intake driver thus far has been on consuming data that has already been loaded into an OmniSci database. The general flow looks like

cat  = intake.open_catalog('catalog.yml')
expr = cat.table_name.to_ibis()

and then you proceed with the ibis expression to assemble your query. This is fairly idiomatic intake, and is already well supported.
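For example (the table and column names here are hypothetical), the ibis expression can be filtered and aggregated lazily, with the query running inside OmniSciDB:

filtered = expr[expr.passenger_count > 2]
summary = filtered.group_by("payment_type").aggregate(
    total_fare=filtered.fare_amount.sum()
)
df = summary.execute()  # executes in the database, returns a pandas DataFrame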

OmniSci as an alternative container

The main container types for intake are dask and pandas dataframes. We have recently added what amounts to an ibis container type to support the above workflow. Here, we would like to push that a bit further and make OmniSciDB a container type of its own (allowing users to interact with it via ibis). The workflow would look something like

  1. The user has some files in an intake catalog (e.g. CSV, Parquet, GeoJSON)
  2. Loading the files would ingest them into an OmniSci database, rather than reading them with pandas/dask.
  3. The user then gets an ibis handle on the database table.

The pseudocode for this interaction would look almost identical to the above

cat  = intake.open_catalog('catalog.yml')
expr = cat.csv_name.to_ibis()

The difference is that the CSV (or whatever) is loaded into the database first (probably using COPY FROM), and then you get your ibis handle. The load step could be time-consuming, but perhaps better than reading the data into memory.
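A rough sketch of what the load-then-connect step might look like inside such a driver (the helper name is made up, the exact pymapd/ibis calls would need checking, and COPY FROM assumes the target table already exists with a matching schema):

import ibis
import pymapd

def _load_csv_to_omnisci(csv_path, table_name, **con_kwargs):
    # Ingest server-side so the file never passes through pandas/dask.
    con = pymapd.connect(**con_kwargs)
    con.execute(f"COPY {table_name} FROM '{csv_path}' WITH (header='true')")
    # Hand back an ibis table expression, as the existing driver does.
    client = ibis.omniscidb.connect(**con_kwargs)
    return client.table(table_name)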

Things to consider

A few outstanding questions in my mind:

  1. How robustly can we handle various formats? There are lots of corner cases with CSV, Shapefile, etc, that can be tough to deal with. CSV loaders typically have lots of options for dealing with these. Can we do this well enough that we could cover most cases?
  2. With this flow, there is a danger of multiple sources-of-truth, one from the file, and one from the database table. What happens when the user wants to re-load the data? Do we replace the old table? My instinct is that we should, but it might be pretty expensive, and there is a danger of data loss...
niviksha commented 5 years ago

thanks @ian-r-rose for getting this rolling!

Here's my $0.02 on the questions.

  1. Agreed, some of this depends on the robustness of the OmniSci load paths for CSV etc. Luckily, these are fairly proven (there will always be edge cases, but we are encountering and solving for these in the core DB development path anyway). Our CSV loader is fairly robust and performant (it can easily handle 4MM rows/sec in ideal circumstances, and is being worked on to make it even faster).

  2. This is a great point. Some of it comes down to the semantics of a database vs. a transient in-memory analytical data structure. Because we need to handle persistence as a first-class idea, we already deal with both update and append scenarios. For example, by default a load table invocation via pymapd triggers an append, and as a database engine we have explicit commands to drop/truncate/delete etc. So I'd hope that over time we could offer these existing operations as configurable options at load time (e.g. append by default, truncate and reload, etc.). Further, we're headed down the path of natively reading on-disk formats without explicitly needing to import them: with Parquet (and Arrow more immediately) we are working on treating them as externally registered containers, so this will likely become even less of a friction point in the next few months.
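For instance (a rough sketch; the connection parameters are placeholders and df is assumed to be a pandas DataFrame), the append vs. truncate-and-reload behaviours might look like:

import pymapd

con = pymapd.connect(user="admin", password="...", host="localhost", dbname="omnisci")

# Default behaviour: load_table appends rows to an existing table.
con.load_table("trips", df)

# An explicit truncate-and-reload, which the driver could expose as an option.
con.execute("TRUNCATE TABLE trips")
con.load_table("trips", df)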

dharhas commented 5 years ago

So part of why intake is successful is that it leverages some fairly standardized data structures and readers in Python land, e.g. xarray. Would the plan be to support reading all of those data types here, or just the CSV type?

It seems to me that if you want to avoid serializing anything into in-memory python data structures then the emphasis should be on recognizing which intake loaders are in formats consumable by the OmnisciDb data loaders and then just calling those.

In some sense this is more about the omniscidb data loaders understanding the intake catalog format than any actual code in intake.

niviksha commented 5 years ago

@dharhas @ian-r-rose a reference for the file formats OmniSci can ingest is here:

https://docs.omnisci.com/latest/6_loading_data.html#copy-from

@dharhas agreed. Our first goal is to avoid building an alternate catalog infrastructure. If we can do the following as a first pass, it would be great:

1) Point the OmniSci loader workflow at an intake catalog server, purely with the intent to use the catalog as a, well, catalog of data sources available.

2) In the first pass, OmniSci can simply consume the small subset of intake formats that OmniSci already supports.

Normally we'd do this via COPY FROM on the CLI (or a Python wrapper around its underlying Thrift call). Mostly these are delimited [+compressed] formats like csv(.gz) etc. I think an MVP would be minimal new development: a loader that wraps the COPY FROM API. To @ian-r-rose's earlier point, this could be hidden under a to_ibis call.
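A very rough sketch of that first pass (the intake introspection details here are approximate and would need checking against the catalog API):

import intake
import pymapd

cat = intake.open_catalog("catalog.yml")
con = pymapd.connect(user="admin", password="...", host="localhost", dbname="omnisci")

# Use the catalog purely as a listing of sources; hand plain CSV urlpaths
# straight to COPY FROM and skip formats OmniSci can't ingest directly.
for name in cat:
    urlpath = cat[name].describe().get("args", {}).get("urlpath")
    if urlpath and urlpath.endswith((".csv", ".csv.gz")):
        # Assumes a table with a matching schema already exists.
        con.execute(f"COPY {name} FROM '{urlpath}' WITH (header='true')")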

Over time, it would be good to write loaders for more (within reason) of the plugin formats that intake supports.