Open · ian-r-rose opened 5 years ago
Thanks @ian-r-rose for getting this rolling!
Here's my $0.02 on the questions.
Agreed, some of this depends on the robustness of the OmniSci load paths for CSV etc. Luckily, these are quite proven (there will always be edge cases, but we are encountering and solving for these in the core DB development path anyway). Our CSV loader is fairly robust and performant (it can easily handle 4MM rows/sec in ideal circumstances, and is currently being worked on to make it even more so).
This is a great point. Some of it comes down to the semantics of a database vs. a transient in-memory analytical data structure. In needing to handle persistence as a first-class idea, we already deal with both update and append scenarios. For example, by default a load_table invocation via pymapd triggers an append, while there are explicit commands to drop/truncate/delete etc. as a result of being a database engine. So I'd hope that over time we could offer these existing operations as configurable options at load time (e.g. append by default, truncate and reload, etc.). Further, we're headed down the path of natively reading on-disk formats without explicitly needing to import them: with Parquet (and Arrow more immediately), we are working on treating them like externally registered containers, so this will likely become even less of a friction point in the next few months.
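A minimal sketch of that append-by-default behavior via pymapd (the connection details and table name here are just placeholders, not part of this proposal):

```python
import pandas as pd
import pymapd

# Placeholder connection details for a local OmniSciDB instance
con = pymapd.connect(user="admin", password="HyperInteractive",
                     host="localhost", dbname="omnisci")

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# load_table appends rows to the table (creating it if it does not exist);
# there is no implicit truncate-and-reload
con.load_table("my_table", df)

# Replacing the contents requires an explicit command first
con.execute("TRUNCATE TABLE my_table")
con.load_table("my_table", df)
```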
So part of why intake is successful is that it leverages some fairly standardized data structures and readers in Python land, e.g. xarray etc. Would the plan be to support reading all those data types here, or just the CSV type?
It seems to me that if you want to avoid serializing anything into in-memory Python data structures, then the emphasis should be on recognizing which intake sources point at formats consumable by the OmniSciDB data loaders and then just calling those.
In some sense this is more about the OmniSciDB data loaders understanding the intake catalog format than any actual code in intake.
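As a rough sketch of that matching step (the catalog path and the set of natively ingestible drivers are placeholders, and the keys returned by describe() vary a bit across intake versions):

```python
import intake

# Drivers whose underlying files OmniSciDB could COPY FROM directly (placeholder set)
NATIVELY_INGESTIBLE = {"csv", "textfiles"}

cat = intake.open_catalog("catalog.yml")
for name in cat:
    desc = cat[name].describe()
    driver = str(desc.get("driver") or desc.get("plugin"))
    if any(d in driver for d in NATIVELY_INGESTIBLE):
        print(f"{name}: file can be handed straight to OmniSciDB's loader")
    else:
        print(f"{name}: would need to be materialized in memory first")
```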
@dharhas @ian-r-rose a handle to the supported file formats OmniSci can ingest is here: https://docs.omnisci.com/latest/6_loading_data.html#copy-from
@dharhas agreed. Our first goal is to avoid building an alternate catalog infra. If we can do the following as a first pass, it would be great:
1) Point the OmniSci loader workflow at an intake catalog server, purely with the intent to use the catalog as a, well, catalog of data sources available.
2) In the first pass, OmniSci can simply consume the small subset of intake formats that OmniSci already supports.
Normally we'd do this via the COPY FROM CLI (or a Python wrapper around its underlying Thrift call). Mostly these are delimited[+compressed] formats like csv(.*gz) etc. I think an MVP would be minimal new development for a loader that wraps the COPY FROM API. To @ian-r-rose's earlier point, this could be hidden under a to_ibis call (sketched below).
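Roughly, such a wrapper might look like the following (table name, file path, and connection details are placeholders; COPY FROM runs server-side, so the file must be visible to the OmniSciDB server, and the target table is assumed to already exist with a compatible schema):

```python
import pymapd

def copy_from(con, table_name, path):
    """Ingest a delimited (optionally gzipped) file into an OmniSciDB table."""
    con.execute(f"COPY {table_name} FROM '{path}' WITH (header='true')")

# Placeholder connection details
con = pymapd.connect(user="admin", password="HyperInteractive",
                     host="localhost", dbname="omnisci")
copy_from(con, "flights", "/data/flights_2019.csv.gz")
```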
Over time, it would be good to write loaders for more (within reason) of the plugin formats that Intake supports.
Based on a conversation with @niviksha and @dharhas on Slack.
The current state
The main focus of this intake driver thus far has been on consuming data that has already been loaded into an OmniSci database. The general flow looks like
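Something like the following, in sketch form (the catalog path and source name are placeholders, and the to_ibis accessor follows the naming floated elsewhere in this thread, so the exact method may differ):

```python
import intake

# Open a catalog describing sources that already live in an OmniSciDB instance
cat = intake.open_catalog("catalog.yml")

# Get an ibis expression backed by the OmniSci table, without pulling
# the data into local memory
expr = cat.my_omnisci_table.to_ibis()
```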
and then proceed with the ibis expression to assemble your query. This is fairly idiomatic intake, and is well-supported already.

OmniSci as an alternative container
The main container types for intake are dask and pandas dataframes. We have recently added what amounts to an ibis container type to support the above workflow. Here, we would like to push that a bit further and make OmniSciDB a container type of its own (allowing users to interact with it via ibis). The workflow would be to ingest the raw data into OmniSciDB first, and then hand the user an ibis expression over the resulting table.
The pseudocode for this interaction would look almost identical to the above:
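For instance (again with a placeholder source name, and the same caveat about the to_ibis accessor):

```python
import intake

cat = intake.open_catalog("catalog.yml")

# my_csv_source points at a raw CSV file rather than an existing OmniSci
# table; the driver would ingest it into OmniSciDB behind the scenes and
# then hand back an ibis expression over the new table
expr = cat.my_csv_source.to_ibis()
```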
The difference being that the CSV (or whatever) is loaded into the database first (probably using COPY FROM), and then you get your handle to ibis. The load step would potentially be time-consuming, but perhaps better than reading everything into memory.

Things to consider
A few outstanding questions in my mind: