UBOdin / mimir

Data-ish exploration through SQL+Uncertainty
http://mimirdb.info
Apache License 2.0
26 stars 13 forks source link

Hybrid Spark/Local execution engine #322

Open okennedy opened 5 years ago

okennedy commented 5 years ago

Presently, all of Mimir's query processing is handled by Spark. Spark scales very well, but has an unpleasantly high start-up cost, making it completely inappropriate for small, simple queries. A way around this would be to allow Mimir to process queries internally for sufficiently small / local-only datasets (e.g., locally loaded CSV / JSON data).

This feature would likely first require migrating responsibility for tracking table definitions from Hive to Mimir (see #321). With Mimir tracking table dependencies, we'd need...

lordpretzel commented 5 years ago

That sounds like a very heavy-handed approach to the Spark start-up cost. Also the question is how well a Scala DB engine will perform without doing Tungsten-style trickery. Can't we use a light-weight DBMS with good support for external tables? So we only have to write another backend-language specific RA-to-SQL compiler variant instead of a full query engine?

I think in the long run, we need to extend the idea of running notebooks on samples and then deploying/generalizing the workflow to the full dataset, but this is futher away in the future

okennedy commented 5 years ago

I'm not sure I agree that it's heavy handed. We don't want to re-implement Postgres or MemSQL here. This is just a stand-in for short-running queries over small data. This is basically the class project for UB's CSE-562 -- it's very doable in a short time. The main complexity -- deciding how to federate queries with spark -- would still be present if we used an external database. That being said, you're right that it's complexity that would be good to externalize if at all possible.

I don't think SQLite is a good fit. Going through SQL has been the source of a large number of bugs: SQL's user-oriented capitalization semantics, variable/table naming, and more. SQLite also suffers from significant overheads from UDFs. On queries with heavy uncertainty Spark, even with its overheads, is significantly faster.

I would definitely prefer something that lets us go directly through RA and staying inside the JVM. Perhaps something like DBLab might be a good fit (Note: I haven't looked at it at all yet, I'm just familiar with the premise).