Hybrid Spark/Local execution engine

okennedy commented 5 years ago

Presently, all of Mimir's query processing is handled by Spark. Spark scales very well, but has an unpleasantly high start-up cost, making it completely inappropriate for small, simple queries. A way around this would be to allow Mimir to process queries internally for sufficiently small / local-only datasets (e.g., locally loaded CSV / JSON data).

This feature would likely first require migrating responsibility for tracking table definitions from Hive to Mimir (see #321). With Mimir tracking table dependencies, we'd need...

[ ] Iterator-style implementation of Mimir's Relational Algebra (some implementations already exist)
- [x] Projection
- [x] Union
- [ ] Filtering
- [ ] Standard joins (Hash, Nested Loop, Sort/Merge)
- [ ] Sort
- [ ] Limit
- [ ] Aggregate / Group-By
[ ] A compiler to translate RA to the corresponding iterators (e.g., See Compiler.scala)
[ ] A physical optimizer...
- [ ] ... that can pick the right join plan
- [ ] ... that can decide whether to execute the query locally or in Spark (a simple heuristic might be good enough to start: Use Spark if a data source comes from S3, Hive, etc... and bypass spark otherwise)
- [ ] ... that can partition the query between local and Spark execution (Big, beefy files get processed spark-side, smaller aggregate datasets get processed locally)
- [ ] ... that can create optimized (e.g., projection-aware) data loaders.

lordpretzel commented 5 years ago

That sounds like a very heavy-handed approach to the Spark start-up cost. Also the question is how well a Scala DB engine will perform without doing Tungsten-style trickery. Can't we use a light-weight DBMS with good support for external tables? So we only have to write another backend-language specific RA-to-SQL compiler variant instead of a full query engine?

I think in the long run, we need to extend the idea of running notebooks on samples and then deploying/generalizing the workflow to the full dataset, but this is futher away in the future

okennedy commented 5 years ago

I'm not sure I agree that it's heavy handed. We don't want to re-implement Postgres or MemSQL here. This is just a stand-in for short-running queries over small data. This is basically the class project for UB's CSE-562 -- it's very doable in a short time. The main complexity -- deciding how to federate queries with spark -- would still be present if we used an external database. That being said, you're right that it's complexity that would be good to externalize if at all possible.

I don't think SQLite is a good fit. Going through SQL has been the source of a large number of bugs: SQL's user-oriented capitalization semantics, variable/table naming, and more. SQLite also suffers from significant overheads from UDFs. On queries with heavy uncertainty Spark, even with its overheads, is significantly faster.

I would definitely prefer something that lets us go directly through RA and staying inside the JVM. Perhaps something like DBLab might be a good fit (Note: I haven't looked at it at all yet, I'm just familiar with the premise).

UBOdin / mimir

Hybrid Spark/Local execution engine #322