Open okennedy opened 5 years ago
That sounds like a very heavy-handed approach to the Spark start-up cost. Also the question is how well a Scala DB engine will perform without doing Tungsten-style trickery. Can't we use a light-weight DBMS with good support for external tables? So we only have to write another backend-language specific RA-to-SQL compiler variant instead of a full query engine?
I think in the long run, we need to extend the idea of running notebooks on samples and then deploying/generalizing the workflow to the full dataset, but this is futher away in the future
I'm not sure I agree that it's heavy handed. We don't want to re-implement Postgres or MemSQL here. This is just a stand-in for short-running queries over small data. This is basically the class project for UB's CSE-562 -- it's very doable in a short time. The main complexity -- deciding how to federate queries with spark -- would still be present if we used an external database. That being said, you're right that it's complexity that would be good to externalize if at all possible.
I don't think SQLite is a good fit. Going through SQL has been the source of a large number of bugs: SQL's user-oriented capitalization semantics, variable/table naming, and more. SQLite also suffers from significant overheads from UDFs. On queries with heavy uncertainty Spark, even with its overheads, is significantly faster.
I would definitely prefer something that lets us go directly through RA and staying inside the JVM. Perhaps something like DBLab might be a good fit (Note: I haven't looked at it at all yet, I'm just familiar with the premise).
Presently, all of Mimir's query processing is handled by Spark. Spark scales very well, but has an unpleasantly high start-up cost, making it completely inappropriate for small, simple queries. A way around this would be to allow Mimir to process queries internally for sufficiently small / local-only datasets (e.g., locally loaded CSV / JSON data).
This feature would likely first require migrating responsibility for tracking table definitions from Hive to Mimir (see #321). With Mimir tracking table dependencies, we'd need...