RumbleDB / rumble

⛈️ RumbleDB 1.22.0 "Pyrenean oak" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
http://rumbledb.org/

using rumbledb interactively #306

Closed pjfanning closed 5 years ago

pjfanning commented 5 years ago

I'd be interested in running a JVM that accepts HTTP requests containing JSONiq queries and uses RumbleDB code to run them. The RumbleDB samples use spark-submit to run queries from the command line or from a shell. I would like to leave the HTTP server running over extended periods and build a UI that interacts with it.

I would appreciate if someone could give me some advice on where to start.

ghislainfourny commented 5 years ago

Hello and many thanks for your interest in Rumble.

I think it should be doable in a simple way by using the same internal JSONiq execution API used by the CLI and the Shell. It may be a nice opportunity to make this API "official" for all who want to invoke queries in other ways than the Shell (including HTTP). Also, this is something that would be nice to have as a public "try-it-out" page.

Before we dig in further, could you elaborate on two points? The simplest HTTP server I can imagine receives a JSONiq query (POST), reads no input data, and produces something small enough to be sent back as an HTTP response. A more elaborate HTTP server would have an underlying cluster.

  1. Input: do you plan to support only queries with no input (parallelize)? Or if you need to read large datasets, where would the data lie (would you have your own internal cluster reachable from the HTTP server)?
  2. Output: do you plan to support queries with only short output (printable to screen and reasonable size to download)? Or if you need to output large files: where to (again, an internal cluster)?

Thanks!

pjfanning commented 5 years ago

Thanks @ghislainfourny for your detailed response. I agree that there are quite a few different scenarios that could be supported. In my use case I would be dealing with reasonably sized data sets, so I would like to return the data in the HTTP response. The data to process would already be stored in HDFS, or possibly in AWS S3.

ghislainfourny commented 5 years ago

It makes sense.

I would recommend giving it a first try by reusing JsoniqQueryExecutor.run(), after saving the query received in the HTTP POST request under the path querypath on HDFS and picking some output location outputpath on HDFS:

```java
// The JSONiq query should already have been copied to this location:
String querypath = "hdfs://host:port/user/hadoop/query.jq";
// Make sure this path does not already exist!
String outputpath = "hdfs://host:port/user/hadoop/output";
// Simulate CLI parameters. You can raise --result-size to allow more
// objects in the output, but setting it too high risks a crash.
SparksoniqRuntimeConfiguration sparksoniqConf =
    new SparksoniqRuntimeConfiguration(new String[] { "--result-size", "1000" });
JsoniqQueryExecutor rumbleEngine = new JsoniqQueryExecutor(false, sparksoniqConf);
rumbleEngine.run(querypath, outputpath);
```

Then, you can read from outputpath, concatenate the files and output them as an HTTP response.
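Concretely, that concatenation step could be sketched as follows. This is a hypothetical helper (the class name OutputConcatenator is mine), written against the local filesystem for simplicity; reading directly from HDFS would use Hadoop's FileSystem API instead, but the logic is the same:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class OutputConcatenator {

    // Concatenates Spark-style part-* files in lexicographic order into a
    // single string suitable for use as an HTTP response body. Marker files
    // such as _SUCCESS are skipped.
    public static String concatenateParts(Path outputDir) throws IOException {
        try (var entries = Files.list(outputDir)) {
            List<Path> parts = entries
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
            StringBuilder body = new StringBuilder();
            for (Path p : parts) {
                body.append(Files.readString(p));
            }
            return body.toString();
        }
    }
}
```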

This should allow you to build an HTTP server prototype, with the rumble JAR on the classpath and with just the above few lines to invoke the JSONiq query and write its output to HDFS.

Note that you may need to embed the HTTP server inside a jar and launch this jar with spark-submit, to make sure everything runs within a Spark environment. The main function in the HTTP server jar can create the HTTP server, start listening on port 8080 or similar, and then invoke the above code.
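A minimal skeleton of such a main, using the JDK's built-in com.sun.net.httpserver package (the class name QueryServer and the /query route are illustrative; the actual Rumble invocation is left as a placeholder comment, since it needs the rumble jar on the classpath):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class QueryServer {

    // Placeholder for the actual Rumble invocation: write the query to HDFS,
    // call JsoniqQueryExecutor.run(), then read back and concatenate the output.
    static String executeQuery(String query) {
        return "received query of length " + query.length();
    }

    // Starts an HTTP server whose /query endpoint reads the POST body as a
    // JSONiq query and returns the query result as the response body.
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/query", exchange -> {
            try (InputStream in = exchange.getRequestBody()) {
                String query = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                byte[] response = executeQuery(query).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, response.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(response);
                }
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080);
        System.out.println("Listening on port 8080");
    }
}
```

This jar, with the server started from main, would then be submitted via spark-submit so that the Spark context is available to Rumble.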

When this works, we could extend the API to provide a more efficient function (for example, that collects the output directly to memory so you don't have to read back the output from HDFS).

A completely different (but more encapsulated) possibility is to have the HTTP server call Rumble via the CLI, invoking spark-submit from within Java with the rumble jar and --query-path and --output-path set appropriately. It would be slower, however, because the executors must be allocated and deallocated on every request.
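A rough sketch of that CLI-based approach using ProcessBuilder (the class and method names are mine and the jar path is illustrative, while --query-path and --output-path are the CLI options mentioned above):

```java
import java.util.ArrayList;
import java.util.List;

public class SparkSubmitLauncher {

    // Assembles the spark-submit command line for a single Rumble query run.
    public static List<String> buildCommand(String rumbleJar,
                                            String queryPath,
                                            String outputPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add(rumbleJar);
        cmd.add("--query-path");
        cmd.add(queryPath);
        cmd.add("--output-path");
        cmd.add(outputPath);
        return cmd;
    }

    // Runs the command, forwarding stdout/stderr, and returns the exit code.
    public static int run(List<String> cmd) throws Exception {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The HTTP handler would call buildCommand with a fresh output path per request, run it, and then read the output back as before.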

pjfanning commented 5 years ago

@ghislainfourny it would suit my use case best if I could pass the query text without writing it to a file first, and then use JsoniqQueryExecutor's run or runInteractive to get the result. Would it be feasible to extend JsoniqQueryExecutor with a method that takes the query text as a parameter?

ghislainfourny commented 5 years ago

Absolutely. A public Java API is actually on the way that will address this and provide a simple, high-level way to execute a query and iterate over its results. The goal is for it to be usable via a simple Maven import once it is published on Maven Central.

ghislainfourny commented 5 years ago

@pjfanning we now have an official Maven repository:

https://search.maven.org/search?q=g:com.github.rumbledb

and the public API is here:

http://rumbledb.org/site/apidocs/org/rumbledb/api/package-summary.html

pjfanning commented 5 years ago

@ghislainfourny thanks, I'll try that out