RumbleDB / rumble

⛈️ RumbleDB 1.22.0 "Pyrenean oak" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
http://rumbledb.org/

Add support for writing in different output formats #776

Closed ingomueller-net closed 2 years ago

ingomueller-net commented 4 years ago

The possibility to write the query result into a file (like through --output-path) should be extended to other formats such as compressed JSON, Parquet, CSV, etc.

This would extend Rumble to a completely new set of use cases, namely converting data rather than just querying it. I'd argue that this is a very common use case: when you deal with messy data, you often want to clean it and then load it into a different system once you are done. Rumble could be ideal for that case, but it is missing write support for other formats.

Note that many formats have a great number of configuration options: compressed JSON can use Snappy, GZIP, LZMA, etc., each of which has different compression levels; Parquet can have different encodings and different compression algorithms and levels for each column; and so on. Ideally, all of these should be exposed. (Otherwise, the user needs to do a second conversion with a different tool, and then writing to JSON would have been enough...)
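To make concrete the kind of writer configuration being asked for, here is a minimal sketch in Scala of the plain Spark DataFrameWriter surface such a feature would presumably map onto. This is not RumbleDB code; the output paths are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Illustrative Spark writer calls for three target formats.
def writeExamples(df: DataFrame): Unit = {
  // GZIP-compressed JSON Lines output
  df.write.mode(SaveMode.Overwrite)
    .option("compression", "gzip")
    .json("/tmp/out-json-gz")

  // Parquet with Snappy compression (this option applies to the
  // whole file, not to individual columns)
  df.write.mode(SaveMode.Overwrite)
    .option("compression", "snappy")
    .parquet("/tmp/out-parquet")

  // CSV with a header row
  df.write.mode(SaveMode.Overwrite)
    .option("header", "true")
    .csv("/tmp/out-csv")
}
```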

ghislainfourny commented 4 years ago

Very good suggestion, Ingo, thank you.

ghislainfourny commented 4 years ago

This is implemented with --output-format and --output-format-option:*, where everything is forwarded to Spark's format() and option() methods on the output writer object.

An error is thrown if the output format is not JSON and the output is not structured (i.e., not backed by a DataFrame). For direct format-to-format conversion use cases this is already the case, as parquet-file(), etc. all return DataFrames. For more complex queries, an annotate() call with a schema specifying the type of each column does the trick. See the sketch below.
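As a rough sketch of the forwarding described above, assuming the CLI flags map one-to-one onto Spark's writer API: the helper and parameter names here are hypothetical, not RumbleDB internals, and it also shows why the input must be a DataFrame (the writer is only available on structured data).

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical forwarding of --output-format / --output-format-option:*
// values to Spark's DataFrameWriter.
def writeWithOptions(df: DataFrame,
                     outputPath: String,
                     outputFormat: String,
                     formatOptions: Map[String, String]): Unit = {
  var writer = df.write.format(outputFormat)   // e.g. "parquet", "csv"
  for ((key, value) <- formatOptions) {
    writer = writer.option(key, value)         // e.g. "compression" -> "snappy"
  }
  writer.save(outputPath)
}
```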

I am not sure whether this also covers the ability to specify compression per column. If it does not, I just need a pointer to how this is done in Spark, and the CLI can then be extended accordingly.