
Chaining array-unnest and member lookup on Parquet files produces erroneous results #812

Closed: ingomueller-net closed this issue 3 years ago

ingomueller-net commented 3 years ago

Consider the following file test.json:

{ "a": [{"x": 1, "y": 2}]}

Convert it to Parquet by running the following script through spark-submit:

import pyspark.sql

spark = pyspark.sql.SparkSession.builder.getOrCreate()

# Read the one-line JSON file and write it back out as Parquet.
df = spark.read.format('json').load('test.json')
df.write.parquet('test.parquet')
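For reference, a quick schema check (a sketch, not part of the original report; it assumes it runs in the same script after the write) shows that Spark infers a as an array of structs with long fields:

# Hypothetical sanity check: confirm the schema Spark inferred for the Parquet file.
df = spark.read.parquet('test.parquet')
df.printSchema()
# root
#  |-- a: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- x: long (nullable = true)
#  |    |    |-- y: long (nullable = true)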

Now the following happens in Rumble:

rumble$ parquet-file("test.parquet/*.parquet")
>>>
>>>
{ "a" : [ { "x" : 1, "y" : 2 } ] }
The query took 258 milliseconds to execute.
rumble$ parquet-file("test.parquet/*.parquet").a
>>>
>>>
[ { "x" : 1, "y" : 2 } ]
The query took 295 milliseconds to execute.
rumble$ parquet-file("test.parquet/*.parquet").a[]
>>>
>>>
{ "x" : 1, "y" : 2 }
The query took 277 milliseconds to execute.
rumble$ parquet-file("test.parquet/*.parquet").a[].x
>>>
>>>

The query took 223 milliseconds to execute.

Notice that the correct result of the last query should be the following, since .a[] unnests the array into its single member object, on which .x then looks up the x field:

1

This is probably due to a bug in pushing down projections and array unnesting into Spark.
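For comparison, the same unnest-and-project written directly in PySpark does return the expected value; this is a sketch of, presumably, the kind of plan the pushdown should generate (it assumes the spark session from the conversion script above):

from pyspark.sql.functions import col, explode

# Unnest the array column, then project the x field of each member.
df = spark.read.parquet('test.parquet')
df.select(explode(col('a')).alias('item')).select(col('item.x').alias('x')).show()
# +---+
# |  x|
# +---+
# |  1|
# +---+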

ghislainfourny commented 3 years ago

Thank you for reporting this, Ingo. I will look into it.

ingomueller-net commented 3 years ago

Attaching test.zip, which contains test.parquet created as described above.

ghislainfourny commented 3 years ago

A fix is on the way.

ghislainfourny commented 3 years ago

This is now fixed. Thank you for reporting it, and feel free to close this issue if you can confirm that it works.
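With the fix in place, re-running the originally failing query should print the expected value; a sketch of the expected session, not an actual run:

rumble$ parquet-file("test.parquet/*.parquet").a[].x
>>>
>>>
1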