RumbleDB / rumble

⛈️ RumbleDB 1.21.0 "Hawthorn blossom" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
http://rumbledb.org/

count(json-file("...")) returns wrong result #70

Closed ingomueller-net closed 5 years ago

ingomueller-net commented 6 years ago

In my case, it always returns 100 for a file that has at least millions of records. I guess that 100 is the number of items displayed when the result is larger. This "display filter" seems to be applied before the aggregation.

ghislainfourny commented 6 years ago

I think it's just because the iterators for count() and other aggregations are not (yet) pushed down to Spark when the argument is backed by an RDD; instead, they materialize the RDD locally to compute the aggregation. It should be a very easy fix.
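A minimal sketch of how such a bug can arise (hypothetical names, in Python rather than RumbleDB's Java): if the aggregation materializes only a bounded prefix of the data, analogous to Spark's take(100) used for displaying results, then counts it locally, any dataset larger than the cap reports the cap.

```python
from itertools import islice

TAKE_LIMIT = 100  # display cap applied before the aggregation (assumed)

def buggy_count(records):
    # Mimic take(TAKE_LIMIT): pull at most 100 items into memory,
    # then count the materialized prefix instead of the whole dataset.
    return sum(1 for _ in islice(records, TAKE_LIMIT))

# A dataset with millions of records still reports 100.
big_dataset = range(5_000_000)
print(buggy_count(big_dataset))  # prints 100
```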

ghislainfourny commented 5 years ago

I am copying over a comment from another thread:

Only count(json-file("../confusion_sample.json")) and queries in which the child iterator directly has an RDD need to be addressed in this issue for now.

The idea, on the RDD side of the switch, is not to recursively invoke next() as is done on the non-RDD side (this is what otherwise triggers the take()), but instead simply to get the actual RDD (potentially dynamically casting the child iterator down to a type that supports getting the actual RDD), invoke Spark's count() action on it, wrap the obtained number into a JSONiq integer, and return it. It is that simple.
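The switch described above might look like the following sketch (hypothetical class and method names, in Python; the real implementation is Java against Spark's JavaRDD, and the stand-in class below only mimics the one RDD method needed):

```python
class FakeRDD:
    """Stand-in for a Spark RDD; only the count() action is needed here."""
    def __init__(self, n):
        self._n = n

    def count(self):
        return self._n


class ChildIterator:
    """Child runtime iterator that is backed by an RDD."""
    def __init__(self, rdd):
        self._rdd = rdd

    def is_rdd(self):
        return True

    def get_rdd(self):
        return self._rdd


class CountIterator:
    def __init__(self, child):
        self._child = child

    def evaluate(self):
        if self._child.is_rdd():
            # RDD side of the switch: push the aggregation down to Spark
            # instead of materializing items via recursive next() calls.
            rdd = self._child.get_rdd()
            return rdd.count()  # Rumble would wrap this as a JSONiq integer
        # Non-RDD side: count by iterating next() locally (omitted here).
        raise NotImplementedError


child = ChildIterator(FakeRDD(5_000_000))
print(CountIterator(child).evaluate())  # prints 5000000
```

The dynamic check (is_rdd/get_rdd) stands in for the downcast mentioned above: only child iterators that actually expose an RDD take the pushdown path.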

Supporting this through variables is more advanced and lower priority; no need to look into it now, but you can open a separate, lower-priority issue as a reminder.

CanBerker commented 5 years ago

The problem with the count function has been fixed in a PR. Other aggregate functions will be addressed later on.