VizierDB / vizier-scala

The Vizier kernel-free notebook programming environment

LoadCSV (with warnings) kills the GC during query optimization on sufficiently wide datasets. #227

Open okennedy opened 1 year ago

okennedy commented 1 year ago

**Describe the bug** Spark's optimizer aggressively inlines projections: it replicates the entire inlined subtree at each spot where it occurs. That's fine when the column being inlined is small, or is inlined a fixed number of times. On the other hand, LoadSparkCSV creates the query:

```
Project [ csv.col1, csv.col2, csv.col3, ... ]
  Project [ from_csv( input_data, Struct( StructField(col1), StructField(col2), StructField(col3), ... ) ) as csv ]
```

which optimizes to:

```
Project [
   from_csv( input_data, Struct( StructField(col1), StructField(col2), StructField(col3), ... ) ).col1,
   from_csv( input_data, Struct( StructField(col1), StructField(col2), StructField(col3), ... ) ).col2,
   from_csv( input_data, Struct( StructField(col1), StructField(col2), StructField(col3), ... ) ).col3,
   ...
]
```

In other words, the size of the resulting tree is $O(n^2)$ in the number of columns, which, for a sufficiently wide dataset, will thrash the GC (~150 columns seems to be enough to do it).
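The blowup can be modeled without Spark. The sketch below uses a toy expression tree (not Spark's actual Catalyst classes, and the node names are made up for illustration) to compare the node count before and after the two `Project`s are collapsed: shared, the `from_csv` subtree is counted once; inlined, it is copied once per column, giving quadratic growth.

```scala
// Toy model of an expression tree, illustrating why collapsing the two
// Projects is quadratic. These classes are illustrative, not Spark's.
sealed trait Expr { def size: Int }
case class Leaf(name: String) extends Expr { def size = 1 }
case class Node(name: String, children: Seq[Expr]) extends Expr {
  def size = 1 + children.map(_.size).sum
}

// from_csv(input_data, Struct(col1, col2, ..., colN)): O(n) nodes by itself.
def fromCsv(nCols: Int): Expr =
  Node("from_csv", Leaf("input_data") +: (1 to nCols).map(i => Leaf(s"col$i")))

// Nested form: one shared from_csv subtree plus n field references.
def nestedSize(nCols: Int): Int = fromCsv(nCols).size + nCols

// Collapsed form: a full copy of the from_csv subtree under each column.
def collapsedSize(nCols: Int): Int =
  (1 to nCols).map(_ => Node("get_field", Seq(fromCsv(nCols))).size).sum
```

At 150 columns the nested plan stays in the hundreds of nodes while the collapsed plan is already past 22,000, and every node is a freshly allocated object the GC has to chase.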

**Expected behavior** Don't thrash the GC. Some thoughts on a fix:
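One possible direction (my own assumption, not necessarily the fix the author has in mind): the inlining is done by Catalyst's `CollapseProject` rule, and Spark lets you exclude optimizer rules by class name via the `spark.sql.optimizer.excludedRules` setting, e.g. in `spark-defaults.conf`:

```
spark.sql.optimizer.excludedRules  org.apache.spark.sql.catalyst.optimizer.CollapseProject
```

This is a blunt instrument, though: it disables the rule for every query in the session, which could regress plans that benefit from project collapsing, so a targeted fix in how LoadSparkCSV builds its plan would likely be preferable.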