NICTA / serene

Serene Data Integration Platform
Apache License 2.0

Training fails if we have too many features (400+) #3

Open nruemmele opened 7 years ago

nruemmele commented 7 years ago

When using char-dist-features together with header features for the "dbpedia" domain, we end up with a large number of features (400+). Training a RandomForestClassifier with Spark then fails with the following error:

Cause: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

Apparently, there's a bug in Spark, but it's not clear if there is an easy fix for this problem:

SparkTestSpec currently reproduces this error.
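
For readers without access to the project's test, here is a minimal standalone sketch of one way a generated `SpecificOrdering` comparator can grow very large: sorting a DataFrame on several hundred columns at once. This is not the contents of SparkTestSpec, and it is only an assumption that the training failure comes from a comparable wide sort. The object name `WideOrderingSketch`, the column names `f0`, `f1`, ... and the count of 450 are made up for illustration. Whether the exception actually appears depends on the Spark version in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, rand}

import scala.util.{Failure, Success, Try}

// Hypothetical sketch, not the project's SparkTestSpec.
object WideOrderingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("wide-ordering-sketch")
      .getOrCreate()

    // Assumed feature count, chosen to mirror the "400+" in the report.
    val numFeatures = 450

    // Build a small DataFrame with one random Double column per feature.
    val featureCols = (0 until numFeatures).map(i => rand(i.toLong).as(s"f$i"))
    val wide = spark.range(100).select(featureCols: _*)

    // Sorting on every column makes Spark generate a row comparator
    // that contains comparison code for each of the columns.
    val attempt = Try(wide.orderBy(wide.columns.map(col): _*).collect())

    attempt match {
      case Success(rows) => println(s"Sorted ${rows.length} rows without error")
      case Failure(e)    => println(s"Sort failed: ${e.getClass.getName}: ${e.getMessage}")
    }

    spark.stop()
  }
}
```

If the sort succeeds on your Spark version, the failure in training likely needs a different trigger, or that version already handles oversized generated comparators.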

nruemmele commented 7 years ago