NICTA / serene

Serene Data Integration Platform
Apache License 2.0

Training fails if we have too many features (400+) #3

Open nruemmele opened 7 years ago

nruemmele commented 7 years ago

When using char-dist-features together with header features for the "dbpedia" domain, we end up with more than 400 features. Training a RandomForestClassifier with Spark then fails with the following error:

Cause: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

This appears to be a known bug in Spark, but it's not clear whether there is an easy fix for it. Related reports:

- https://issues.apache.org/jira/browse/SPARK-16845
- http://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe
- https://issues.apache.org/jira/browse/SPARK-17092

SparkTestSpec currently reproduces this error.
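
For context, here is a toy sketch of the mechanism behind the error. It is not Spark's actual code. The JVM caps the bytecode of a single method at 64 KB, so a generated `compare` method that inlines one comparison per column grows with the number of columns and can exceed that cap on wide rows. As far as I understand SPARK-16845, the direction taken upstream is to split the generated comparison into smaller helper methods. The Python below only illustrates that splitting idea on plain lists. The names `compare_range`, `make_chunked_comparator`, and `sort_rows` are hypothetical, and the chunk size of 32 is arbitrary.

```python
import functools
from typing import Callable, List, Sequence

Row = Sequence[float]
Comparator = Callable[[Row, Row], int]


def compare_range(a: Row, b: Row, start: int, end: int) -> int:
    """Lexicographically compare columns [start, end) of two rows."""
    for i in range(start, end):
        if a[i] < b[i]:
            return -1
        if a[i] > b[i]:
            return 1
    return 0


def make_chunked_comparator(num_columns: int, chunk_size: int = 32) -> Comparator:
    """Build a row comparator that delegates to one small helper call per chunk.

    Hypothetical illustration: each chunk stands in for a separate, small
    generated method, so no single unit has to cover every column.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    bounds = [
        (start, min(start + chunk_size, num_columns))
        for start in range(0, num_columns, chunk_size)
    ]

    def compare(a: Row, b: Row) -> int:
        # The first chunk that differs decides the ordering.
        for start, end in bounds:
            result = compare_range(a, b, start, end)
            if result != 0:
                return result
        return 0

    return compare


def sort_rows(rows: List[Row], chunk_size: int = 32) -> List[Row]:
    """Sort equally wide rows using the chunked comparator."""
    if not rows:
        return []
    comparator = make_chunked_comparator(len(rows[0]), chunk_size)
    return sorted(rows, key=functools.cmp_to_key(comparator))
```

Calling `sort_rows(rows)` on rows with 400+ columns gives the same ordering as a single monolithic lexicographic comparison. The only difference is that the work is spread over many small units, which is the property that matters for the 64 KB limit.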

nruemmele commented 7 years ago

https://jira.csiro.au/browse/SERENE-202