cerndb / dist-keras

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
http://joerihermans.com/work/distributed-keras/
GNU General Public License v3.0

Addressing AnalysisException error in workflow example #77

Open · reedv opened 6 years ago

reedv commented 6 years ago

Without this change, I currently get the following error

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-38-3d60eef83573> in <module>()
      1 # Only select the columns we need (less data shuffling) while training.
----> 2 dataset = dataset.select("features_normalized", "label_index", "label")
      3 print(dataset)

/opt/mapr/spark/spark-2.1.0/python/pyspark/sql/dataframe.py in select(self, *cols)
    990         [Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]
    991         """
--> 992         jdf = self._jdf.select(self._jcols(*cols))
    993         return DataFrame(jdf, self.sql_ctx)
    994 

/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/opt/mapr/spark/spark-2.1.0/python/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: u"Reference 'label_out' is ambiguous, could be: label#320, label#324.;"

when running this section of the workflow notebook:

# For example:
# 1. Assume we have a label index: 3
# 2. Output dimensionality: 5
# With these parameters, we obtain the following vector in the DataFrame column: [0,0,0,1,0]

transformer = OneHotTransformer(output_dim=nb_classes, input_col="label_index", output_col="label")
dataset = transformer.transform(dataset)
# Only select the columns we need (less data shuffling) while training.
dataset = dataset.select("features_normalized", "label_index", "label")

This seems to be due to the fact that the raw_dataset already contains a field called Label, which conflicts with the label field that the OneHotTransformer creates later: Spark SQL resolves unqualified column names case-insensitively by default, so Label and label are treated as the same reference.
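
For reference, the collision can be reproduced outside the workflow with a minimal DataFrame whose columns differ only in case (a standalone sketch; the data and app name here are illustrative, not taken from the notebook):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ambiguity-repro").getOrCreate()

# Two columns whose names differ only in case. With the default
# spark.sql.caseSensitive=false, an unqualified reference to "label"
# matches both of them.
df = spark.createDataFrame([(1, 2)], ["Label", "label"])

# Raises: AnalysisException: Reference 'label' is ambiguous
df.select("label")

One possible workaround (a sketch, assuming the pre-existing column is named Label as described above; "label_original" is just an illustrative name) is to rename the original column before applying the OneHotTransformer, so that the newly created label column is the only match:

# Rename the conflicting column before the one-hot step.
dataset = raw_dataset.withColumnRenamed("Label", "label_original")

transformer = OneHotTransformer(output_dim=nb_classes, input_col="label_index", output_col="label")
dataset = transformer.transform(dataset)
dataset = dataset.select("features_normalized", "label_index", "label")

Alternatively, enabling case-sensitive resolution lets the two names coexist, though it changes name resolution for the whole session:

spark.conf.set("spark.sql.caseSensitive", "true")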