elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

'Requested column [0] is not in the DataFrame.' #332

Open helo-ch opened 3 years ago

helo-ch commented 3 years ago

When trying to create a Spark DataFrame from an Eland DataFrame, I get the following error: KeyError: 'Requested column [0] is not in the DataFrame.'

I tried renaming/filtering out columns with special characters (@) and specifying the schema on createDataFrame(), but I always get the same error.

Is it not possible to create a PySpark DataFrame from an Eland DataFrame? I'm using sdf = spark.createDataFrame(df_filter, schema=df_schema).

It works fine when creating a Spark DataFrame from a pandas DataFrame (after converting the Eland DataFrame with eland_to_pandas()), but that's not really ideal for big DataFrames.

Full error for more details:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py in <module>
----> 1 sdf = spark.createDataFrame(df_filter, schema = df_schema)

/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    411         # make sure data could consumed multiple times
    412         if not isinstance(data, list):
--> 413             data = list(data)
    414 
    415         if schema is None or isinstance(schema, (list, tuple)):

/opt/conda/lib/python3.7/site-packages/eland/dataframe.py in __getitem__(self, key)
    458 
    459     def __getitem__(self, key):
--> 460         return self._getitem(key)
    461 
    462     def __dir__(self):

/opt/conda/lib/python3.7/site-packages/eland/dataframe.py in _getitem(self, key)
   1211             return DataFrame(_query_compiler=self._query_compiler._update_query(key))
   1212         else:
-> 1213             return self._getitem_column(key)
   1214 
   1215     def _getitem_column(self, key):

/opt/conda/lib/python3.7/site-packages/eland/dataframe.py in _getitem_column(self, key)
   1215     def _getitem_column(self, key):
   1216         if key not in self.columns:
-> 1217             raise KeyError(f"Requested column [{key}] is not in the DataFrame.")
   1218         s = self._reduce_dimension(self._query_compiler.getitem_column_array([key]))
   1219         return s

KeyError: 'Requested column [0] is not in the DataFrame.'
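Until a direct path works, the pandas round-trip can at least be chunked so that only one slice of rows is materialized at a time. A rough sketch, using a plain pandas DataFrame as a stand-in for the Eland frame (with Eland you would instead pull each chunk via a filtered frame and eland_to_pandas(); the helper name and chunk size are illustrative, not part of either API):

```python
import pandas as pd

def iter_row_chunks(df, chunk_size):
    """Yield successive row slices so only one chunk is in memory at a time."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

# stand-in for an Eland frame; with eland you would fetch each chunk
# from Elasticsearch (e.g. via a filtered frame) rather than slice with iloc
df = pd.DataFrame({"a": range(10), "b": range(10)})

chunks = list(iter_row_chunks(df, 4))
```

Each pandas chunk could then be passed to spark.createDataFrame() and the resulting Spark DataFrames unioned, keeping peak driver memory bounded by the chunk size.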
sethmlarson commented 3 years ago

I wonder if this is something with how Spark accesses the DataFrame, and whether there are interfaces or behaviors missing from Eland's DataFrame.
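One plausible mechanism, sketched with a stand-in class: the traceback shows pyspark's _createFromLocal calling list(data), and a Python class that defines __getitem__ but not __iter__ is iterated via the legacy sequence protocol, which calls __getitem__(0), __getitem__(1), and so on with integer keys. A column-keyed __getitem__ like Eland's would then receive the integer 0 and raise exactly this KeyError (FakeFrame below is a hypothetical illustration, not Eland's actual class):

```python
# A minimal stand-in: __getitem__ expects column labels, and there is
# no __iter__, so list() falls back to integer-indexed iteration.
class FakeFrame:
    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, key):
        if key not in self.columns:
            raise KeyError(f"Requested column [{key}] is not in the DataFrame.")
        return key

try:
    # iteration starts by calling __getitem__(0) with an integer key
    list(FakeFrame(["a", "b"]))
except KeyError as err:
    message = str(err)
```

If this is the cause, the error would disappear for any object that either implements __iter__ or accepts integer keys, which may point at the interface Spark expects and Eland's DataFrame lacks.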