When trying to create a Spark DataFrame from an Eland DataFrame, I get the following error:
KeyError: 'Requested column [0] is not in the DataFrame.'
I tried renaming/filtering out columns with special characters (@) and specifying the schema on createDataFrame(), but I always get the same error.
Is it not possible to create a PySpark DataFrame from an Eland DataFrame?
I'm using sdf = spark.createDataFrame(df_filter, schema=df_schema)
It works fine when creating a Spark df from a pandas df (after converting the Eland df with eland_to_pandas()), but that's not really ideal for big DataFrames.
Full error for more details:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py in <module>
----> 1 sdf = spark.createDataFrame(df_filter, schema = df_schema)
/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
746 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
747 else:
--> 748 rdd, schema = self._createFromLocal(map(prepare, data), schema)
749 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
750 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py in _createFromLocal(self, data, schema)
411 # make sure data could consumed multiple times
412 if not isinstance(data, list):
--> 413 data = list(data)
414
415 if schema is None or isinstance(schema, (list, tuple)):
/opt/conda/lib/python3.7/site-packages/eland/dataframe.py in __getitem__(self, key)
458
459 def __getitem__(self, key):
--> 460 return self._getitem(key)
461
462 def __dir__(self):
/opt/conda/lib/python3.7/site-packages/eland/dataframe.py in _getitem(self, key)
1211 return DataFrame(_query_compiler=self._query_compiler._update_query(key))
1212 else:
-> 1213 return self._getitem_column(key)
1214
1215 def _getitem_column(self, key):
/opt/conda/lib/python3.7/site-packages/eland/dataframe.py in _getitem_column(self, key)
1215 def _getitem_column(self, key):
1216 if key not in self.columns:
-> 1217 raise KeyError(f"Requested column [{key}] is not in the DataFrame.")
1218 s = self._reduce_dimension(self._query_compiler.getitem_column_array([key]))
1219 return s
KeyError: 'Requested column [0] is not in the DataFrame.'
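My reading of the traceback above: _createFromLocal calls list(data), and since (in this version at least) Eland's DataFrame doesn't seem to define __iter__, Python falls back to the legacy iteration protocol and calls __getitem__(0), __getitem__(1), ... with integer keys, which Eland then treats as column lookups. Here is a minimal reproduction of that mechanism (FakeElandDataFrame is just a stand-in I wrote to illustrate, not the real Eland class):

```python
class FakeElandDataFrame:
    """Mimics the relevant bit of eland's DataFrame: column lookup via
    __getitem__, but no __iter__ method defined."""

    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, key):
        if key not in self.columns:
            raise KeyError(f"Requested column [{key}] is not in the DataFrame.")
        return key  # stand-in for a returned Series


df = FakeElandDataFrame(["a", "b"])

# list(df) triggers Python's legacy iteration protocol: with no __iter__
# defined, Python calls df[0], df[1], ... with integer keys -- which is
# exactly where the integer [0] in the KeyError comes from.
try:
    list(df)
except KeyError as e:
    print(e)  # prints 'Requested column [0] is not in the DataFrame.'
```

If that is indeed the cause, createDataFrame would need materialized rows (which is what the eland_to_pandas() detour provides) rather than the lazy Eland object.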