Thanks for reporting the issue for Koalas!
I'm going to take a look at this.
@ueshin @HyukjinKwon Could you let me know if there is a special reason that we disallow creating a DataFrame (or, more precisely, a Series) from a null dataset?
>>> ks.DataFrame()
...
ValueError: can not infer schema from empty or null dataset
>>> pd.DataFrame()
Empty DataFrame
Columns: []
Index: []
It's because PySpark, by default, tries to infer the type from the given data. If there is no data, or a column contains only nulls, PySpark cannot infer that column's data type when building the DataFrame.
>>> import pandas as pd
>>> row = {'a': [1], 'b':[None]}
>>> pd.DataFrame(row).dtypes
a int64
b object
pandas has an object type that can contain anything, whereas PySpark has no such type. So it's actually an issue in PySpark.
Maybe we should have a way to explicitly set the schema (or dtypes) and avoid type inference in Koalas so that we can allow both null and empty DataFrame cases.
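For reference, a minimal sketch of the underlying PySpark behaviour (assuming an active SparkSession named spark; the column names are only for illustration):
from pyspark.sql.types import StructType, StructField, LongType, StringType

# spark.createDataFrame([(1, None)], ["a", "b"])  # fails: the type of "b" cannot be inferred from nulls

# With an explicit schema there is nothing to infer, so the null column is fine:
schema = StructType([StructField("a", LongType()), StructField("b", StringType())])
spark.createDataFrame([(1, None)], schema).show()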
So pandas has the "dtype" keyword, e.g.
import numpy as np
row = {'a': [1], 'b': [None]}
df = pd.DataFrame(row, dtype=np.int64)
It allows for only a single type to be passed, which is then enforced for all columns.
Could you allow the full schema to be passed as a keyword, much like for Spark DataFrames? Something like:
row = {'a': [1], 'b': [None]}
df = ks.DataFrame(row, dtypes=[np.int64, np.float64])
If the dtypes cannot be applied, such as here:
row = {'a': [1], 'b': ["I am a string"]}
df = ks.DataFrame(row, dtypes=[np.int64, np.float64])
then it would raise an error.
Yeah, I think the last case should work.
This also occurs when initialising an empty dataframe through ks.DataFrame([]).
Thanks for sharing that, @jorijnsmit.
So what is the plan for this issue? What should we do with empty columns so that they don't throw such an error?
So one way I've been able to get around this issue, @ntrang086, is initializing the empty dataframe in Spark (with an explicit schema) and then translating to koalas.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Build the empty DataFrame in Spark with an explicit schema, then convert it to Koalas.
schema = StructType([
    StructField("a", StringType()),
    StructField("b", TimestampType()),
    StructField("c", DoubleType()),
])
kdf = spark.createDataFrame([], schema=schema).to_koalas()
@mwb222 Thanks for sharing the workaround. My use case is this: I start with a non-empty pandas dataframe, convert it to Koalas, and then use Koalas functionality on it. The issue is that a Koalas groupby on the converted dataframe sometimes produces empty dataframes, and that is when I get this error.
Is there a workaround for this use case?
Will I have to convert pandas to PySpark with a specified schema and then convert to Koalas?
Ah, I see. I haven't run into that situation in my own work, so I don't know if the workaround above will help, but it might be worth a try to see if the explicit schema carries through in the background. Sorry!
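If it helps, here is a rough sketch of that pandas -> Spark (explicit schema) -> Koalas path; the DataFrame pdf, the SparkSession spark, and the schema below are illustrative assumptions, not from this thread:
import pandas as pd
import databricks.koalas as ks  # importing Koalas makes .to_koalas() available on Spark DataFrames
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

pdf = pd.DataFrame({"a": [1], "b": [None]})  # hypothetical pandas input with an all-null column
schema = StructType([StructField("a", LongType()), StructField("b", DoubleType())])

# Passing the schema explicitly means Spark never has to infer a type for the null column.
kdf = spark.createDataFrame(pdf, schema=schema).to_koalas()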
@ntrang086 How about using type hints for the function you are trying to apply, assuming you are using apply or something else that takes a user function?
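For what it's worth, a minimal sketch of the type-hint idea (the data and function here are made up for illustration):
import numpy as np
import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1, 2, 3]})

# The return-type annotation tells Koalas the output schema up front,
# so it does not have to infer it from the returned (possibly empty) pandas data.
def add_one(s) -> ks.Series[np.int64]:
    return s + 1

kdf.apply(add_one)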
The function doesn't return a single value of a specific type like str, float or int; it returns a pandas dataframe or pandas series, which can be empty, and that is what causes the issue described above. Koalas already knows the return type is a pandas dataframe or series and has a problem "accepting" it :) . I could try what you suggested and see how it goes, but I'm not sure it'd work.
JFYI... using the read_csv function with a column without values I don't receive any errors, but with read_excel() the same error is raised.
@ederfdias Here is a possible workaround. Specify converters like below:
import numpy as np
df_ks = koalas.read_excel(
    ...,
    converters={i: (lambda x: str(x) if x else np.NaN) for i in range(30)},  # read the first 30 columns as strings
)
Apparently, np.NaN does the trick:
import numpy as np
import databricks.koalas as koalas
from pyspark.sql import functions as F
row = {'a': [1], 'b': [np.NaN]}
koalas.DataFrame(row).to_spark().where(F.col("b").isNull()).show()
output
+---+----+
| a| b|
+---+----+
| 1|null|
+---+----+
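Presumably this works because of the dtype pandas assigns, not anything Koalas-specific; a quick check with plain pandas:
import numpy as np
import pandas as pd

pd.DataFrame({"b": [None]}).dtypes    # b -> object: nothing for PySpark to infer a type from
pd.DataFrame({"b": [np.NaN]}).dtypes  # b -> float64: maps cleanly to Spark's DoubleType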
Now it works properly with pandas-on-Spark (available in Apache Spark 3.2 and above).
I'd recommend using pandas-on-Spark rather than Koalas, since Koalas is now in maintenance mode only.
>>> import pyspark.pandas as ps
>>> ps.DataFrame()
Empty DataFrame
Columns: []
Index: []
>>> ps.DataFrame([{"A": [None]}])
A
0 [None]
In case someone lands here for any reason: I have tested with Koalas 1.8.1 and the problem (with read_excel) does not happen anymore.
I find that creating a DataFrame from a dict gives this error, but for pandas there is no error.
I have tried setting dtype=np.int64 but this has not helped.