databricks / koalas

Koalas: pandas API on Apache Spark

ValueError when reading dict with None #1084

Closed nickhalmagyi closed 3 years ago

nickhalmagyi commented 4 years ago

I find that reading a dict with a None value raises an error:

row =  {'a': [1], 'b':[None]}
ks.DataFrame(row)

ValueError: can not infer schema from empty or null dataset

but for pandas there is no error

row =  {'a': [1], 'b':[None]}
print(pd.DataFrame(row))

   a     b
0  1  None

I have tried setting dtype=np.int64 but this has not helped.

itholic commented 4 years ago

Thanks for reporting the issue to Koalas!

I'm going to take a look at this.

itholic commented 4 years ago

@ueshin @HyukjinKwon Could you let me know if there is a special reason that we disallow creating a DataFrame (or, more exactly, a Series) from a null dataset?

>>> ks.DataFrame()
...
ValueError: can not infer schema from empty or null dataset

>>> pd.DataFrame()
Empty DataFrame
Columns: []
Index: []
HyukjinKwon commented 4 years ago

It's because PySpark, by default, tries to infer the type from the given data. If there's no data or only nulls in the column, PySpark cannot infer its data type for a DataFrame.

>>> import pandas as pd
>>> row =  {'a': [1], 'b':[None]}
>>> pd.DataFrame(row).dtypes
a     int64
b    object

pandas has an object type that can contain anything, whereas PySpark has no such type. So it's actually an issue in PySpark.
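
To make the inference failure concrete, here is a minimal sketch of the underlying PySpark behaviour (assuming an active SparkSession bound to the name spark; this snippet is illustrative and not part of the original report):

data = [(1, None)]

# Spark must pick a concrete type for each column; with only nulls in the
# second column there is nothing to infer from, so this raises a ValueError.
try:
    spark.createDataFrame(data, ["a", "b"])
except ValueError as e:
    print(e)

# Supplying the schema explicitly skips inference entirely and works fine.
spark.createDataFrame(data, schema="a long, b double").show()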

HyukjinKwon commented 4 years ago

Maybe we should have a way to explicitly set the schema (or dtypes) and avoid type inference in Koalas so that we can allow both null and empty DataFrame cases.
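
Until something like that exists in Koalas, one sketch of a workaround in that spirit (purely illustrative, assuming the usual databricks.koalas import) is to give the all-null column a concrete dtype on the pandas side and then convert:

import numpy as np
import pandas as pd
import databricks.koalas as ks

row = {'a': [1], 'b': [None]}
pdf = pd.DataFrame(row).astype({'b': np.float64})  # 'b' becomes NaN (float64) instead of object
kdf = ks.from_pandas(pdf)  # inference now succeeds since every column has a concrete type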

nickhalmagyi commented 4 years ago

So pandas has the "dtype" keyword, e.g.

row =  {'a': [1], 'b':[None]}
df= pd.DataFrame(row, dtype=np.int64)

It allows for only a single type to be passed, which is then enforced for all columns.

Could you allow the full schema to be passed as a keyword, much like for Spark DF's? Something like

row =  {'a': [1], 'b':[None]}
df= ks.DataFrame(row, dtypes=[np.int64, np.float64])

If the dtypes cannot be parsed such as here:

row =  {'a': [1], 'b':["I am a string"]}
df= ks.DataFrame(row, dtypes=[np.int64, np.float64])

then it would error.

HyukjinKwon commented 4 years ago

Yeah, I think the last case should work.

gosuto-inzasheru commented 4 years ago

This also occurs when initialising an empty dataframe through ks.DataFrame([]).

HyukjinKwon commented 4 years ago

Thanks for sharing that, @jorijnsmit.

ntrang086 commented 4 years ago

So what is the plan for this issue? What should we do with empty columns so that they don't throw such an error?

mwb222 commented 4 years ago

So one way I've been able to get around this issue, @ntrang086, is initializing the empty dataframe in Spark (with an explicit schema) and then converting it to Koalas.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

schema = StructType([
    StructField("a", StringType()),
    StructField("b", TimestampType()),
    StructField("c", DoubleType()),
])

kdf = spark.createDataFrame([], schema=schema).to_koalas()

ntrang086 commented 4 years ago

@mwb222 Thanks for sharing the workaround. My use case is: I start with a non-empty pandas dataframe, convert it to Koalas, and then use Koalas functionality on it. When I call Koalas groupby on such a dataframe, the groupby sometimes produces empty dataframes, and that is when I get this error.
Is there a workaround for this use case? Will I have to convert pandas to PySpark with a specified schema and then convert to Koalas?

mwb222 commented 4 years ago

Ah I see - I haven't run into that situation in my own work, so I don't know if the workaround above will help, but it might be worth a try to see if the explicit schema carries through in the background. Sorry!

ueshin commented 4 years ago

@ntrang086 How about using type hints for the function you are trying to apply, assuming you are using apply or something else that takes a user function?
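
For example, a rough sketch of what such a type-hinted function could look like with groupby.apply (the data and function names here are made up for illustration, and with positional hints the output columns get default names):

import numpy as np
import pandas as pd
import databricks.koalas as ks

kdf = ks.DataFrame({'group': ['x', 'x', 'y'], 'value': [1, 2, 3]})

def summarize(pdf: pd.DataFrame) -> ks.DataFrame[np.int64, np.float64]:
    # The applied function may legitimately return an empty pandas DataFrame;
    # the return type hint tells Koalas the schema up front, so nothing has to
    # be inferred from the (possibly empty) result.
    return pdf.assign(mean=pdf['value'].mean())[['value', 'mean']]

kdf.groupby('group').apply(summarize)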

ntrang086 commented 4 years ago

@ntrang086 How about using type hints for the function you are trying to apply, assuming you are using apply or something else that takes a user function?

The function doesn't return a single value of a specific type like str, float or int, but returns a pandas dataframe or series which can be empty, and this has caused the issue described above. Koalas already knows the return type is a pandas dataframe or series and has a problem "accepting" it :). I could try what you suggested and see how it goes, but I'm not sure it'd work.

ederfdias commented 3 years ago

JFYI... when using the read_csv function on a file with a column without values I don't receive any errors, but with read_excel() the same error is raised.

skndrg commented 3 years ago

@ederfdias Here is a possible workaround. Specify converters like below:

import numpy as np
import databricks.koalas as koalas

df_ks = koalas.read_excel(
    ...,
    converters={i: (lambda x: str(x) if x else np.NaN) for i in range(30)}  # read the first 30 columns as strings
)

skndrg commented 3 years ago

Apparently, np.NaN does the trick

import numpy as np
import databricks.koalas as koalas
from pyspark.sql import functions as F

row = {'a': [1], 'b': [np.NaN]}
koalas.DataFrame(row).to_spark().where(F.col("b").isNull()).show()

output

+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+
itholic commented 3 years ago

Now this works properly in pandas-on-Spark (available in Apache Spark 3.2 and above).

I'd recommend using pandas-on-Spark rather than Koalas, since Koalas is now in maintenance mode only.

>>> import pyspark.pandas as ps
>>> ps.DataFrame()
Empty DataFrame
Columns: []
Index: []
>>> ps.DataFrame([{"A": [None]}])
        A
0  [None]
ederfdias commented 3 years ago

In case someone lands here for any reason: I have tested with Koalas 1.8.1 and the problem (with read_excel) does not happen anymore.