Allow successful transformation of pandas df to spark df in version 2.0

cubewise-code / tm1py

TM1py is a Python package that wraps the TM1 REST API in a simple to use library.

http://tm1py.readthedocs.io/en/latest/

MIT License


Allow successful transformation of pandas df to spark df in version 2.0 #1100

Closed ysonali18 closed 3 weeks ago

ysonali18 commented 7 months ago

Describe the bug: We are getting a typecasting error while transforming a pandas df to a spark df in version 2.0, whereas the earlier version 1.11.3 transformed the dataframes without errors.

To Reproduce Below is the script,

The output with exception error,

Expected behavior: TM1py 1.11.3 doesn't have the typecasting error. We expect version 2.0 to transform the pandas df to a spark df without a typecasting error as well.

Version: TM1py 2.0.2, TM1 Server Version: 11.8.00900.3

Additional context

rclapp commented 7 months ago

This appears to be an issue with pyspark's support for NumPy's data types. TM1py doesn't make any promises regarding the interoperability of pandas and pyspark as far as I know. However, it seems to be a pretty simple conversion to fix this issue. https://stackoverflow.com/questions/73169054/not-supported-type-class-numpy-float64
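As a rough illustration of the kind of conversion meant here, the sketch below replaces NumPy scalars stored in object-dtype columns with plain Python values before the frame is handed to Spark. The helper name `to_native_types` is hypothetical, and it is an assumption (not confirmed in this thread) that the failing columns are object columns holding values such as `numpy.float64`.

```python
import numpy as np
import pandas as pd


def to_native_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which NumPy scalars inside object columns
    are replaced by their plain Python equivalents."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            # np.generic is the base class of all NumPy scalar types;
            # .item() returns the matching built-in Python value.
            converted = [
                v.item() if isinstance(v, np.generic) else v for v in out[col]
            ]
            out[col] = pd.Series(converted, index=out.index, dtype=object)
    return out


# Intended use, assuming an existing SparkSession called `spark`:
# spark_df = spark.createDataFrame(to_native_types(df))
```

Columns that mix strings and numbers stay mixed after this step, so Spark's schema inference may still need an explicit schema or a cast to string for those columns.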

MariusWirtz commented 7 months ago

Since it fails on the column YearLong, have you tried to exclude this particular attribute?

What happens when you exclude this attribute, by explicitly selecting attributes with the attributes argument in the get_elements_dataframe function?

ysonali18 commented 7 months ago

Hi @MariusWirtz, YearLong, StdAnnualLoad, CreditPointsReq, CourseDuration PT and CourseDuration are the attributes for which it throws an infer schema exception. After explicitly excluding these attributes using the attributes parameter, it successfully transforms the pandas df to a spark df.

Thanks, Sonali
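For reference, the exclusion described above could be sketched as follows. The helper `attributes_to_request` and the attribute name "Description" are hypothetical; only the five excluded names and the `attributes` argument of `get_elements_dataframe` come from this thread.

```python
def attributes_to_request(all_attributes, excluded):
    """Return the attribute names to request, preserving their order
    and dropping every name listed in `excluded`."""
    excluded_set = set(excluded)
    return [name for name in all_attributes if name not in excluded_set]


# Attributes reported in this thread as triggering the infer schema exception.
problem_attributes = [
    "YearLong",
    "StdAnnualLoad",
    "CreditPointsReq",
    "CourseDuration PT",
    "CourseDuration",
]

# "Description" is a made-up example of an attribute that converts cleanly.
keep = attributes_to_request(["Description", "YearLong", "CourseDuration"], problem_attributes)

# The resulting list would then be passed on, for example:
# df = tm1.elements.get_elements_dataframe(
#     dimension_name="d1", hierarchy_name="d1", attributes=keep)
```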

MariusWirtz commented 6 months ago

This issue may be related to the actual data.

Instead of retrieving all elements, can you please retrieve a couple of elements for which you know the attribute values are consistent and clean?

df = tm1.elements.get_elements_dataframe(dimension_name="d1", hierarchy_name="d1", elements=["e001", "e002"])

Does it work?

MariusWirtz commented 6 months ago

@ysonali18, please execute the below code with 1.11.3 and 2.0.2 and share your findings.

I would expect to see a difference in types or values between the two runs.

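from TM1py import TM1Service

# tm1params is assumed to be a dict of connection settings defined elsewhere
with TM1Service(**tm1params) as tm1:
    df = tm1.elements.get_elements_dataframe(dimension_name="d1", hierarchy_name="d1", elements=["e001", "e002"])

    print(df.dtypes)    # dtype of each column
    print(df.isna())    # which cells are missing
    print(df.to_csv())  # raw values, untruncated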
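To make a difference in types easier to spot between the two runs, a small helper along these lines could be run next to the prints above. It is only a sketch: the name `python_types_per_column` is hypothetical, and it assumes that mixed Python types inside a column are what trips up Spark's schema inference.

```python
import pandas as pd


def python_types_per_column(df: pd.DataFrame) -> dict:
    """Map each column name to the sorted names of the Python types
    found among its values."""
    return {
        col: sorted({type(value).__name__ for value in df[col]})
        for col in df.columns
    }


# Small stand-in frame; in practice this would be the df returned by TM1py.
example = pd.DataFrame({"clean": ["a", "b"], "mixed": ["a", 1.0]})
report = python_types_per_column(example)
```

A column whose entry lists more than one type name in one version but a single type in the other would be a natural first suspect.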

MariusWirtz commented 6 months ago

The second table has gaps. Please regenerate it, and make sure there are no "..." gaps by setting pd.set_option('display.max_colwidth', None) first.
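A minimal sketch of the effect of that option, using a made-up frame with one long value. `pd.option_context` is used here instead of `pd.set_option` only so the change stays local to the block.

```python
import pandas as pd

# Hypothetical attribute value, longer than the default column width limit.
long_value = "x" * 120
df = pd.DataFrame({"attribute": [long_value]})

# With default settings, long cell values are shortened and end in "...".
truncated = repr(df)

# With max_colwidth set to None, cell values are printed in full.
with pd.option_context("display.max_colwidth", None):
    full = repr(df)
```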

It's a bit hard to review in docx. Please also confirm that the output from to_csv is actually the same in both versions, with the exception of the level001 and level000 issue.