execute_mdx_dataframe() with use_blob: conversion error from dtype('O') to dtype('float64')

151N3 commented 2 months ago

Describe the bug We are trying to export some large amount of data from a cube using execute_mdx_dataframe() and have been trying around which parameters work best. Using the flags:

        use_iterative_json=True,
        skip_rule_derived_cells=True

we got some good results be reducing the output and increasing the performance and ended up from 21mins to just 7mins. When we tried the flag use_blob=True we got an unexcepted error:

Traceback (most recent call last):
  File "parsers.pyx", line 1160, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
...

tm1.cubes.cells.execute_mdx_dataframe(
  File "/opt/airflow2/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1668, in __next__
    return self.get_chunk()
  File "/opt/airflow2/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1777, in get_chunk
    return self.read(nrows=size)
  File "/opt/airflow2/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1748, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/opt/airflow2/.venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 855, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 1065, in pandas._libs.parsers.TextReader._convert_column_data
  File "parsers.pyx", line 1166, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: '7,098337950138504e-003'

To Reproduce

tm1.cubes.cells.execute_mdx_dataframe(
                            mdx=mdx,
                            skip_rule_derived_cells=True,
                            use_blob=True)

I tried, to introduce the pandas Parameter dtype="string" which worked but i got the values a string like in the error '7,098337950138504e-003' and i wanted to avoid to convert this string the value to float.

Expected behavior Regardless of the flags used, the results should be the same. In case of the use_blob=True L2426 we are not able to receive the same result. Using the pandas to create a DataFrame is a little bit more complex but should be a better way without the need use several pandas parameters to make it work.

Question: In execute_mdx_dataframe_shaped() function there is check: if all([use_blob, use_iterative_json]): raise ValueError("'use_blob' and 'use_iterative_json' must not be used together")

Is it a need for the check for the execute_mdx_dataframe() function? I see that the parameter use_iterative_json is not beeing passed self.execute_mdx_csv() here L2435

Version TM1py 2.0.2 TM1 Server Version: 11.8 pandas Version: 2.1.4

MariusWirtz commented 2 months ago

In Germany, a comma is used as a decimal separator instead of a dot.

the execute_mdx_dataframe function passes all kwargs on to the underlying pd.read_csv function that it uses to consume the TM1 response.

Please pass one additional kwarg to the execute_mdx_dataframe function: decimal=','

MariusWirtz commented 2 months ago

If this resolves the issue, we can fix it in TM1py by explicitly specifying the DatasourceASCIIDecimalSeparator in the _build_cube_to_blob_process function.

https://github.com/cubewise-code/tm1py/blob/8ddc00d94f257671e7e305aecfab9247531f0f88/TM1py/Services/CellService.py#L1419

151N3 commented 2 months ago

In Germany, a comma is used as a decimal separator instead of a dot.

the execute_mdx_dataframe function passes all kwargs on to the underlying pd.read_csv function that it uses to consume the TM1 response.

Please pass one additional kwarg to the execute_mdx_dataframe function: decimal=','

Hey @MariusWirtz, it worked perfectly. I was aware that the function passes all kwargs because i already tried chunksize and dtype parameters and the first one worked very well, so we don't have to upgrade our server where we are executing the python script :-D

Unfortunately I didn't think about the decimal parameter. Thank you very much! :)

151N3 commented 2 weeks ago

Hi @MariusWirtz , we had an edge case with the function where we had some string values in the value column and it made the debugging very difficult. I think this is more a pandas issue but still I think it should be a solution to check the data type of the value column from the TM1 side

dtype = {
    "Jahr": "string",
    "Monat": "string",
     ...
    "Value": "float64"}

df = tm1.cells.execute_mdx_dataframe(
        mdx=mdx,
        use_blob=True,
        decimal=",",
        dtype=dtype)

at some point we had the warning: DtypeWarning: Columns (12) have mixed types. Specify dtype option on import or set low_memory=False. That means, the value column wasn't converted in to a float

Dtypes: 
Jahr               object
Monat             object
...
Value             object

And at some point we get the error:

ValueError: could not convert string to float: '7,571560480147739e-002'

But the real reason was:

       Jahr Monat  ... BM_Kennzahl          Value
93890  2024  M09n  ...   Kommentar  S&T Tag 16&17
94925  2024  M09n  ...   Kommentar  S&T Tag 16&17

Now we know the issue will try to extract the data separately (float and string) so we don't get these kind of errors.

I'm not do happy that in TM1 you can use string and numeric entries for the values and there is no type definition :-)

cubewise-code / tm1py

execute_mdx_dataframe() with use_blob: conversion error from dtype('O') to dtype('float64') #1136