frlm opened this issue 1 month ago
Hey there, @frlm! I'm here to assist you with any bugs, questions, or contributions you have. Let's tackle this together!
The error you're encountering is due to the pivot_df function trying to access a column named 'Births' that doesn't exist in the DataFrame. The provided bug fix addresses this issue by splitting the first column of the DataFrame using a semicolon (;) and expanding it into multiple columns, then reassigning the original column names. This ensures that the DataFrame has the correct structure before the pivot operation is attempted.
Here's a summary of the fix:
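A minimal sketch of that workaround, assuming the semicolon separator and the sample column names shown later in this issue (not the exact patch):

# Minimal sketch, not the exact patch: split the single semicolon-delimited
# column back into real columns before pivoting.
combined = df.columns[-1]                      # e.g. "state;name;sum__num"
if ";" in combined:
    expanded = df[combined].str.split(";", expand=True)
    expanded.columns = combined.split(";")     # restore the original header
    df = expanded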
This fix should resolve the KeyError by ensuring that the DataFrame has the expected columns before the pivot operation is performed. Make sure that the DataFrame is correctly formatted before calling pivot_df, as this function expects a well-structured DataFrame with the necessary columns for pivoting.
To continue talking to Dosu, mention @dosu.
Hi, I have the exact same issue, and I see you found the solution as well. I tested your solution and it works for me. It would be nice to have it fixed in a release.
Best Regards Francesco
@frlm would you be willing to open a PR with the change? That would be fantastic, and could auto-close this issue if/when merged.
@frim @frabenetti do either of you have an easy to reproduce test case? It seems trivial to take this small change and open a PR, but it would be nice to have a failing unit test we can make pass here, or at least a way to manually test the PR for success.
CC @villebro @kgabryje in case either of you (or someone you can hot-potato this to) have any interest here.
Hi, I'm sorry for the late reply; next week I will prepare a PR with the fix. Best regards
Title: fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv
Function: apply_post_process
The issue is that pd.read_csv uses the pandas defaults instead of the parameters defined in CSV_EXPORT in superset_config. This problem is rarely noticeable when using the separator "," and the decimal ".". However, with the configuration CSV_EXPORT = {"encoding": "utf-8", "sep": ";", "decimal": ","}, the issue becomes evident. This change ensures that pd.read_csv uses the parameters defined in CSV_EXPORT.
Steps to reproduce error:
Configure CSV_EXPORT in superset_config with the following parameters:
CSV_EXPORT = {
    "encoding": "utf-8",
    "sep": ";",
    "decimal": ","
}
Click on Download > Export to Pivoted .CSV
Download is blocked by an error.
Cause: The error is generated by an anomaly in the input DataFrame df, which has the following format (a single column in which all the fields are joined by the semicolon separator):
,state;name;sum__num
0,other;Michael;1047996
1,other;Christopher;803607
2,other;James;749686
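For illustration only (the payload below is taken from the sample above, not from the original report): reading such data with the pandas defaults collapses everything into one combined column, while the configured separator parses it correctly.

from io import StringIO

import pandas as pd

payload = "state;name;sum__num\nother;Michael;1047996\nother;Christopher;803607\n"

# Default comma separator: header and rows collapse into a single column.
default_df = pd.read_csv(StringIO(payload))
print(list(default_df.columns))   # ['state;name;sum__num']

# Separator/decimal from CSV_EXPORT: the expected columns come back.
correct_df = pd.read_csv(StringIO(payload), sep=";", decimal=",")
print(list(correct_df.columns))   # ['state', 'name', 'sum__num']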
Fix: Added a bug fix to read the data with the correct CSV_EXPORT settings.
Code Changes:
elif query["result_format"] == ChartDataResultFormat.CSV:
    df = pd.read_csv(
        StringIO(data),
        delimiter=superset_config.CSV_EXPORT.get('sep'),
        encoding=superset_config.CSV_EXPORT.get('encoding'),
        decimal=superset_config.CSV_EXPORT.get('decimal'),
    )
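Depending on how this module is wired up, the same parameters may also be reachable through Flask's application config rather than importing superset_config directly; a minimal sketch of that variant, assuming the code runs inside the Flask app context:

from flask import current_app

# Sketch: assumes CSV_EXPORT is exposed via the Flask app config,
# with pandas defaults as fallbacks.
csv_export = current_app.config.get("CSV_EXPORT", {})
df = pd.read_csv(
    StringIO(data),
    delimiter=csv_export.get("sep", ","),
    encoding=csv_export.get("encoding", "utf-8"),
    decimal=csv_export.get("decimal", "."),
)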
Complete Code
def apply_post_process(
    result: dict[Any, Any],
    form_data: Optional[dict[str, Any]] = None,
    datasource: Optional[Union["BaseDatasource", "Query"]] = None,
) -> dict[Any, Any]:
    form_data = form_data or {}

    viz_type = form_data.get("viz_type")
    if viz_type not in post_processors:
        return result

    post_processor = post_processors[viz_type]

    for query in result["queries"]:
        if query["result_format"] not in (rf.value for rf in ChartDataResultFormat):
            raise Exception(  # pylint: disable=broad-exception-raised
                f"Result format {query['result_format']} not supported"
            )

        data = query["data"]

        if isinstance(data, str):
            data = data.strip()
            if not data:
                # do not try to process empty data
                continue

        if query["result_format"] == ChartDataResultFormat.JSON:
            df = pd.DataFrame.from_dict(data)
        elif query["result_format"] == ChartDataResultFormat.CSV:
            df = pd.read_csv(
                StringIO(data),
                delimiter=superset_config.CSV_EXPORT.get('sep'),
                encoding=superset_config.CSV_EXPORT.get('encoding'),
                decimal=superset_config.CSV_EXPORT.get('decimal'),
            )

        # convert all columns to verbose (label) name
        if datasource:
            df.rename(columns=datasource.data["verbose_map"], inplace=True)

        processed_df = post_processor(df, form_data, datasource)

        query["colnames"] = list(processed_df.columns)
        query["indexnames"] = list(processed_df.index)
        query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
        query["rowcount"] = len(processed_df.index)

        # Flatten hierarchical columns/index since they are represented as
        # `Tuple[str]`. Otherwise encoding to JSON later will fail because
        # maps cannot have tuples as their keys in JSON.
        processed_df.columns = [
            " ".join(str(name) for name in column).strip()
            if isinstance(column, tuple)
            else column
            for column in processed_df.columns
        ]
        processed_df.index = [
            " ".join(str(name) for name in index).strip()
            if isinstance(index, tuple)
            else index
            for index in processed_df.index
        ]

        if query["result_format"] == ChartDataResultFormat.JSON:
            query["data"] = processed_df.to_dict()
        elif query["result_format"] == ChartDataResultFormat.CSV:
            buf = StringIO()
            processed_df.to_csv(buf)
            buf.seek(0)
            query["data"] = buf.getvalue()

    return result
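As a starting point for the failing unit test requested above, here is a minimal pytest-style sketch built around the sample payload from this issue; the test name and setup are illustrative, not part of Superset's actual test suite.

from io import StringIO

import pandas as pd

# Assumed config, mirroring the CSV_EXPORT example in this issue.
CSV_EXPORT = {"encoding": "utf-8", "sep": ";", "decimal": ","}
PAYLOAD = "state;name;sum__num\nother;Michael;1047996\nother;Christopher;803607\n"


def test_read_csv_respects_csv_export_settings():
    df = pd.read_csv(
        StringIO(PAYLOAD),
        delimiter=CSV_EXPORT["sep"],
        encoding=CSV_EXPORT["encoding"],
        decimal=CSV_EXPORT["decimal"],
    )
    # With the custom separator applied, the expected columns are present,
    # so a subsequent pivot can find "sum__num" instead of raising a KeyError.
    assert list(df.columns) == ["state", "name", "sum__num"]
    assert len(df) == 2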
Bug description
Function: pivot_df
Error: The function pivot_df raised a KeyError when trying to pivot the DataFrame due to a missing column.
Log:
Steps to reproduce error:
Set CSV_EXPORT with a non-default separator and decimal (e.g. sep=";", decimal=",", as shown above)
Click on Download > Export to Pivoted .CSV
Download is blocked by an error.
Cause: The error is generated by an anomaly in the input DataFrame df, which ends up with a single column in which all the fields are joined by the semicolon separator (see the sample payload shown above).
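For illustration only (a simplified stand-in, not Superset's actual pivot_df): with such a collapsed frame, any pivot keyed on the expected columns fails with a KeyError.

import pandas as pd

# The collapsed frame has one combined column instead of "state"/"name"/"sum__num".
df = pd.DataFrame({"state;name;sum__num": ["other;Michael;1047996", "other;Christopher;803607"]})

try:
    df.pivot_table(index="state", columns="name", values="sum__num", aggfunc="sum")
except KeyError as exc:
    print("KeyError:", exc)  # the column the pivot expects is missing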
Fix: Added a bug fix to split the first column using ";" and expand it into multiple columns, then reassign the original column names.
Code Changes:
Complete Code
Screenshots/recordings
No response
Superset version
4.0.2
Python version
3.10
Node version
16
Browser
Chrome
Additional context
No response
Checklist