Gapminder / gapminder-ai

0 stars 0 forks source link

Couldn't upload Session Result DataFrame into Google Spreadsheet when DataFrame is too big #15

Open semio opened 1 year ago

semio commented 1 year ago

Hi @motin I created a cli tool for running the evaluation and cache results locally and upload to google spreadsheet. It's mostly the same process as in the run_evaluation.py notebook. But somehow I couldn't upload the session results to google spreadsheet.

https://github.com/Gapminder/gapminder-ai/blob/47851d75b34a6ed2b5dc6fdba3344c926eca382c/automation-api/lib/pilot/cli.py#L157

When I enable this line (or append_data(session_df)), it result in an error

File ~/src/work/gapminder/experiments/gapminder-ai/automation-api/lib/gsheets/gsheets_worksheet_editor.py:90, in GsheetsWorksheetEditor.replace_data(self, df)
     88 def replace_data(self, df: pd.DataFrame) -> None:
     89     self.data.df = df
---> 90     export_df = self.data.export()
     91     gspread_dataframe.set_with_dataframe(self.worksheet, export_df, resize=True)

File ~/src/work/gapminder/experiments/gapminder-ai/automation-api/lib/gsheets/gsheets_worksheet_data.py:94, in GsheetsWorksheetData.export(self)
     93 def export(self) -> pa.typing.DataFrame[DfSchemaModel]:
---> 94     return self.restore_current_row_numbers_in_formulas(self.df).rename(
     95         columns=self.attributes_to_columns_map
     96     )

File ~/src/work/gapminder/experiments/gapminder-ai/automation-api/lib/gsheets/gsheets_worksheet_data.py:88, in GsheetsWorksheetData.restore_current_row_numbers_in_formulas(self, df)
     82             row[col_name] = value.replace(
     83                 self.row_number_placeholder_in_formulas,
     84                 str(row["__row_number"]),
     85             )
     86     return row
---> 88 replaced_df = df.apply(
     89     restore_current_row_numbers, axis=1, result_type="broadcast"
     90 )
     91 return replaced_df.drop(columns=["__row_number"])

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/frame.py:9568, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   9557 from pandas.core.apply import frame_apply
   9559 op = frame_apply(
   9560     self,
   9561     func=func,
   (...)
   9566     kwargs=kwargs,
   9567 )
-> 9568 return op.apply().__finalize__(self, method="apply")

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/apply.py:754, in FrameApply.apply(self)
    752 # broadcasting
    753 if self.result_type == "broadcast":
--> 754     return self.apply_broadcast(self.obj)
    756 # one axis empty
    757 elif not all(self.obj.shape):

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/apply.py:1013, in FrameColumnApply.apply_broadcast(self, target)
   1012 def apply_broadcast(self, target: DataFrame) -> DataFrame:
-> 1013     result = super().apply_broadcast(target.T)
   1014     return result.T

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/apply.py:870, in FrameApply.apply_broadcast(self, target)
    867 result_compare = target.shape[0]
    869 for i, col in enumerate(target.columns):
--> 870     res = self.f(target[col])
    871     ares = np.asarray(res).ndim
    873     # must be a scalar or 1d

File ~/src/work/gapminder/experiments/gapminder-ai/automation-api/lib/gsheets/gsheets_worksheet_data.py:80, in GsheetsWorksheetData.restore_current_row_numbers_in_formulas.<locals>.restore_current_row_numbers(row)
     78 if col_name == "__row_number":
     79     continue
---> 80 value = row[col_name]
     81 if isinstance(value, str) and len(value) > 0 and value[0] == "=":
     82     row[col_name] = value.replace(
     83         self.row_number_placeholder_in_formulas,
     84         str(row["__row_number"]),
     85     )

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 'session_id'

But I checked the session_df, it has passed SessionResultsDf validation and the dtypes looks correct.

Then I found that it's possibly caused by the size of session_df, because if I only upload a few rows from session_df, it worked without issue. 1990 rows worked, but 3980 rows not work.

I suggest that, we can check if the DataFrame to be uploaded is too large and give a more meaningful error message instead of KeyError .

Besides, it seems that 3980 rows is not a very big number.. Is it possible to make it work with larger DataFrames? For our usage, I think we are looking at 200 questions 3 models 10 rounds * 10 prompts = 60000 rows

semio commented 1 year ago

Here are some files and script you can test:

from lib.pilot.helpers import read_ai_eval_spreadsheet
import glob
import pandas as pd

sheet = read_ai_eval_spreadsheet()

# download above csv files into a folder
session_df = pd.concat([pd.read_csv(x) for x in glob.glob('./*csv')])
session_df = SessionResultsDf.validate(session_df)

sheet.session_results.replace_data(session_df)