OpenSourceEconomics / skillmodels

MIT License

Value error in data_processor.py after deleting rows #51

Closed Johannes-Ewald closed 3 years ago

Johannes-Ewald commented 4 years ago

Issue

Due to an issue with pandas, one can run into a ValueError in y_data() in skillmodels.pre_processing.data_processor. We started with a DataFrame that included data from all 13 available HRS waves. It has a MultiIndex (id, individual_period), where individual_period = 0 marks the first time an individual is observed. We then dropped all observations from hrs_wave 13. As a result, some ids and the value individual_period = 12 no longer occur in the data.

Creating an instance of SkillModel then raises a ValueError in y_data():

--> 138 y_data[counter : counter + len(measurements), :] = df.to_numpy().T

ValueError: could not broadcast input array from shape (x,y) into shape (x,z)

Origin

pre_process_data() transforms an unbalanced dataset into a balanced panel. It uses

    all_ids, all_periods = list(df.index.levels[0]), list(df.index.levels[1])
    nobs = len(all_ids)
    nperiods = len(all_periods)

to determine the shape of the balanced panel. Due to a pandas issue, index.levels still reports values that have been deleted (see https://github.com/pandas-dev/pandas/issues/2770). The shape of balanced is therefore too large.
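
The stale-levels behaviour can be reproduced on a toy panel. This is a minimal sketch, not the actual HRS data: the frame below is made up, and only the index names and the hrs_wave column mirror the report.

```python
import pandas as pd

# Hypothetical toy panel: 2 individuals observed in 3 periods each.
index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["id", "individual_period"]
)
df = pd.DataFrame({"hrs_wave": [11, 12, 13, 11, 12, 13]}, index=index)

# Drop the last wave, as described in the report.
df = df[df["hrs_wave"] != 13]

# index.levels still contains period 2, although no row uses it any more.
stale_periods = list(df.index.levels[1])

# get_level_values only reports values that are actually present.
actual_periods = list(df.index.get_level_values("individual_period").unique())

print(len(stale_periods), len(actual_periods))
```

The two lengths differ after the rows are dropped, which is why a panel shape derived from index.levels ends up too large.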

The dimension of y_data is determined by

    dims = (self.nupdates, self.nobs)
    y_data = np.zeros(dims)

It seems that nobs comes from

    self.nobs = int(len(self.data) / self.nperiods)

(line 60 in model_spec_processor.py), where nperiods is the correct number of periods from the model specification file. The expected number of individuals is then too large.
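
The arithmetic can be sketched with made-up numbers. This is only an illustration under stated assumptions: the toy frame, the variable nperiods_spec and the final assignment are hypothetical and do not reproduce the skillmodels code path. They only show how an oversized balanced panel inflates nobs and leads to a broadcast error of the same kind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: 3 individuals, 3 periods.
index = pd.MultiIndex.from_product(
    [[1, 2, 3], [0, 1, 2]], names=["id", "individual_period"]
)
df = pd.DataFrame({"y": np.arange(9.0)}, index=index)

# Drop period 2 for everyone and individual 3 entirely.
periods = df.index.get_level_values("individual_period")
ids = df.index.get_level_values("id")
df = df[(periods < 2) & (ids != 3)]

# Balance the panel from the (stale) levels, as in the snippet quoted above.
all_ids, all_periods = list(df.index.levels[0]), list(df.index.levels[1])
balanced_index = pd.MultiIndex.from_product(
    [all_ids, all_periods], names=df.index.names
)
balanced = df.reindex(balanced_index)  # 3 * 3 = 9 rows instead of 2 * 2 = 4

nperiods_spec = 2  # assumed: the correct value from the model specification
nobs = int(len(balanced) / nperiods_spec)  # int(9 / 2) = 4
true_nobs = df.index.get_level_values("id").nunique()  # 2

# An array sized with the inflated nobs cannot take data for true_nobs people.
y_data = np.zeros((1, nobs))
observed = np.ones((1, true_nobs))
try:
    y_data[0:1, :] = observed
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
```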

Possible solution

  1. After deleting rows, resetting and setting the index circumvents this problem. Here: df = df.reset_index().set_index(["id", "individual_period"]).sort_index()
  2. Using len(df.index.get_level_values("id").unique()) instead of len(list(df.index.levels[0])) gives the correct value.
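
Both workarounds can be checked on a toy frame. This is a minimal sketch with made-up data. The third variant, MultiIndex.remove_unused_levels(), is a pandas method that is not mentioned in this thread and is included here only as an assumption about what might also help.

```python
import pandas as pd

# Hypothetical toy panel: 2 individuals, 3 periods; the last wave is dropped.
index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["id", "individual_period"]
)
df = pd.DataFrame({"hrs_wave": [11, 12, 13, 11, 12, 13]}, index=index)
df = df[df["hrs_wave"] != 13]

# Solution 1: rebuild the index so that the levels are recomputed.
rebuilt = df.reset_index().set_index(["id", "individual_period"]).sort_index()
n_periods_rebuilt = len(rebuilt.index.levels[1])

# Solution 2: count the values that are actually present.
n_periods_observed = len(
    df.index.get_level_values("individual_period").unique()
)

# Not from the thread: prune unused levels without rebuilding the index.
pruned = df.copy()
pruned.index = pruned.index.remove_unused_levels()
n_periods_pruned = len(pruned.index.levels[1])

print(n_periods_rebuilt, n_periods_observed, n_periods_pruned)
```

All three counts agree with the number of periods left in the data, while len(df.index.levels[1]) on the untouched frame still counts the dropped period.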

janosg commented 4 years ago

Thanks for this detailed error message. I prefer the second solution but the first one is also fine. @Johannes-Ewald can you make a PR starting from the current master branch? If you have any questions about the workflow please ask @mpetrosian for help.

janosg commented 3 years ago

Solved.