Closed Johannes-Ewald closed 3 years ago
Thanks for this detailed error message. I prefer the second solution but the first one is also fine. @Johannes-Ewald can you make a PR starting from the current master branch? If you have any questions about the workflow please ask @mpetrosian for help.
Solved.
Issue
Due to an issue with pandas, one can run into a ValueError in skillmodels.pre_processing.data_processor y_data(). We started with a DataFrame that included data from all 13 available HRS waves. It has a MultiIndex (id, individual_period), where individual_period = 0 means it is the first time an individual is observed. We then dropped all observations from hrs_wave 13. This means that some ids and individual_period = 12 no longer occur.
Creating an instance of SkillModel results in a ValueError in y_data():
--> 138 y_data[counter : counter + len(measurements), :] = df.to_numpy().T
ValueError: could not broadcast input array from shape (x,y) into shape (x,z)
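For readers who want to see the failure mode in isolation, here is a minimal sketch with NumPy only. The shapes and variable names (n_meas, nobs_expected, nobs_actual, block) are made up for illustration and are not taken from skillmodels; the point is only that the target array is wider than the data assigned into it.

```python
import numpy as np

# Hypothetical sizes, chosen only to illustrate the mismatch.
n_meas = 2           # number of measurements in one update block
nobs_expected = 5    # number of individuals the target array was sized for
nobs_actual = 4      # number of individuals actually present in the data

# Target array, analogous to y_data: one row per update, one column per individual.
y_data = np.zeros((6, nobs_expected))

# One period of data: rows are individuals, columns are measurements.
block = np.ones((nobs_actual, n_meas))

message = None
try:
    # Same pattern as the failing line: assign the transposed block into a row slice.
    y_data[0:n_meas, :] = block.T
except ValueError as err:
    message = str(err)
    print(message)
```

The assignment fails because the second dimension of the slice and of the transposed block differ, which is the same pattern as (x,y) versus (x,z) in the traceback above.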
Origin
pre_process_data() transforms an unbalanced dataset into a balanced panel. It uses
all_ids, all_periods = list(df.index.levels[0]), list(df.index.levels[1])
nobs = len(all_ids)
nperiods = len(all_periods)
to determine the shape of the balanced panel. Due to a pandas issue, index.levels still reports values that have been deleted. See: https://github.com/pandas-dev/pandas/issues/2770 As a result, the shape of the balanced panel is too large.
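The stale levels can be reproduced on a toy panel. The sketch below uses invented data (three ids, a wave column, and a last wave that gets dropped) to mirror the situation described above; none of the values come from the HRS data.

```python
import pandas as pd

# Toy unbalanced panel: (id, individual_period, wave). Values are invented.
rows = [
    ("a", 0, 1), ("a", 1, 2), ("a", 2, 3),
    ("b", 0, 2), ("b", 1, 3),
    ("c", 0, 3),
]
df = pd.DataFrame(rows, columns=["id", "individual_period", "wave"])
df = df.set_index(["id", "individual_period"])

# Drop all observations from the last wave, as in the issue.
trimmed = df[df["wave"] != 3]

# index.levels still lists values that no longer occur in the data ...
stale_ids = list(trimmed.index.levels[0])
stale_periods = list(trimmed.index.levels[1])

# ... while the values actually present in the index are fewer.
actual_ids = list(trimmed.index.get_level_values("id").unique())
actual_periods = list(
    trimmed.index.get_level_values("individual_period").unique()
)

print(len(stale_ids), len(stale_periods))    # sizes taken from the levels
print(len(actual_ids), len(actual_periods))  # sizes taken from the data
```

Here id "c" and individual_period 2 only occur in the dropped wave, so they vanish from the data but remain in index.levels.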
The dimension of y_data is determined by
dims = (self.nupdates, self.nobs)
y_data = np.zeros(dims)
It seems like nobs comes from
self.nobs = int(len(self.data) / self.nperiods)
(row 60 in model_spec_processor.py) where nperiods is the correct number of periods from the model specification file. The expected number of individuals is then too large.
Possible solution
df = df.reset_index().set_index(["id", "individual_period"]).sort_index()
len(df.index.get_level_values("id").unique())
instead of
len(list(dataset.index.levels[0]))
shows the correct value.
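As a rough sketch of how the balancing step could avoid the stale levels, the helper below calls pandas' MultiIndex.remove_unused_levels() before reading the levels. The function name balance_panel and the toy data are hypothetical; this is not the actual skillmodels implementation, only an illustration of the idea under those assumptions.

```python
import pandas as pd


def balance_panel(df):
    """Hypothetical helper: reindex df to a balanced (id, period) panel.

    Levels are cleaned first so that deleted ids and periods are not counted.
    """
    index = df.index.remove_unused_levels()
    all_ids, all_periods = list(index.levels[0]), list(index.levels[1])
    full_index = pd.MultiIndex.from_product(
        [all_ids, all_periods], names=index.names
    )
    # Missing (id, period) combinations are filled with NaN.
    balanced = df.reindex(full_index)
    return balanced, len(all_ids), len(all_periods)


# Toy data (invented): id "c" and period 2 only appear in wave 3.
rows = [
    ("a", 0, 1), ("a", 1, 2), ("a", 2, 3),
    ("b", 0, 2), ("b", 1, 3),
    ("c", 0, 3),
]
df = pd.DataFrame(rows, columns=["id", "individual_period", "wave"])
df = df.set_index(["id", "individual_period"])
trimmed = df[df["wave"] != 3]

balanced, nobs, nperiods = balance_panel(trimmed)
print(balanced.shape, nobs, nperiods)
```

With the cleaned levels, the balanced panel has nobs * nperiods rows, so a later computation such as len(data) / nperiods recovers the number of individuals actually present.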