NumEconCopenhagen / lectures-2020

MIT License
12 stars 29 forks source link

Question when merging two dataframes #7

Open PhilipHass opened 4 years ago

PhilipHass commented 4 years ago

For the DataProject we've loaded two dataFrames from DST, one for the National account identity and one for employment, and we want to merge them. Both have the columns "INDHOLD" and "TID" and we want to mergen on "TID", the problem is when we merge them we get two columns for "INDHOLD" in the merged dataframe, "INDHOLD_x" and "INDHOLD_y", but we'd like to have just one column for "INDHOLD"

The code we use for merging them is: NAI.merge(EMP, on = "TID", how = "left")

orstrand commented 4 years ago

HI Philip

Does the "INDHOLD" column contain the same data? If so, you can merge on both columns

pd.merge(df1, df2, on=['A', 'B']) (It will thus only merge those rows that have both identical A and B column values)

Or, alternatively, you can just drop one of the "INDHOLD" columns and rename the other.

If they both contain different string values (I assume that INDHOLD contains the list of values in a given row?), then you can conjugate

df["INDHOLD"] = df["INDHOLD_x"] + df["INDHOLD_y"]

and then drop the two columns.

If not, which data do you wish to have in the INDHOLD column of the merged data frame?

Magnus

AskerNC commented 4 years ago

Looking at your repository at https://github.com/NumEconCopenhagen/projects-2020-group-xxv/blob/master/dataproject/dataproject.ipynb I think the problem is more with the structure of your data. If you merge on year and keep only one "INDHOLD" variable you won't be able to tell whether or not "INDHOLD" shows Total employment or for example GDP. One solution to this would be to rename the "INDHOLD" to "Total employment (1000 persons)" such that each observation for a given year shows the "INDHOLD" for a given TRANSAKT (GDP, P3 etc.) and the Total Employment. (but thats a bit like having INDHOLD_x and INDHOLD_y. I would actually suggest staying true to the format of your data. This means that you want to append the data set instead of merging them so it behaves like the observations in NAI. Does that make sense?

Best regards Asker

PhilipHass commented 4 years ago

I think it worked, by doing an outer-merging on all 3 columns ["TID", "TRANSAKT, "INDHOLD"]. We now have the employment data as rows with the data in the same column as the other variables, just like we wanted. And we we sort by time, we have data for alle variables for a given year, with the all observations in the same "INDHOLD" column. Thanks for the suggestions.