Maksim-Sheyngalts / PythonTool

1 stars 0 forks source link

DataFrames #3

Open Maksim-Sheyngalts opened 2 years ago

Maksim-Sheyngalts commented 2 years ago

How to Merge DataFrames in Pandas - merge(), join(), append(), concat() and update()

Merge DataFrames Using join()

Unlike merge() which is a method of the Pandas instance, join() is a method of the DataFrame itself. This means that we can use it like a static method on the DataFrame: DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False).

The DataFrame we call join() from will be our left DataFrame. The DataFrame in the other argument would be our right DataFrame.

The on parameter can take one or more (['key1', 'key2' ...]) arguments to define the matching key, while how parameter takes one of the handle arguments (left, right, outer, inner), and it's set to left by default.

Let's try to join df2 to df1:

df_join = df1.join(df2, rsuffix='_right') print(df_join)

Like the merge() function, the join() function automatically tries to match the keys (columns) with the same name. In our case, it's the user_id key.

The above code prints out this:

user_id first_name last_name email user_id_right image_url 0 id001 Rivi Valti rvalti0@example.com id001 http://example.com/img/id001.png 1 id002 Wynnie McMurty wmcmurty1@example.com id002 http://example.com/img/id002.jpg 2 id003 Kristos Ivanets kivanets2@example.com id003 http://example.com/img/id003.bmp 3 id004 Madalyn Max mmax3@example.com id004 http://example.com/img/id004.jpg 4 id005 Tobe Riddich triddich4@example.com id005 http://example.com/img/id005.png 5 id006 Regan Huyghe rhuyghe@example.com NaN NaN 6 id007 Kristin Illis killis4@example.com NaN NaN

You probably noticed a "duplicate column" called user_id_right. If you don't want to display that column, you can set the user_id columns as an index on both columns so it would join without a suffix:

df_join_no_duplicates = df1.set_index('user_id').join(df2.set_index('user_id')) print(df_join_no_duplicates)

By doing so, we are getting rid of the user_id column and setting it as the index column instead. This provides us with a cleaner resulting DataFrame:

first_name last_name email image_url user_id id001 Rivi Valti rvalti0@example.com http://example.com/img/id001.png id002 Wynnie McMurty wmcmurty1@example.com http://example.com/img/id002.jpg id003 Kristos Ivanets kivanets2@example.com http://example.com/img/id003.bmp id004 Madalyn Max mmax3@example.com http://example.com/img/id004.jpg id005 Tobe Riddich triddich4@example.com http://example.com/img/id005.png id006 Regan Huyghe rhuyghe@example.com NaN id007 Kristin Illis killis4@example.com NaN

—--------------- Merge DataFrames

Using append()

As the official Pandas documentation points, since concat() and append() methods return new copies of DataFrames, overusing these methods can affect the performance of your program.

Append is very useful when you want to merge two DataFrames in row axis only. This means that instead of matching data on their columns, we want a new DataFrame that contains all the rows of 2 DataFrames.

Let's append df2 to df1 and print the results:

df_append = df1.append(df2, ignore_index=True) print(df_append)

Using append() will not match DataFrames on any keys. It will just add the other DataFrame to the first and return a copy of it. If the shapes of DataFrames do not match, Pandas will replace any unmatched cells with a NaN.

The output for appending the two DataFrames looks like this:

user_id first_name last_name email image_url 0 id001 Rivi Valti rvalti0@example.com NaN 1 id002 Wynnie McMurty wmcmurty1@example.com NaN 2 id003 Kristos Ivanets kivanets2@example.com NaN 3 id004 Madalyn Max mmax3@example.com NaN 4 id005 Tobe Riddich triddich4@example.com NaN 5 id006 Regan Huyghe rhuyghe@example.com NaN 6 id007 Kristin Illis killis4@example.com NaN 7 id001 NaN NaN NaN http://example.com/img/id001.png 8 id002 NaN NaN NaN http://example.com/img/id002.jpg 9 id003 NaN NaN NaN http://example.com/img/id003.bmp 10 id004 NaN NaN NaN http://example.com/img/id004.jpg 11 id005 NaN NaN NaN http://example.com/img/id005.png

Merge DataFrames Using concat()

Concatenation is a bit more flexible when compared to merge() and join() as it allows us to combine DataFrames either vertically (row-wise) or horizontally (column-wise).

The trade-off is that any data that doesn't match will be discarded. Here's the full function with the parameters:

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Here are the most commonly used parameters for the concat() function:

objs is the list of DataFrame objects ([df1, df2, ...]) to be concatenated

axis defines the direction of the concatenation, 0 for row-wise and 1 for column-wise

join can either be inner (intersection) or outer (union)

ignore_index by default set to False which allows the index values to remain as they were in the original DataFrames, can cause duplicate index values. If set to True, it will ignore the original values and re-assign index values in sequential order

keys allows us to construct a hierarchical index. Think of it as another level of the index that appended on the outer left of the DataFrame that helps us to distinguish indices when values are not unique

Let's create a new DataFrame with the same column types with the df2, but this one includes the image_url for id006 and id007:

df2_addition = pd.DataFrame({'user_id': ['id006', 'id007'], 'image_url': ['http://example.com/img/id006.png', 'http://example.com/img/id007.jpg'] })

In order to join df2 and df2_addition row-wise, we can pass them in a list as the objs parameter and assign the resulting DataFrame to a new variable:

df_row_concat = pd.concat([df2, df2_addition]) print(df_row_concat)

We successfully filled in the missing values:

user_id image_url 0 id001 http://example.com/img/id001.png 1 id002 http://example.com/img/id002.jpg 2 id003 http://example.com/img/id003.bmp 3 id004 http://example.com/img/id004.jpg 4 id005 http://example.com/img/id005.png 0 id006 http://example.com/img/id006.png 1 id007 http://example.com/

However, have a look at the indices in the left-most column. The indices 0 and 1 are repeating. To get entirely new and unique index values, we pass True to the ignore_index parameter:

df_row_concat = pd.concat([df2, df2_addition], ignore_index=True)

combine_first() and update()

In some cases, you might want to fill the missing data in your DataFrame by merging it with another DataFrame. By doing so, you will keep all the non-missing values in the first DataFrame while replacing all NaN values with available non-missing values from the second DataFrame (if there are any).

For this example, we will import NumPy to use NaN values. If you installed Pandas with pip, NumPy should already be installed.

Type the following code in your Python shell or script file:

import numpy as np df_first = pd.DataFrame({'COL 1': ['X', 'X', np.nan], 'COL 2': ['X', np.nan, 'X'], 'COL 3': [np.nan, 'X', 'X']}, index=range(0, 3)) df_second = pd.DataFrame({'COL 1': [np.nan, 'O', 'O'], 'COL 2': ['O', 'O', 'O']}, index=range(0, 3)) print(df_first) print(df_second)

The df_first DataFrame has 3 columns

and 1 missing value in each of them:

COL 1 COL 2 COL 3 0 X X NaN 1 X NaN X 2 NaN X X

While df_second has only 2 columns and one missing value in the first column:

COL 1 COL 2 0 NaN O 1 O O 2 O O

We can use df_second to patch missing values in df_first with all corresponding values:

df_tictactoe = df_first.combine_first(df_second) print(df_tictactoe)

As mentioned earlier, using the combine_first() method will only replace NaN values in index wise order, and it will leave every non-missing values in the first DataFrame as they are:

COL 1 COL 2 COL 3 0 X X NaN 1 X O X 2 O X X

On the other hand, if we wanted to overwrite the values in df_first with the corresponding values from df_second (regardless they are NaN or not), we would use the update() method.

Let's first add a another DataFrame to our code:

df_third = pd.DataFrame({'COL 1': ['O'], 'COL 2': ['O'], 'COL 3': ['O']}) print(df_third)

The shape is (1, 3) - 1 row and three columns, excluding the index:

COL 1 COL 2 COL 3 0 O O O

Now let's update the df_first with the values from df_third:

df_first.update(df_third) print(df_first)

Keep in mind that unlike combine_first(), update() does not return a new DataFrame. It modifies the df_first in-place, altering the corresponding values:

COL 1 COL 2 COL 3 0 O O O 1 X NaN X 2 NaN X X

The overwrite parameter of the update() function is set to True by default. This is why it changes all corresponding values, instead of only NaN values. We can change it to False to replace only NaN values:

df_tictactoe.update(df_first, overwrite=False) print(df_tictactoe)

Here's the final state of our df_tictactoe DataFrame:

COL 1 COL 2 COL 3 0 X X O 1 X O X 2 O X X

Not only did we successfully update the values, but we also won the Tic-Tac-Toe game!

Conclusion

Pandas provides powerful tools for merging DataFrames. But it can be hard to decide when to use what. While most of the times merge() function is sufficient, for some cases you might want to use concat() to merge row-wise, or use join() with suffixes, or get rid of missing values with combine_first() and update(). You can even add rows of data with append().

Use the function that you're most comfortable with, and is best for the task at hand. How would these functions help you manipulate data in Pandas?

https://stackabuse.com/how-to-merge-dataframes-in-pandas/