Maksim-Sheyngalts / PythonTool

1 stars 0 forks source link

Python_Pandas_Cheat_Sheet #5

Open Maksim-Sheyngalts opened 2 years ago

Maksim-Sheyngalts commented 2 years ago

https://github.com/deganza/jupyter_pandas_cheat_sheet/blob/main/Jupyter_Pandas_Cheat_Sheet.ipynb

Maksim-Sheyngalts commented 2 years ago

https://github.com/deganza/jupyter_pandas_cheat_sheet/blob/main/Jupyter_Pandas_Cheat_Sheet.ipynb

Maksim-Sheyngalts commented 2 years ago

Introduction

In [ ]:

import pandas as pd import numpy as np

In [ ]:

import csv-File in dataframe df df = pd.read_csv('Data/IMDB_Movie_Data.csv' ,sep=",") df

In [ ]:

export dataframe df to csv file my-data df.to_csv ('Data/my-data.csv' , index = False, header=True)

Exploring Data

In [ ]:

df.head() # first five rows

In [ ]:

df.tail() # last five rows

In [ ]:

df.sample(5) # random sample of rows

In [ ]:

df.shape # number of rows/columns

In [ ]:

df.describe() # calculates measures of central tendency

In [ ]:

df.info() # memory footprint and datatypes

Statistics

In [ ]:

df.describe() # Summary statistics for numerical columns

In [ ]:

df.mean() # Returns the mean of all columns

In [ ]:

df.corr() # Returns the correlation between columns in a DataFrame

In [ ]:

df.count() # Returns the number of non-null values in each DataFrame column

In [ ]:

df.max() # Returns the highest value in each column

In [ ]:

df.min() # Returns the lowest value in each column

In [ ]:

df.median() # Returns the median of each column

In [ ]:

df.std() # Returns the standard deviation of each column

Import Data from Files

 Import csv

In [ ]:

df = pd.read_csv('Data/my-data.csv' ,sep=",") df

Import xls

In [ ]:

df = pd.read_excel('Data/my-data.xlsx')

In [ ]:

df = pd.read_excel('Data/my-data.xlsx', sheetname='sheet1', skiprows=[1] # header data )

Export Data to files

Export xls

In [ ]:

df.to_excel('Data/my-data.xlsx')

Export csv

In [ ]:

df.to_csv ('Data/my-data.csv' , index = False, header=True)

Column manipulation

Column Filter

In [ ]:

df[['Title','Rating']]

In [ ]:

df.filter(['Title','Rating'])

Column Rename

In [ ]:

df.rename(columns={'Title': 'a', 'Rating': 'c'},inplace=True) df

In [ ]:

rename the columns back df.rename(columns={'a': 'Title', 'c': 'Rating'},inplace=True) df

Column Resorter/Reorder

In [ ]:

show column values df.columns.values

In [ ]:

reorder Rating after Title df[['Title', 'Rating','Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime (Minutes)', 'Votes', 'Revenue (Millions)', 'Metascore']]

Constant Value Column

In [ ]:

df['new_column'] = 23 df.head()

Math Formula

In [ ]:

df['Rating_Votes'] = df.Rating + df.Votes df[['Rating_Votes','Rating','Votes']].head()

Number to String

In [ ]:

df['Year_str'] =df['Year'].astype(str) df.info()

String to Number

In [ ]:

df['Year_int'] =df['Year_str'].astype(int) df.info()

Double to Int

In [ ]:

df['Rating_int'] = df['Rating'].round(0).astype(int) df[['Rating_int','Rating']].head()

String Replacer

In [ ]:

df['Title'].replace('Prometheus', 'Alien') df[df.Title == 'Prometheus']

String Manipulation

In [ ]:

lower cases df['Title2'] = df['Title'].str.lower() df[['Title2','Title']].head()

In [ ]:

upper cases df['Title2'] = df['Title'].str.upper() df[['Title2','Title']].head()

In [ ]:

length of words df['Title2'] = df['Title'].str.len() df[['Title2','Title']].head()

In [ ]:

first word df['Title2'] = df['Title'].str.split(' ').str[0] df[['Title2','Title']].head()

In [ ]:

find the word "Squad" in Title df['Title2'] = df['Title'].str.find('Squad', 0) df[['Title2','Title']].head()

Date manipulation

In [ ]:

pd.to_datetime('2010/11/12')

Sort

In [ ]:

df.sort_values(by='Title', ascending=True)

In [ ]:

df.sort_values(by=['Director','Year'], ascending=True)

Row manipulation

Row Filter

In [ ]:

select Title 'Prometheus' df[df.Title == 'Prometheus']

In [ ]:

select Rating greater or equal 8.5 df[df.Rating >= 8.5]

In [ ]:

select Year equal 2016 and Rating greater or equal 8.5 df[(df.Year == 2016) & (df.Rating >= 8.5)]

In [ ]:

select Title with 'Prometheus','Sing', 'Guardians of the Galaxy' titel = ['Prometheus','Sing', 'Guardians of the Galaxy'] df[df.Title.isin(titel)]

In [ ]:

select years in 2010,2015,002 years = [2010,2015,2002] df[df.Year.isin(years)]

In [ ]:

Selects rows 1-to-3 df.iloc[0:3]

Maksim-Sheyngalts commented 2 years ago

select years in 2010,2015,002 years = [2010,2015,2002] df[df.Year.isin(years)]

In [ ]:

Selects rows 1-to-3 df.iloc[0:3]

In [ ]:

First 4 rows and first 2 columns df.iloc[0:4, 0:2]

Table Manipulation

Group By

In [ ]:

number of titles per year df.groupby("Year")["Title"].count().to_frame()

In [ ]:

number of titles per year and per director df.groupby(["Year","Director"])["Title"].count().to_frame().reset_index()

In [ ]:

number of titles per director df.groupby(["Director"])["Title"].count().to_frame(name = 'count').reset_index()

In [ ]:

Total revenue per year and per director df.groupby(["Year","Director"])["Revenue (Millions)"].sum().to_frame().reset_index()

In [ ]:

Rating-Mean per director df.groupby("Director")["Rating"].mean().to_frame().reset_index()

In [ ]:

combination of different group by functions df.groupby(["Year","Director"]).agg( { 'Title':"count", # number of titles per year and director 'Rating':"mean", # Rating-Mean per director 'Revenue (Millions)': "sum" # Total revenue per year and director } ).reset_index()

Pivot / Unpivot

In [ ]:

Pivot over Director and mean over all other columns pd.pivot_table(df,index=["Director"]).reset_index()

In [ ]:

Pivot with sum df_rev_sum = pd.pivot_table(df,index=["Director","Year"],values=["Revenue (Millions)"],aggfunc=np.sum).reset_index() df_rev_sum

In [ ]:

Unpivot over years df_rating = pd.pivot_table(df,values=['Rating'], columns=['Year']).reset_index() df_rating

In [ ]:

df4.melt(id_vars=['index'],var_name='Year',value_name='Title')

Join

In [ ]:

create new dataframe "df_dir_movies" df_dir_movies = df.groupby(["Director"])["Title"].count().to_frame(name = 'number of movies').reset_index() df_dir_movies

In [ ]:

create new dataframe "df_dir_rev" df_dir_rev = df.groupby(["Director"])["Revenue (Millions)"].sum().to_frame(name = 'Revenue').reset_index() df_dir_rev

In [ ]:

join the dataframe "df_dir_movies" with "df_dir_rev" # how = rigtht, left, inner or outer pd.merge(df_dir_movies,df_dir_rev, left_on=['Director'], right_on=['Director'],how = 'left')

Concatenate

In [ ]:

df2 = df df.append(df2) # Append df2 to df (The columns must be the same in both dataframes)

In [ ]:

pd.concat([df, df2],axis=0) # concatenate two dataframes

Import Data from Databases

Import from mysql

In [ ]:

import pymysql conn = pymysql.connect(host='localhost',port=3306, db='database',user='root',password='pw')

Maksim-Sheyngalts commented 2 years ago

df = pd.read_sql_query( "SELECT * FROM table LIMIT 3;", conn) df.tail(100)

Import Teradata

In [ ]:

import teradata #Make a connection session = udaExec.connect(method="odbc", USEREGIONALSETTINGS="N", system= "dwh", username = "root", password = "pw"); query = "SELECT * FROM DATABASEX.TABLENAMEX" #Reading query to df df = pd.read_sql(query,session) # do something with df,e.g. print(df.head()) #to see the first 5 rows

Import SAP-Hana

In [ ]:

import pyhdb connection = pyhdb.connect( host="localhost", port=30015, user="root, password="pw" ) print(connection.isconnected()) query = "SELECT * FROM HDB_REPORT.\"Table\"" df = pd.read_sql(query,connection) # do something with df,e.g. print(df.head()) #to see