Calgary-Data-Science-Academy / python-study

Python Study Progress Tracking

Chapter 5 Panda #3

Open radium05 opened 5 years ago

radium05 commented 5 years ago

Study Notes of pandas

Series and DataFrame

Series and DataFrame are the two workhorse data structures of pandas. Think of a Series as a single column in Excel and a DataFrame as a whole table.

import numpy as np  # used by the later examples (np.arange, np.nan, np.abs)
import pandas as pd

1. Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.

1.1. Example

obj = pd.Series([4, 7, -5, 3])

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively.
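A minimal sketch of those two attributes, reusing the obj Series from Example 1.1:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

values = obj.values   # NumPy array holding the data
index = obj.index     # default integer index: 0, 1, 2, 3

print(values)
print(index)
```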

1.2. Example

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values. This is very important because it allows us to conduct filtering easily.

obj2['a']
obj2[['c','a','d']]
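As a small illustration of why labels help with filtering (a sketch that rebuilds obj2 from Example 1.2), a boolean condition keeps only the matching entries together with their labels:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = obj2[obj2 > 0]   # boolean filtering keeps the labels of the survivors
doubled = obj2 * 2          # arithmetic preserves the index as well
has_b = 'b' in obj2         # membership tests look at the index labels

print(positive)
```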

If the data is already in a Python dict, you can create a Series from it by passing the dict. The dict keys become the index labels.

1.3. Example

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)

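Override the index by passing a list of labels as the index argument. Labels that are not keys in the dict (here 'California') get NaN, and keys that are not in the list (here 'Utah') are left out.

states = ['California', 'Ohio', 'Oregon', 'Texas']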
obj4 = pd.Series(sdata, index=states)
obj4

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations. Labels present in only one operand produce NaN in the result.

obj5 = pd.Series(sdata)
obj4_5 = obj4 + obj5

2. DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). Think about Excel!!!

2.1. Example

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

2.2. Example

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
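A short sketch of what Example 2.2 produces, assuming the same data dict: the 'debt' column is not in the dict, so it is filled with NaN, and columns and rows can then be pulled out by label.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

debt_missing = frame2['debt'].isna().all()  # True: no 'debt' key in the dict
state_col = frame2['state']                 # a single column comes back as a Series
row_three = frame2.loc['three']             # a single row, selected by its label

frame2['debt'] = 16.5                       # assigning a scalar fills the whole column
print(frame2)
```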

3. Essential Functionality

Fundamental mechanics of interacting with the data contained in a Series or DataFrame.

3.1. Example

Reindex - series

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

Reindex - DataFrame

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
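Reindexing introduces missing values for labels that were not there before. The sketch below rebuilds the same frame and shows the default behaviour next to the fill_value and columns options of reindex (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

with_nan = frame.reindex(['a', 'b', 'c', 'd'])                 # new row 'b' is all NaN
with_zero = frame.reindex(['a', 'b', 'c', 'd'], fill_value=0)  # new row 'b' is all 0
by_column = frame.reindex(columns=['Texas', 'Utah', 'California'])  # new column 'Utah' is NaN

print(with_nan)
print(with_zero)
print(by_column)
```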

3.2. Example

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values: each new label takes the value of the closest existing label before it.

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
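To see what ffill changes, this sketch compares the plain reindex with the forward-filled one on the same obj3:

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

plain = obj3.reindex(range(6))                   # labels 1, 3, 5 are new, so they are NaN
filled = obj3.reindex(range(6), method='ffill')  # new labels copy the previous value

print(plain)
print(filled)
```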

3.3. Example

Dropping entries

new_obj = obj.drop('c')
new_frame = frame2.drop(['Ohio'], axis='columns')  # 'Ohio' is a column label, so drop along the column axis

3.4. Example

Arithmetic methods with fill values

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1 + df2
df1.add(df2, fill_value=0)

Method                 Description
add, radd              Methods for addition (+)
sub, rsub              Methods for subtraction (-)
div, rdiv              Methods for division (/)
floordiv, rfloordiv    Methods for floor division (//)
mul, rmul              Methods for multiplication (*)
pow, rpow              Methods for exponentiation (**)
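The difference between the + operator and add with fill_value, and the meaning of the r-prefixed methods, can be sketched as follows (same df1 and df2 as in Example 3.4; the result names are only for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

plain = df1 + df2                    # NaN wherever either side has no value
filled = df1.add(df2, fill_value=0)  # a missing value on one side is treated as 0

reciprocal = df1.rdiv(1)             # reversed arguments: same as 1 / df1

print(plain)
print(filled)
```

With fill_value, a cell is still NaN only when both operands are missing at that position.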

3.5. Example

Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects.

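frame = pd.DataFrame(data)  # rebuild the six-row frame from Example 2.1
frame['val'] = [-3, -1, 0, 3, -4, 6]
del frame['state']  # drop the string column so only numeric columns remain
np.abs(frame)

3.6. Example

f = lambda x: x.max() / x.min()
frame.apply(f)  # f is called once per column
frame.apply(f, axis='columns')  # f is called once per row
fmt = lambda x: '%.2f' % x  # named fmt so the built-in format is not shadowed
frame.applymap(fmt)  # element-wise; deprecated since pandas 2.1 in favor of frame.map(fmt)
frame['val'].map(fmt)  # Series.map is element-wise as well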
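A compact sketch of the three levels of function application on a small hypothetical frame (the frame, its labels, and the spread function are made up for illustration and are not part of the notes above):

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame holding the values 0..5
frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['r1', 'r2'], columns=['x', 'y', 'z'])

spread = lambda s: s.max() - s.min()

per_column = frame.apply(spread)                  # one result per column
per_row = frame.apply(spread, axis='columns')     # one result per row
formatted = frame['x'].map(lambda v: '%.2f' % v)  # one result per element

print(per_column)
print(per_row)
print(formatted)
```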

3.7. Example

Sorting and Ranking

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c']) 
frame.sort_index()
frame.sort_index(axis=1) 
frame.sort_index(axis=1, ascending=False)
obj.sort_values()
frame.sort_values(by=['a', 'b'])
obj.rank() 
obj.rank(method='first')
frame.rank(axis='columns')

Method         Description
'average'      Default: assign the average rank to each entry in the equal group
'min', 'max'   Use the minimum/maximum rank for the whole group
'first'        Assign ranks in the order the values appear in the data
'dense'        Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group
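The tie-breaking methods only differ when values repeat. The sketch below uses a small hypothetical Series with one tie (the two 7s) to compare them side by side:

```python
import pandas as pd

# Hypothetical data: the value 7 appears twice
s = pd.Series([7, -5, 7, 4, 9])

ranks = pd.DataFrame({
    'average': s.rank(),                # tied values share the mean of their ranks
    'min': s.rank(method='min'),        # tied values all get the lowest rank
    'max': s.rank(method='max'),        # tied values all get the highest rank
    'first': s.rank(method='first'),    # ties broken by order of appearance
    'dense': s.rank(method='dense'),    # like 'min', but no gaps after a tie
})

print(ranks)
```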

3.8. Example

Axis Indexes with Duplicate Labels

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj.index.is_unique
obj['a']
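With duplicate labels, the type of the result depends on whether the label is repeated. A minimal sketch using the same obj:

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

unique = obj.index.is_unique   # False, because 'a' and 'b' repeat
many = obj['a']                # repeated label: a Series with both entries
single = obj['c']              # unique label: a single scalar value

print(unique)
print(many)
print(single)
```

Code that indexes by label may need to handle both cases when the index is not guaranteed to be unique.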

4. Computation

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these are reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. By default they skip missing values (skipna=True).

4.1. Example

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df.sum()
df.sum(axis='columns')
df.mean(axis='columns', skipna=False)
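The effect of the axis and skipna options can be checked on the same df (the result names are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

col_sums = df.sum()                                     # one value per column, NaN skipped
row_sums = df.sum(axis='columns')                       # one value per row, NaN skipped
strict_means = df.mean(axis='columns', skipna=False)    # NaN if any value in the row is NaN

print(col_sums)
print(row_sums)
print(strict_means)
```

Row 'c' contains only NaN, so its sum is 0 when NaN is skipped, while its strict mean is NaN.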

4.2. Example

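data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})
# count how often each value occurs in each column; the top-level pd.value_counts is deprecated since pandas 2.1
result = data.apply(pd.Series.value_counts).fillna(0)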
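A sketch of what the result looks like: the row labels are the distinct values found across all columns, and each cell is the number of times that value occurs in that column. This version passes pd.Series.value_counts to apply, which is an assumption about the preferred spelling on recent pandas versions.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# value_counts runs once per column; values absent from a column give NaN, replaced by 0
result = data.apply(pd.Series.value_counts).fillna(0)

print(result)
```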

5. Read Files

We will use pandas to read a CSV file and run some basic calculations.

Data source: https://www.kaggle.com/crawford/80-cereals/version/2#cereal.csv (attached as 80-cereals.zip)

Download the attached file and unzip it, then check your current working directory (IPython/Jupyter only):

%pwd
cereal_df = pd.read_csv("data/cereal.csv")
cereal_df1 = pd.read_csv("data/cereal.csv", skiprows = 1, na_values = ['no info', '.'])
cereal_df2 = pd.read_csv("data/cereal.csv", na_values = ['no info', '.'])
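To show what na_values does without needing the downloaded file, here is a sketch that reads a tiny in-memory CSV. The rows below are made up for illustration and are not taken from cereal.csv.

```python
import io

import pandas as pd

# Hypothetical CSV text with two placeholder markers for missing data
csv_text = """name,calories,protein
A,70,4
B,no info,3
C,110,.
"""

raw = pd.read_csv(io.StringIO(csv_text))    # markers stay as text
clean = pd.read_csv(io.StringIO(csv_text),
                    na_values=['no info', '.'])  # markers are read as NaN

print(raw.dtypes)
print(clean.dtypes)
print(clean)
```

Without na_values the affected columns are read as text, so numeric operations such as mean() would not work on them.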

Import matplotlib, a visualization library for Python, and seaborn, which is built on top of it

import matplotlib.pyplot as plt
import seaborn as sns

pick a column with numeric variables

calories = cereal_df2.loc[cereal_df2.calories > 100, 'calories']  # the calories column, rows above 100 only
protein = cereal_df2.protein
potass = cereal_df2['potass']

plot a histogram of the column

plt.hist(calories)
plt.title("Calories in Cereals")