Open radium05 opened 5 years ago
Series and DataFrame are the two workhorse data structures of pandas. Think of them as a single column and a whole table in Excel, respectively.
import pandas as pd
A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.
obj = pd.Series([4, 7, -5, 3])
The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:
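As a small sketch of the attributes just mentioned, the snippet below rebuilds the same Series and pulls out its values and index. The variable names values and index are only illustrative.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

values = obj.values   # the underlying NumPy array of data
index = obj.index     # the default integer index, 0 through N - 1

print(values)
print(index)
```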
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values. This is very important because it allows us to conduct filtering easily.
obj2['a']
obj2[['c','a','d']]
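To illustrate the filtering point, here is a minimal sketch that applies a boolean condition and a scalar multiplication to the same labelled Series. The names positive and doubled are made up for this example; the thing to notice is that the labels stay attached to the values.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Boolean filtering keeps only the entries that satisfy the condition,
# together with their labels
positive = obj2[obj2 > 0]

# Element-wise arithmetic also preserves the index
doubled = obj2 * 2

print(positive)
print(doubled)
```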
If you have data stored in a Python dict, you can create a Series from it by passing the dict:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
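Override the index by passing a list of labels. Values are matched by key: a label that is missing from the dict (here 'California') gets NaN, and a dict key that is not in the list (here 'Utah') is dropped.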
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations
obj5 = pd.Series(sdata)
obj4_5 = obj4 + obj5
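The following sketch shows what the alignment means in practice, reusing the same dict and state list. Labels that appear in only one operand, or whose value is already NaN, come out as NaN in the sum. The names total and missing are illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)  # 'California' has no value, so NaN
obj5 = pd.Series(sdata)                # has 'Utah', lacks 'California'

# Addition matches entries by label, not by position
total = obj4 + obj5

# Labels without a valid value on both sides end up as NaN
missing = total[total.isna()]

print(total)
print(list(missing.index))
```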
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). Think about Excel!!!
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002, 2003], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
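As a quick sketch of what the second constructor call produces: a requested column that is not in the dict ('debt') is created and filled with NaN, and columns and rows can then be retrieved by label. The names state_col and row_three are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

state_col = frame2['state']      # a column comes back as a Series
row_three = frame2.loc['three']  # a row is selected by label with .loc

print(frame2)
print(row_three)
```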
Fundamental mechanics of interacting with the data contained in a Series or DataFrame.
Reindex - series
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
Reindex - DataFrame
import numpy as np
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
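To make the effect of forward filling concrete, this sketch reindexes the same Series with and without the method option. The names plain and filled are illustrative.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Without a fill method, the new labels 1, 3 and 5 have no data
plain = obj3.reindex(range(6))

# With ffill, each gap takes the last valid value before it
filled = obj3.reindex(range(6), method='ffill')

print(plain)
print(filled)
```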
Dropping entries
new_obj = obj.drop('c')
new_frame = frame2.drop(columns=['Ohio'])  # 'Ohio' is a column label, so drop along the column axis
Arithmetic methods with fill values
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1 + df2
df1.add(df2, fill_value=0)
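The difference between the two calls above can be checked with a short sketch. With the plain operator, every cell that is missing in either frame becomes NaN; with fill_value=0, a missing cell is treated as 0 as long as the other frame has a value there. The names plain and filled are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

plain = df1 + df2                  # non-overlapping cells become NaN
filled = df1.add(df2, fill_value=0)  # missing side is treated as 0

print(plain)
print(filled)
```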
Method | Description |
---|---|
add, radd | Methods for addition (+) |
sub, rsub | Methods for subtraction (-) |
div, rdiv | Methods for division (/) |
floordiv, rfloordiv | Methods for floor division (//) |
mul, rmul | Methods for multiplication (*) |
pow, rpow | Methods for exponentiation (**) |
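Each method in the table has a counterpart starting with r that flips the arguments. A minimal sketch, using a small made-up frame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with values 1..4
df = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)), columns=list('ab'))

via_operator = 1 / df     # scalar on the left of the operator
via_method = df.rdiv(1)   # same computation written as a method call

reversed_sub = df.rsub(1)  # computes 1 - df, not df - 1

print(via_method)
print(reversed_sub)
```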
Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects.
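frame = pd.DataFrame(data)  # rebuild the state/year/pop table; frame was overwritten in the reindex example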
frame['val']=[-3,-1,0,3,-4,6]
del frame['state']
np.abs(frame)
f = lambda x: x.max() / x.min()
frame.apply(f)
frame.apply(f, axis='columns')
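fmt = lambda x: '%.2f' % x  # named fmt to avoid shadowing the built-in format
frame.applymap(fmt)  # element-wise on a DataFrame; renamed to DataFrame.map in pandas 2.1, where applymap is deprecated
frame['val'].map(fmt)  # element-wise on a Series; frame has no column 'e', so use an existing column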
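The function passed to apply does not have to return a scalar; it can also return a Series with several values. The sketch below assumes the same year/pop/val table as above, and min_max is a hypothetical helper name.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002, 2001, 2002, 2003],
                      'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
                      'val': [-3, -1, 0, 3, -4, 6]})

def min_max(col):
    # Called once per column; returns two labelled summary values
    return pd.Series([col.min(), col.max()], index=['min', 'max'])

summary = frame.apply(min_max)

# Element-wise formatting of a single column with Series.map
formatted = frame['pop'].map(lambda x: '%.2f' % x)

print(summary)
print(formatted)
```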
Sorting and Ranking
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
frame.sort_index()
frame.sort_index(axis=1)
frame.sort_index(axis=1, ascending=False)
obj.sort_values()
frame.sort_values(by=['a', 'b'])
obj.rank()
obj.rank(method='first')
frame.rank(axis='columns')
Method | Description |
---|---|
'average' | Default: assign the average rank to each entry in the equal group |
'min','max' | Use the minimum/maximum rank for the whole group |
'first' | Assign ranks in the order the values appear in the data |
'dense' | Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group |
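The tie-breaking methods in the table are easiest to compare side by side. This sketch uses a small made-up Series containing two tied groups and collects one column of ranks per method.

```python
import pandas as pd

# Illustrative data: the values 7 and 4 each appear twice
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
ranks.insert(0, 'value', obj)

print(ranks)
```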
Axis Indexes with Duplicate Labels
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj.index.is_unique
obj['a']
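One consequence of duplicate labels is that the return type of a selection depends on the label. A short sketch (dup and single are illustrative names):

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

dup = obj['a']      # label occurs twice, so a Series is returned
single = obj['c']   # label occurs once, so a scalar is returned

print(obj.index.is_unique)
print(type(dup).__name__, type(single).__name__)
```

Code that indexes by label may therefore need to handle both cases when the index is not guaranteed to be unique.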
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
df.sum()
df.sum(axis='columns')
df.mean(axis='columns', skipna=False)
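The following sketch spells out how missing values are handled in the reductions above. By default NaN is skipped, so a row that is entirely NaN sums to 0; with skipna=False, any NaN in a row makes the result NaN. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                                  # one value per column
row_sums = df.sum(axis='columns')                    # one value per row, NaN skipped
strict_means = df.mean(axis='columns', skipna=False)  # NaN propagates

print(col_sums)
print(row_sums)
print(strict_means)
```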
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})
result = data.apply(pd.Series.value_counts).fillna(0)  # the top-level pd.value_counts is deprecated in recent pandas releases
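To read the resulting table: the row labels are the distinct values found anywhere in the data, and each cell counts how often that value occurs in the given column. The sketch below uses pd.Series.value_counts as the applied function, which is an assumption made here for compatibility with newer pandas versions.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values column by column; values absent from a column get NaN,
# which fillna(0) turns into a count of zero
result = data.apply(pd.Series.value_counts).fillna(0)

print(result)
```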
We will use pandas to read a CSV file and conduct some basic calculations.
Data source: https://www.kaggle.com/crawford/80-cereals/version/2#cereal.csv
Download the attached file (80-cereals.zip) and unzip it.
Get your local directory:
%pwd
cereal_df = pd.read_csv("data/cereal.csv")
cereal_df1 = pd.read_csv("data/cereal.csv", skiprows = 1, na_values = ['no info', '.'])
cereal_df2 = pd.read_csv("data/cereal.csv", na_values = ['no info', '.'])
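The effect of the na_values and skiprows options can be tried without the cereal file. The sketch below reads a tiny in-memory CSV whose contents are made up for illustration (they are not taken from cereal.csv) and shows that the listed markers become NaN, and that skiprows=1 skips the header line, so the first data row is promoted to column names.

```python
import io
import pandas as pd

# Made-up sample text, standing in for a real file on disk
csv_text = "name,calories,protein\nA,70,4\nB,no info,3\nC,110,.\n"

# 'no info' and '.' are parsed as missing values
df = pd.read_csv(io.StringIO(csv_text), na_values=['no info', '.'])

# skiprows=1 drops the first line (the header), so 'A,70,4' becomes the header
df_skip = pd.read_csv(io.StringIO(csv_text), skiprows=1, na_values=['no info', '.'])

print(df)
print(df_skip.columns.tolist())
```

If the first line of the real file is the header you want to keep, the call without skiprows is probably the one to use.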
import matplotlib.pyplot as plt
import seaborn as sns
calories = cereal_df2.loc[cereal_df2.calories > 100, 'calories']  # keep only the calories column so plt.hist receives numeric data
protein = cereal_df2.protein
potass = cereal_df2['potass']
plt.hist(calories)
plt.title("Calories in Cereals")
[ ] Special topic on optimization by Dr. Heng Li @liheng0407
[ ] Chapter 5 pandas study led by Lei Xiao @radium05