/Library/Python/2.7/site-packages/pandas/io/excel.py:784: DeprecationWarning: Call to deprecated function remove_sheet (Use wb.remove(worksheet) or del wb[sheetname]).
self.book.remove_sheet(self.book.worksheets[0])
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py:207: DeprecationWarning: Interpreting naive datetime as local 2018-08-16 15:55:04.426237. Please add timezone info to timestamps.
chunks = self.iterencode(o, _one_shot=True)
# delete the DataFrame
del football
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py:207: DeprecationWarning: Interpreting naive datetime as local 2018-08-16 15:56:02.944848. Please add timezone info to timestamps.
chunks = self.iterencode(o, _one_shot=True)
# read from Excel
football = pd.read_excel('football.xlsx', 'Sheet1')
football
year
team
wins
losses
0
2010
Bears
11
5
1
2011
Bears
8
8
2
2012
Bears
10
6
3
2011
Packers
15
1
4
2012
Packers
11
5
5
2010
Lions
6
10
6
2011
Lions
10
6
7
2012
Lions
4
12
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py:207: DeprecationWarning: Interpreting naive datetime as local 2018-08-16 15:56:16.107466. Please add timezone info to timestamps.
chunks = self.iterencode(o, _one_shot=True)
football
losses
team
wins
year
0
5
bears
11
2010
1
8
packer
3
2011
2
6
bears
3
2012
dataframe应用
数据读入
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,
encoding='latin-1')
# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(5),
encoding='latin-1')
#查看部分数据
print(users[['age', 'zip_code']].head())
print('\n')
# can also store in a variable to use later
columns_you_want = ['occupation', 'sex']
print(users[columns_you_want].head())
age zip_code
0 24 85711
1 53 94043
2 23 32067
3 24 43537
4 33 15213
occupation sex
0 technician M
1 other F
2 writer M
3 technician M
4 other F
# users older than 25
print(users[users.age > 25].head(3))
print('\n')
# users aged 40 AND male
print(users[(users.age == 40) & (users.sex == 'M')].head(3))
print('\n')
# users younger than 30 OR female
print(users[(users.sex == 'F') | (users.age < 30)].head(3))
user_id age sex occupation zip_code
1 2 53 F other 94043
4 5 33 F other 15213
5 6 42 M executive 98101
user_id age sex occupation zip_code
18 19 40 M librarian 02138
82 83 40 M other 44133
115 116 40 M healthcare 97232
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
print(users.set_index('user_id').head())
print('\n')
print(users.head())
print("\n^^^ I didn't actually change the DataFrame. ^^^\n")
with_new_index = users.set_index('user_id')
print(with_new_index.head())
print("\n^^^ set_index actually returns a new DataFrame. ^^^\n")
age sex occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
^^^ I didn't actually change the DataFrame. ^^^
age sex occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
^^^ set_index actually returns a new DataFrame. ^^^
with_new_index.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 943 entries, 1 to 943
Data columns (total 4 columns):
age 943 non-null int64
sex 943 non-null object
occupation 943 non-null object
zip_code 943 non-null object
dtypes: int64(1), object(3)
memory usage: 36.8+ KB
users.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
user_id 943 non-null int64
age 943 non-null int64
sex 943 non-null object
occupation 943 non-null object
zip_code 943 non-null object
dtypes: int64(2), object(3)
memory usage: 36.9+ KB
<pandas.core.groupby.DataFrameGroupBy object at 0x1100f82d0>
print(by_dept.count().head()) # NOT NULL records within each column
print('\n')
print(by_dept.size().tail()) # total records for each department
name title salary
department
ADMIN HEARNG 42 42 42
ANIMAL CONTRL 61 61 61
AVIATION 1218 1218 1218
BOARD OF ELECTION 110 110 110
BOARD OF ETHICS 9 9 9
department
ADMIN HEARNG 42
ANIMAL CONTRL 61
AVIATION 1218
BOARD OF ELECTION 110
BOARD OF ETHICS 9
dtype: int64
print(by_dept.sum()[20:25]) # total salaries of each department
print('\n')
print(by_dept.mean()[20:25]) # average salary of each department
print('\n')
print(by_dept.median()[20:25]) # take that, RDBMS!
salary
department
HUMAN RESOURCES 4850928.0
INSPECTOR GEN 4035150.0
IPRA 7006128.0
LAW 31883920.2
LICENSE APPL COMM 65436.0
salary
department
HUMAN RESOURCES 71337.176471
INSPECTOR GEN 80703.000000
IPRA 82425.035294
LAW 70853.156000
LICENSE APPL COMM 65436.000000
salary
department
HUMAN RESOURCES 68496.0
INSPECTOR GEN 76116.0
IPRA 82524.0
LAW 66492.0
LICENSE APPL COMM 65436.0
#我们想看title最多的5个部门
#SELECT department, COUNT(DISTINCT title)
#FROM chicago
#GROUP BY department
#ORDER BY 2 DESC
#LIMIT 5;
by_dept.title.nunique().sort_values(ascending=False)[:5]
department
WATER MGMNT 153
TRANSPORTN 150
POLICE 130
AVIATION 125
HEALTH 118
Name: title, dtype: int64
def ranker(df):
"""Assigns a rank to each employee based on salary, with 1 being the highest paid.
Assumes the data is DESC sorted."""
df['dept_rank'] = np.arange(len(df)) + 1
return df
chicago.sort_values('salary', ascending=False, inplace=True)
chicago = chicago.groupby('department').apply(ranker)
print(chicago[chicago.dept_rank == 1].head(7))
name title department \
18039 MC CARTHY, GARRY F SUPERINTENDENT OF POLICE POLICE
8004 EMANUEL, RAHM MAYOR MAYOR'S OFFICE
25588 SANTIAGO, JOSE A FIRE COMMISSIONER FIRE
763 ANDOLINO, ROSEMARIE S COMMISSIONER OF AVIATION AVIATION
4697 CHOUCAIR, BECHARA N COMMISSIONER OF HEALTH HEALTH
21971 PATTON, STEPHEN R CORPORATION COUNSEL LAW
12635 HOLT, ALEXANDRA D BUDGET DIR BUDGET & MGMT
salary dept_rank
18039 260004.0 1
8004 216210.0 1
25588 202728.0 1
763 186576.0 1
4697 177156.0 1
21971 173664.0 1
12635 169992.0 1
chicago[chicago.department == "LAW"][:5]
name
title
department
salary
dept_rank
21971
PATTON, STEPHEN R
CORPORATION COUNSEL
LAW
173664.0
1
6311
DARLING, LESLIE M
FIRST ASST CORPORATION COUNSEL
LAW
149160.0
2
17680
MARTINICO, JOSEPH P
CHIEF LABOR NEGOTIATOR
LAW
144036.0
3
22357
PETERS, LYNDA A
CITY PROSECUTOR
LAW
139932.0
4
31383
WONG JR, EDWARD J
DEPUTY CORPORATION COUNSEL
LAW
137076.0
5
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,
encoding='latin-1')
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,
encoding='latin-1')
# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(5),
encoding='latin-1')
# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings)
lens = pd.merge(movie_ratings, users)
title
Star Wars (1977) 583
Contact (1997) 509
Fargo (1996) 508
Return of the Jedi (1983) 507
Liar Liar (1997) 485
English Patient, The (1996) 481
Scream (1996) 478
Toy Story (1995) 452
Air Force One (1997) 431
Independence Day (ID4) (1996) 429
Raiders of the Lost Ark (1981) 420
Godfather, The (1972) 413
Pulp Fiction (1994) 394
Twelve Monkeys (1995) 392
Silence of the Lambs, The (1991) 390
Jerry Maguire (1996) 384
Chasing Amy (1997) 379
Rock, The (1996) 378
Empire Strikes Back, The (1980) 367
Star Trek: First Contact (1996) 365
Back to the Future (1985) 350
Titanic (1997) 350
Mission: Impossible (1996) 344
Fugitive, The (1993) 336
Indiana Jones and the Last Crusade (1989) 331
dtype: int64
# 或者
lens.title.value_counts()[:25]
Star Wars (1977) 583
Contact (1997) 509
Fargo (1996) 508
Return of the Jedi (1983) 507
Liar Liar (1997) 485
English Patient, The (1996) 481
Scream (1996) 478
Toy Story (1995) 452
Air Force One (1997) 431
Independence Day (ID4) (1996) 429
Raiders of the Lost Ark (1981) 420
Godfather, The (1972) 413
Pulp Fiction (1994) 394
Twelve Monkeys (1995) 392
Silence of the Lambs, The (1991) 390
Jerry Maguire (1996) 384
Chasing Amy (1997) 379
Rock, The (1996) 378
Empire Strikes Back, The (1980) 367
Star Trek: First Contact (1996) 365
Titanic (1997) 350
Back to the Future (1985) 350
Mission: Impossible (1996) 344
Fugitive, The (1993) 336
Indiana Jones and the Last Crusade (1989) 331
Name: title, dtype: int64
title
Toy Story (1995) 0.119994
Twelve Monkeys (1995) 0.300315
Dead Man Walking (1995) -0.043452
Mr. Holland's Opus (1995) -0.244160
Braveheart (1995) 0.031136
Name: diff, dtype: float64
pandas是python的一个用于数据分析的函数库,通过数据结构作为入门的出发点,我们来看一下如何使用这个函数库。
pandas只有2种数据结构,series 和 dataframe
dataframe应用
数据读入
数据概览
2个表之间的操作
JOIN
Combining 合并,直接将2个表相加,通过参数控制水平方向相加还是数值方向
split-apply-combine 的策略
What are the 25 most rated movies?
Which movies are most highly rated?
pandas 绑定了 matplotlib
绑定用户
那部电影男、女的评价差异最大