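# 1. Getting to know your data

## Explore the Chipotle fast-food data

- Load the dataset into a DataFrame named chipo
- Show the first 10 rows
- How many columns does the dataset have?
- Print all the column names
- How is the dataset indexed?
- Which item was ordered the most?
- How many different items appear in the item_name column?
- Which choice in choice_description was ordered the most?
- How many items were ordered in total?
- Convert item_price to float
- What was the revenue over the period covered by the dataset?
- How many orders were placed over the period covered by the dataset?
- What is the average total per order?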

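# 2. Filtering and sorting

## Explore the Euro 2012 data

- Name the dataset euro12
- Select only the Goals column
- How many teams took part in Euro 2012?
- How many columns does the dataset have?
- Store the Team, Yellow Cards and Red Cards columns in a separate DataFrame named discipline
- Sort discipline by Red Cards first, then Yellow Cards
- Compute the mean number of yellow cards per team
- Find the teams that scored more than 6 goals
- Select the teams whose names start with G
- Select the first 7 columns
- Select all columns except the last 3
- Find the Shooting Accuracy of England, Italy and Russia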

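# 3. Grouping

## Explore the alcohol consumption data

- Name the DataFrame drinks
- Which continent drinks more beer on average?
- Print descriptive statistics of wine_servings for each continent
- Print the mean consumption of each alcohol category per continent
- Print the median consumption of each alcohol category per continent
- Print the mean, maximum and minimum spirit consumption per continent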

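# 4. Apply

## Explore the 1960 - 2014 US crime data

- Name the DataFrame crime
- What is the data type of each column?
- Convert the data type of Year to datetime64
- Set the Year column as the index of the DataFrame
- Delete the column named Total
- Group the DataFrame by Year (per decade) and sum
- When was the most dangerous decade to live in the US?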

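# 5. Merging

## Explore the fictitious names data

- Create the DataFrames
- Name the DataFrames data1, data2 and data3
- Concatenate data1 and data2 along the rows and name the result all_data
- Concatenate data1 and data2 along the columns and name the result all_data_col
- Print data3
- Merge all_data and data3 on the values of subject_id
- Join data1 and data2 on subject_id
- Find all matching results after merging data1 and data2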


```python
import pandas as pd
import numpy as np

raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
```


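# 6. Statistics
## [Explore the wind speed data](https://github.com/Hyhyhyhyhyhyh/data-analysis/issues/4#issuecomment-582796349)
- Load the data and set the first three columns as an appropriate index
- Year 2061? Do we really have data from that year? Create a function and use it to fix this bug
- Set the date as the index; mind the data type, which should be datetime64[ns]
- For each location, how many values are missing?
- For each location, how many non-missing values are there?
- Compute the mean wind speed over the whole dataset
- Create a DataFrame named loc_stats that stores the minimum, maximum, mean and standard deviation of the wind speed for each location
- Create a DataFrame named day_stats that stores the minimum, maximum, mean and standard deviation of the wind speed across all locations
- For each location, compute the mean wind speed in January
- Downsample the records to yearly frequency
- Downsample the records to monthly frequency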

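# 7. Visualization
## [Explore the Titanic disaster data](https://github.com/Hyhyhyhyhyhyh/data-analysis/issues/4#issuecomment-582797782)
- Name the DataFrame titanic
- Set PassengerId as the index
- Draw a pie chart showing the proportion of male and female passengers
- Draw a scatter plot of Fare against passenger age, split by sex
- How many people survived?
- Draw a histogram of the ticket fares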

# 8. Creating DataFrames
## [Explore the Pokemon data](https://github.com/Hyhyhyhyhyhyh/data-analysis/issues/4#issuecomment-582802172)
- Create a data dictionary
- Store the data dictionary in a DataFrame named pokemon
- The DataFrame's columns are in alphabetical order; reorder them as name, type, hp, evolution, pokedex
- Add a column place with the values ['park','street','lake','forest']
- Check the data type of each column
```python
import pandas as pd
# Create the data dictionary
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']
            }
```

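# 9. Time series

## Explore the Apple stock price data

- Read the data into a DataFrame named apple
- Check the data type of each column
- Convert the Date column to a datetime type
- Set Date as the index
- Are there any duplicate dates?
- Sort the index in ascending order
- Find the last business day of each month
- How many days separate the earliest and the latest date in the dataset?
- How many months does the data cover?
- Plot the Adj Close values in chronological order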

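# 10. Deleting data

## Explore the Iris flower data

- Store the dataset in a variable named iris
- Create the column names ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class']
- Are there missing values in the DataFrame?
- Set rows 10 to 19 of the petal_length column to missing
- Replace all missing petal_length values with 1.0
- Delete the class column
- Set the first three rows of the DataFrame to missing
- Delete the rows that contain missing values
- Reset the index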

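# 1. Explore the Chipotle fast-food data

- Load the dataset into a DataFrame named chipo
```python
import pandas as pd

# Inspect the first five raw lines to confirm the delimiter
with open('./chipotle.tsv') as f:
    head = [next(f) for _ in range(5)]
print(head)
print('*' * 100)

# Load the data (tab-separated)
chipo = pd.read_csv('./chipotle.tsv', delimiter='\t')
```

- Show the first 10 rows
```python
print(chipo.head(10))
```

- How many columns does the dataset have?
```python
print('Number of columns:', len(chipo.columns))
```

- Print all the column names
```python
print('Column names:', chipo.columns)
```

- How is the dataset indexed?
```python
print('Index:', chipo.index)
```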

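- Which item was ordered the most?
```python
# value_counts() counts order lines per item; head(1) keeps the most frequent one
print(chipo['item_name'].value_counts().head(1))
```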
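Counting order lines and counting units sold can give different winners, because one line may carry a quantity greater than 1. The following is a minimal sketch on a made-up miniature table (the `orders` frame and its values are hypothetical, not taken from chipotle.tsv) that contrasts the two readings of "most ordered".

```python
import pandas as pd

# Hypothetical miniature of the order lines, for illustration only
orders = pd.DataFrame({
    'order_id': [1, 1, 2, 3, 3],
    'item_name': ['Chips', 'Chicken Bowl', 'Chips', 'Chips', 'Chicken Bowl'],
    'quantity': [1, 4, 1, 1, 3],
})

# Reading 1: the item that appears on the most order lines
by_lines = orders['item_name'].value_counts()

# Reading 2: the item with the most units sold (sum of quantity)
by_units = (
    orders.groupby('item_name')['quantity']
    .sum()
    .sort_values(ascending=False)
)

print(by_lines.index[0])
print(by_units.index[0])
```

On the real data the same two expressions can be applied to `chipo` to check whether both readings agree.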

- How many different items appear in the item_name column?
```python
print('Number of distinct items ordered:', chipo['item_name'].nunique())
```

- Which choice in choice_description was ordered the most?
```python
# Disabled alternative: count every individual choice inside the brackets
# import re
# from collections import Counter
#
# re_obj = re.compile(r"[\[\]]+")
#
# def transform_choice(t):
#     if pd.isna(t):
#         return None
#     return re_obj.sub('', t)
#
# top_choice = chipo['choice_description'].apply(transform_choice).tolist()
# result = []
# for i in top_choice:
#     if i:
#         result.extend(s.strip() for s in i.split(','))
#
# c = Counter(result)
# c.most_common(1)

# Most frequent full description, then the items it was ordered with
top_choice = chipo['choice_description'].value_counts().index[0]
chipo[chipo.choice_description == top_choice]['item_name'].unique()
```


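- How many items were ordered in total?
```python
print('Total items ordered:', chipo['quantity'].sum())
```

- Convert item_price to float
```python
def transform_price(t):
    # Strip the dollar sign so the remaining text can be cast to float
    return t.replace('$', '')

chipo['price_new'] = chipo['item_price'].apply(transform_price).astype('float')
```

- What was the revenue over the period covered by the dataset?
```python
# Note: this treats item_price as a unit price. If item_price is already the
# line total for the given quantity, use chipo['price_new'].sum() instead.
revenue = chipo['price_new'] * chipo['quantity']
revenue = revenue.sum()
print('Revenue:', revenue)
```

- How many orders were placed over the period covered by the dataset?
```python
order_cnt = chipo['order_id'].nunique()
print('Number of orders:', order_cnt)
```

- What is the average total per order?
```python
print('Average total per order:', revenue / order_cnt)
```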
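The average above divides total revenue by the number of orders. An equivalent route is to total each order first and then average those totals, which also exposes the per-order distribution. A small sketch, assuming a hypothetical `lines` frame whose `line_total` column already holds the amount charged for each line:

```python
import pandas as pd

# Hypothetical order lines; line_total is the amount charged for the line
lines = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'line_total': [2.5, 7.5, 4.0, 6.0],
})

# Total of each order, then the mean of those totals
per_order = lines.groupby('order_id')['line_total'].sum()
avg_per_order = per_order.mean()

# Same number via total revenue divided by the number of distinct orders
avg_alt = lines['line_total'].sum() / lines['order_id'].nunique()

print(per_order)
print(avg_per_order, avg_alt)
```

`per_order.describe()` would additionally give the spread of order totals, which a single ratio cannot show.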

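# 2. Explore the Euro 2012 data

- Name the dataset euro12
```python
# Load the data
euro12 = pd.read_csv('./Euro2012.csv')
```

- Select only the Goals column
```python
goals = euro12['Goals']
```

- How many teams took part in Euro 2012?
```python
print('Number of teams in Euro 2012:', euro12['Team'].nunique())
```

- How many columns does the dataset have?
```python
print('Number of columns:', len(euro12.columns))
```

- Store the Team, Yellow Cards and Red Cards columns in a separate DataFrame named discipline
```python
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']].copy()
```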

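- Sort discipline by Red Cards first, then Yellow Cards
```python
discipline.sort_values(by=['Red Cards', 'Yellow Cards'], inplace=True)
```

- Compute the mean number of yellow cards per team
```python
print('Mean yellow cards per team:', discipline['Yellow Cards'].mean())
```

- Find the teams that scored more than 6 goals
```python
print(euro12[euro12.Goals > 6])
```

- Select the teams whose names start with G
```python
# A boolean mask keeps the full rows, not just the team names
G_team = euro12[euro12['Team'].str.startswith('G')]
```

- Select the first 7 columns
```python
# iloc selects the column data by position, not only the column labels
first_seven = euro12.iloc[:, :7]
```

- Select all columns except the last 3
```python
all_but_last_three = euro12.iloc[:, :-3]
```

- Find the Shooting Accuracy of England, Italy and Russia
```python
print(euro12.loc[euro12['Team'].isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']])
```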
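The filtering tasks in this section all come down to three selection tools: a boolean mask for rows, `isin` for membership tests, and `iloc` for positional column slices. The sketch below shows them side by side on a hypothetical `teams` frame with made-up numbers (not the real Euro 2012 figures).

```python
import pandas as pd

# Hypothetical stand-in for euro12 with made-up numbers
teams = pd.DataFrame({
    'Team': ['Germany', 'Greece', 'Italy', 'Spain'],
    'Goals': [10, 5, 6, 12],
    'Yellow Cards': [4, 9, 16, 11],
})

# Rows whose team name starts with 'G'
g_rows = teams[teams['Team'].str.startswith('G')]

# Rows for selected teams, restricted to two columns
picked = teams.loc[teams['Team'].isin(['Italy', 'Spain']), ['Team', 'Goals']]

# First two columns by position
first_two = teams.iloc[:, :2]

# Rows with more than 6 goals
high = teams[teams['Goals'] > 6]

print(g_rows, picked, first_two, high, sep='\n\n')
```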

# 3. Explore the alcohol consumption data

- Name the DataFrame drinks
```python
# Load the data
# Note: read_csv treats the text 'NA' as missing by default, so a continent
# coded 'NA' would be read as NaN and dropped by groupby.
drinks = pd.read_csv('./drinks.csv')
```

- Which continent drinks more beer on average?
```python
drinks.groupby('continent')['beer_servings'].mean().sort_values().tail(1)
```

- Print descriptive statistics of wine_servings for each continent
```python
wine = drinks.groupby('continent')['wine_servings']
print('Mean\n', wine.mean())
print('Count\n', wine.count())
print('Median\n', wine.median())
print('Standard deviation\n', wine.std())
print('Variance\n', wine.var())
print('Maximum\n', wine.max())
print('Minimum\n', wine.min())
print('Sum\n', wine.sum())
```

- Print the mean consumption of each alcohol category per continent
```python
# numeric_only=True skips text columns such as the country name
drinks.groupby('continent').mean(numeric_only=True)
```

- Print the median consumption of each alcohol category per continent
```python
drinks.groupby('continent').median(numeric_only=True)
```

- Print the mean, maximum and minimum spirit consumption per continent
```python
spirit = drinks.groupby('continent')['spirit_servings']
print(spirit.mean())
print(spirit.max())
print(spirit.min())
```
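The per-statistic calls above can be collapsed into a single table with `describe()` or `agg()`. A minimal sketch on a hypothetical `sample` frame (the numbers are invented, not taken from drinks.csv):

```python
import pandas as pd

# Hypothetical sample with invented values
sample = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'wine_servings': [100, 200, 0, 10],
    'spirit_servings': [50, 150, 20, 40],
})

# count, mean, std, min, quartiles and max in one table per continent
wine_stats = sample.groupby('continent')['wine_servings'].describe()

# Only the requested statistics, one column per statistic
spirit_stats = sample.groupby('continent')['spirit_servings'].agg(['mean', 'max', 'min'])

print(wine_stats)
print(spirit_stats)
```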

# 4. Explore the 1960 - 2014 US crime data

- Name the DataFrame crime
```python
crime = pd.read_csv('./US_Crime_Rates_1960_2014.csv')
```

- What is the data type of each column?
```python
crime.dtypes
```

- Convert the data type of Year to datetime64
```python
crime['Year'] = pd.to_datetime(crime['Year'], format='%Y')
```

- Set the Year column as the index of the DataFrame
```python
crime.set_index('Year', drop=True, inplace=True)
```

- Delete the column named Total
```python
crime.drop('Total', axis=1, inplace=True)
```

- Group the DataFrame by Year (per decade) and sum
```python
# Label each row with its decade (1960, 1970, ...), so 2010-2014 is kept as well
crime['year_partition'] = crime.index.year // 10 * 10
crime.groupby('year_partition').sum()
```
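Summing every column per decade also sums Population, which adds the same people up to ten times. One possible refinement, sketched here on a hypothetical four-row frame (`crime_demo` and its values are invented), is to aggregate each column with its own function: the maximum for Population and the sum for crime counts.

```python
import pandas as pd

# Hypothetical yearly records with invented values
crime_demo = pd.DataFrame(
    {'Population': [100, 110, 120, 130], 'Murder': [5, 6, 7, 8]},
    index=pd.to_datetime(['1968', '1969', '1970', '1971'], format='%Y'),
)

# Decade label for every row: 1960, 1960, 1970, 1970
decade = crime_demo.index.year // 10 * 10

# Per-column aggregation: population peaks, crimes accumulate
by_decade = crime_demo.groupby(decade).agg({'Population': 'max', 'Murder': 'sum'})

print(by_decade)
```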

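- When was the most dangerous decade to live in the US?
```python
crime_cols = ['Violent', 'Murder', 'Forcible_Rape', 'Robbery', 'Aggravated_assault',
              'Burglary', 'Larceny_Theft', 'Vehicle_Theft']
# Danger score: summed crime counts per capita (the factor 8 only rescales it).
# The row with the highest score is a single year, not a whole decade.
crime['danger'] = crime[crime_cols].sum(axis=1) / crime['Population'] * 8
crime[crime['danger'] == crime['danger'].max()]
```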

# 5. Explore the fictitious names data

- Create the DataFrames
- Name the DataFrames data1, data2 and data3
```python
raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

data1 = pd.DataFrame(raw_data_1)
data2 = pd.DataFrame(raw_data_2)
data3 = pd.DataFrame(raw_data_3)
```

- Concatenate data1 and data2 along the rows and name the result all_data
```python
all_data = pd.concat([data1, data2], axis=0)
```

- Concatenate data1 and data2 along the columns and name the result all_data_col
```python
all_data_col = pd.concat([data1, data2], axis=1)
```

- Print data3
```python
print(data3)
```

- Merge all_data and data3 on the values of subject_id
```python
all_data.merge(data3, on='subject_id')
```

- Join data1 and data2 on subject_id
```python
# The default inner join keeps only the subject_id values present in both frames
data1.merge(data2, on='subject_id')
```
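The last task in the exercise list, finding all matching results after merging data1 and data2, has no solution above. One plausible reading is an outer join, which keeps every subject_id from both frames and fills the gaps with NaN. The sketch below uses trimmed copies of data1 and data2 (the last_name column is left out for brevity); treating the task as an outer join is an assumption.

```python
import pandas as pd

data1 = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
})
data2 = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
})

# Inner join: only subject_id values found in both frames
inner = pd.merge(data1, data2, on='subject_id', how='inner')

# Outer join: every subject_id from either frame; indicator=True adds a
# _merge column telling where each row came from
outer = pd.merge(data1, data2, on='subject_id', how='outer', indicator=True)

print(inner)
print(outer)
```

The overlapping first_name column is disambiguated with the default `_x` and `_y` suffixes.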

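# 6. Explore the wind speed data

- Load the data and set the first three columns as an appropriate index
```python
import numpy as np

# Inspect the first five raw lines
with open('./wind.csv') as f:
    print(f.readlines()[0:5])

a = np.loadtxt('./wind.csv', skiprows=1)

wind = pd.DataFrame(a, columns=['Yr', 'Mo', 'Dy', 'RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR', 'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL'])
del a

# Build a date column from the year, month and day columns
wind['date'] = wind['Yr'].astype(int).astype(str).str.cat(wind['Mo'].astype(int).astype(str), sep='-').str.cat(wind['Dy'].astype(int).astype(str), sep='-')
wind['date'] = pd.to_datetime(wind['date'])
```

- Year 2061? Do we really have data from that year? Create a function and use it to fix this bug
```python
def transform_date(t):
    # Two-digit years were parsed into the 2000s; move them back one century.
    # DateOffset(years=100) respects leap days, unlike a fixed 365*100-day shift.
    if t > pd.to_datetime('2020'):
        return t - pd.DateOffset(years=100)
    return t

wind['date'] = wind['date'].apply(transform_date)
```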
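The century bug can also be avoided before parsing, by building the dates from numeric components instead of from a string. This is a sketch under the assumption that every two-digit year belongs to the 1900s; the `raw` frame is a hypothetical three-row excerpt. `pd.to_datetime` accepts a DataFrame whose columns are named year, month and day.

```python
import pandas as pd

# Hypothetical excerpt of the first three columns
raw = pd.DataFrame({'Yr': [61, 61, 78], 'Mo': [1, 1, 12], 'Dy': [1, 2, 31]})

# Assumption: every two-digit year belongs to the 1900s
parts = pd.DataFrame({
    'year': raw['Yr'] + 1900,
    'month': raw['Mo'],
    'day': raw['Dy'],
})

# Assemble the dates from the year/month/day columns
dates = pd.to_datetime(parts)

print(dates)
```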

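- Set the date as the index; mind the data type, which should be datetime64[ns]
```python
wind.set_index('date', inplace=True)
# Keep only the location columns for the wind-speed statistics
speeds = wind.drop(columns=['Yr', 'Mo', 'Dy'])
```

- For each location, how many values are missing?
```python
speeds.isnull().sum()
```

- For each location, how many non-missing values are there?
```python
speeds.notnull().sum()
```

- Compute the mean wind speed over the whole dataset
```python
# Mean over every reading of every location, ignoring missing values
np.nanmean(speeds.to_numpy())
```

- Create a DataFrame named loc_stats that stores the minimum, maximum, mean and standard deviation of the wind speed for each location
```python
loc_stats = speeds.describe().loc[['min', 'max', 'mean', 'std']]
```

- Create a DataFrame named day_stats that stores the minimum, maximum, mean and standard deviation of the wind speed across all locations
```python
day_stats = pd.DataFrame(
    {
        'loc_min': speeds.min(axis=1),
        'loc_max': speeds.max(axis=1),
        'loc_mean': speeds.mean(axis=1),
        'loc_std': speeds.std(axis=1),
    }
)
```

- For each location, compute the mean wind speed in January
```python
speeds[speeds.index.month == 1].mean()
```

- Downsample the records to yearly frequency
```python
speeds.groupby(speeds.index.year).mean()
```

- Downsample the records to monthly frequency
```python
# One row per calendar month of each year (grouping by month number alone
# would pool the same month across all years)
speeds.groupby(speeds.index.to_period('M')).mean()
```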
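Grouping on the index is one way to change frequency; `resample` is another. The sketch below uses a hypothetical `demo` frame with one invented reading per day over 1961 and 1962, and contrasts yearly and monthly downsampling with pooling all Januaries together.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings for 1961 and 1962 (values are just 0, 1, 2, ...)
idx = pd.date_range('1961-01-01', '1962-12-31', freq='D')
demo = pd.DataFrame({'RPT': np.arange(len(idx), dtype=float)}, index=idx)

yearly = demo.resample('YS').mean()    # one row per calendar year
monthly = demo.resample('MS').mean()   # one row per calendar month of each year

# Pooling every January across years is a different question
january = demo[demo.index.month == 1].mean()

print(yearly)
print(monthly.head())
print(january)
```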

# 7. Explore the Titanic disaster data

- Name the DataFrame titanic
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Only needed when plot labels contain Chinese characters
plt.rcParams['font.family'] = 'SimHei'

titanic = pd.read_csv('./train.csv')
```

- Set PassengerId as the index
```python
titanic.set_index('PassengerId', drop=True, inplace=True)
```

- Draw a pie chart showing the proportion of male and female passengers
```python
sex = titanic['Sex'].value_counts()

fig = plt.figure(figsize=(16, 12))
ax1 = fig.add_subplot(2, 2, 1)
# The labels assume value_counts() lists male first, then female
plt.pie(
    sex,
    labels=['Male', 'Female'],
    shadow=True,
    explode=[0, 0.1],
    startangle=215,
    textprops={'fontsize': 14},
    autopct='%1.1f%%',
)
plt.title('Passenger sex ratio')
plt.legend()
```

- Draw a scatter plot of Fare against passenger age, split by sex
```python
ax2 = fig.add_subplot(2, 2, 2)
plt.scatter(
    titanic[titanic.Sex == 'male']['Age'],
    titanic[titanic.Sex == 'male']['Fare'],
    color='#5677fc',
    s=10,
    label='Male'
)
plt.scatter(
    titanic[titanic.Sex == 'female']['Age'],
    titanic[titanic.Sex == 'female']['Fare'],
    color='#ff9800',
    s=10,
    label='Female'
)
plt.xlabel('Age')
plt.ylabel('Fare', rotation=0)
plt.title('Fare against passenger age and sex')
plt.legend()
```

- How many people survived?
```python
titanic['Survived'].value_counts().loc[1]
```
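Because Survived is coded as 0 and 1, its sum is the number of survivors and its mean is the survival rate. A short sketch on a hypothetical five-passenger frame (invented values, not from train.csv) that also splits the rate by sex:

```python
import pandas as pd

# Hypothetical passengers with invented values
passengers = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Survived': [0, 1, 1, 0, 1],
})

survivors = int(passengers['Survived'].sum())          # count of 1s
rate = passengers['Survived'].mean()                   # share of 1s
by_sex = passengers.groupby('Sex')['Survived'].mean()  # share of 1s per sex

print(survivors, rate)
print(by_sex)
```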

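- Draw a histogram of the ticket fares
```python
ax3 = fig.add_subplot(2, 2, 3)
# distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws the same histogram plus density curve
sns.histplot(
    titanic['Fare'],
    bins=20,
    kde=True,
)
plt.title('Histogram of ticket fares')
plt.xlabel('Fare')

plt.show()
```

![Untitled](https://user-images.githubusercontent.com/44884619/73921046-97df2500-4901-11ea-8825-8de643535dcc.png)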

# 8. Explore the Pokemon data

- Create a data dictionary
- Store the data dictionary in a DataFrame named pokemon
```python
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
            "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no','yes','no']
            }

pokemon = pd.DataFrame(raw_data)
```

- The DataFrame's columns are in alphabetical order; reorder them as name, type, hp, evolution, pokedex
```python
pokemon = pokemon.reindex(columns=['name', 'type', 'hp', 'evolution', 'pokedex'])
```

- Add a column place with the values ['park','street','lake','forest']
```python
pokemon['place'] = ['park','street','lake','forest']
```

- Check the data type of each column
```python
pokemon.dtypes
```

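# 9. Explore the Apple stock price data

- Read the data into a DataFrame named apple
```python
apple = pd.read_csv('./appl_1980_2014.csv')
```

- Check the data type of each column
```python
apple.dtypes
```

- Convert the Date column to a datetime type
```python
apple['Date'] = pd.to_datetime(apple['Date'])
```

- Set Date as the index
```python
apple.set_index('Date', drop=True, inplace=True)
```

- Are there any duplicate dates?
```python
# 0 means every date is unique
apple.index.duplicated().sum()
```

- Sort the index in ascending order
```python
apple.sort_index(ascending=True, inplace=True)
```

- Find the last business day of each month
```python
# BMonthEnd is the offset behind the 'BM' alias (spelled 'BME' in newer pandas)
apple.resample(pd.offsets.BMonthEnd()).sum().index
```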
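Resampling to business month end yields calendar dates, which need not be days on which the stock actually traded. If the goal is the last trading day present in the data, one option is to group the sorted index by month and keep the last row of each group. A sketch on a hypothetical four-row price table with invented values:

```python
import pandas as pd

# Hypothetical trading days; 2014-01-31 is a business day but absent here
days = pd.to_datetime(['2014-01-02', '2014-01-30', '2014-02-03', '2014-02-27'])
prices = pd.DataFrame({'Adj Close': [10.0, 11.0, 12.0, 13.0]}, index=days).sort_index()

# Last row of each calendar month, keeping the original dates
last_days = prices.groupby(prices.index.to_period('M')).tail(1).index

print(last_days)
```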
```
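- How many days separate the earliest and the latest date in the dataset?
```python
(apple.index.max() - apple.index.min()).days
```

- How many months does the data cover?
```python
# Collect the distinct 'YYYY-MM' prefixes of all dates
month = set()
for date in apple.index.astype(str).tolist():
    month.add(date[0:7])
len(month)
```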
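The loop over string prefixes can be replaced by converting the index to monthly periods and counting the distinct ones. A minimal sketch on a hypothetical five-date index:

```python
import pandas as pd

# Hypothetical dates spanning four distinct calendar months
idx = pd.to_datetime(['2014-01-02', '2014-01-30', '2014-02-03', '2014-03-27', '2015-01-05'])

# Distinct (year, month) pairs in the index
n_months = idx.to_period('M').nunique()

# Whole days between the earliest and the latest date
span_days = (idx.max() - idx.min()).days

print(n_months, span_days)
```

Note that this counts months that contain at least one record, the same quantity the loop computes, rather than the number of calendar months between the first and last date.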

- Plot the Adj Close values in chronological order
```python
fig = plt.figure(figsize=(10, 4))
plt.plot(apple['Adj Close'], label='Adj Close')
plt.xlabel('Year')
plt.title('Adj Close trend')
plt.legend()
```


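# 10. Explore the Iris flower data

- Store the dataset in a variable named iris
```python
# Inspect the first ten raw lines
with open('./iris.data') as f:
    print(f.readlines()[0:10])

# dtype=str keeps the class labels as text rather than bytes
t = np.loadtxt('./iris.data', delimiter=',', dtype=str)
```

- Create the column names ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class']
```python
iris = pd.DataFrame(t, columns=['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class'])

# The four measurement columns were loaded as text; convert them to float
iris.sepal_length = iris.sepal_length.astype(float)
iris.sepal_width = iris.sepal_width.astype(float)
iris.petal_length = iris.petal_length.astype(float)
iris.petal_width = iris.petal_width.astype(float)
```

- Are there missing values in the DataFrame?
```python
# Count missing values per column (duplicated() would count repeated rows instead)
iris.isnull().sum()
```

- Set rows 10 to 19 of the petal_length column to missing
```python
# loc slices by label and includes both endpoints
iris.loc[10:19, 'petal_length'] = np.nan
```

- Replace all missing petal_length values with 1.0
```python
iris['petal_length'] = iris['petal_length'].fillna(1.0)
```

- Delete the class column
```python
iris.drop('class', axis=1, inplace=True)
```

- Set the first three rows of the DataFrame to missing
```python
# iloc slices by position and excludes the end, so this covers rows 0, 1 and 2
iris.iloc[0:3, :] = np.nan
```

- Delete the rows that contain missing values
```python
iris.dropna(axis=0, inplace=True)
```

- Reset the index
```python
# drop=True discards the old index instead of adding it as a column
iris.reset_index(drop=True, inplace=True)
```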
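The slicing rules matter for the tasks above: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end. The sketch below walks through the same steps on a hypothetical 30-row `demo` frame, so the row counts can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame with values 0.0 .. 29.0
demo = pd.DataFrame({'petal_length': np.arange(30, dtype=float)})

# Label slice: rows 10 through 19 inclusive, ten rows in total
demo.loc[10:19, 'petal_length'] = np.nan
n_missing = int(demo['petal_length'].isnull().sum())

# Positional slice: rows 0, 1 and 2 only
demo.iloc[0:3, :] = np.nan

# Drop incomplete rows and renumber from 0 without keeping the old index
cleaned = demo.dropna().reset_index(drop=True)

print(n_missing, len(cleaned))
```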

