Hyhyhyhyhyhyh commented 4 years ago

Notes

The analysis approach was learned from: https://blog.csdn.net/u014044812/article/details/104076639
Original author's source code: https://github.com/pig6/china_population
Data source: http://data.stats.gov.cn/

Hyhyhyhyhyhyh commented 4 years ago

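Scraping the population data

Inspecting the API

Inspecting the API and its responses shows that:

Clicking a different directory on the left changes the dfwds parameter, so supplying the correct dfwds value is enough to get data from the API
In the front end you first select a directory on the left and then pick a time range for the data on the right. Both steps call the same endpoint and differ only in their request parameters, and the directory must be requested before its data. Here we try merging the parameters of both steps into a single request
The front end only offers "last 20 years", whose request parameter is LAST20; try changing it to LAST70 to fetch all data since 1949
As figure 2 above shows, the data items (rows) are distinguished by code: total population is A030101, and the other items are A03010X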
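The observations above can be sketched as a small helper that assembles the query string. This is only an illustration, assuming the endpoint keeps accepting the parameters seen in the browser (`m`, `dbcode`, `rowcode`, `colcode`, `wds`, `dfwds`, `k1`). `build_query_params` is a hypothetical name, and `k1` is assumed to be a millisecond timestamp because the captured values look like one.

```python
import json
import time
from urllib.parse import urlencode

BASE_URL = 'http://data.stats.gov.cn/easyquery.htm'


def build_query_params(indicator_code, period='LAST70'):
    """Return the query parameters for one indicator group and time range."""
    dfwds = [
        {'wdcode': 'zb', 'valuecode': indicator_code},  # directory picked on the left
        {'wdcode': 'sj', 'valuecode': period},          # time range picked on the right
    ]
    return {
        'm': 'QueryData',
        'dbcode': 'hgnd',
        'rowcode': 'zb',
        'colcode': 'sj',
        'wds': '[]',
        'dfwds': json.dumps(dfwds, separators=(',', ':')),
        'k1': str(int(time.time() * 1000)),  # assumption: timestamp in milliseconds
    }


params = build_query_params('A0301')
print(BASE_URL + '?' + urlencode(params))
```

With `requests` installed, the same dictionary could be passed as `requests.get(BASE_URL, params=params)`, which takes care of URL encoding.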
```
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
```

http://data.stats.gov.cn/easyquery.htm?cn=C01

# Scraping the population data

# Total population
url = 'http://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgnd&rowcode=zb&colcode=sj&wds=[]&dfwds=[{"wdcode":"zb","valuecode":"A0301"}, {"wdcode":"sj","valuecode":"LAST70"}]&k1=1581078799246'
response = requests.get(url)
res = response.json()
population = res['returndata']['datanodes']

# Birth rate, death rate, natural growth rate
url2 = 'http://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgnd&rowcode=zb&colcode=sj&wds=[]&dfwds=[{"wdcode":"zb","valuecode":"A0302"}, {"wdcode":"sj","valuecode":"LAST70"}]&k1=1581079751047'
response = requests.get(url2)
res = response.json()
rate = res['returndata']['datanodes']

# Population structure
url3 = 'http://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgnd&rowcode=zb&colcode=sj&wds=[]&dfwds=[{"wdcode":"zb","valuecode":"A0303"}, {"wdcode":"sj","valuecode":"LAST70"}]&k1=1581079906591'
response = requests.get(url3)
res = response.json()
structure = res['returndata']['datanodes']


## Processing the data returned by the API
```python
# Save the data to Excel
# Structure of the API data:
# data.data holds the value of each cell
# wds[0].valuecode (A030101, A030102, ...) identifies the row shown in the front end
# wds[1].valuecode is the year
def get_data(res):
    res_dict = {
        'year': [],
        'line': [],
        'data': []
    }
    for i in res:
        res_dict['year'].append(i['wds'][1]['valuecode'])
        res_dict['line'].append(i['wds'][0]['valuecode'])
        res_dict['data'].append(i['data']['data'])
    return res_dict

p = pd.DataFrame(get_data(population))
r = pd.DataFrame(get_data(rate))
s = pd.DataFrame(get_data(structure))

# reset_index(drop=True) lines every piece up on 0..n-1 before the column-wise concat
p = pd.concat([
        p[p.line == 'A030101'][['year', 'data']].reset_index(drop=True),
        p[p.line == 'A030102'].reset_index(drop=True)['data'],
        p[p.line == 'A030103'].reset_index(drop=True)['data'],
        p[p.line == 'A030104'].reset_index(drop=True)['data'],
        p[p.line == 'A030105'].reset_index(drop=True)['data'],
    ],
    axis=1,
    ignore_index=True,
)
p.rename(columns={
    0: 'year',
    1: '年末总人口(万人)',
    2: '男性人口(万人)',
    3: '女性人口(万人)',
    4: '城镇人口(万人)',
    5: '乡村人口(万人)'
}, inplace=True)
p.set_index('year', drop=True, inplace=True)

r = pd.concat([
        r[r.line == 'A030201'][['year', 'data']].reset_index(drop=True),
        r[r.line == 'A030202'].reset_index(drop=True)['data'],
        r[r.line == 'A030203'].reset_index(drop=True)['data'],
    ],
    axis=1,
    ignore_index=True,
)
r.rename(columns={
    0: 'year',
    1: '人口出生率(‰)',
    2: '人口死亡率(‰)',
    3: '人口自然增长率(‰)'
}, inplace=True)
r.set_index('year', drop=True, inplace=True)

s = pd.concat([
        s[s.line == 'A030301'][['year', 'data']].reset_index(drop=True),
        s[s.line == 'A030302'].reset_index(drop=True)['data'],
        s[s.line == 'A030303'].reset_index(drop=True)['data'],
        s[s.line == 'A030304'].reset_index(drop=True)['data'],
        s[s.line == 'A030305'].reset_index(drop=True)['data'],
        s[s.line == 'A030306'].reset_index(drop=True)['data'],
        s[s.line == 'A030307'].reset_index(drop=True)['data'],
    ],
    axis=1,
    ignore_index=True,
)
s.rename(columns={
    0: 'year',
    1: '年末总人口(万人)',
    2: '0-14岁人口(万人)',
    3: '15-64岁人口(万人)',
    4: '65岁及以上人口(万人)',
    5: '总抚养比(%)',
    6: '少儿抚养比(%)',
    7: '老年抚养比(%)',
}, inplace=True)
s.set_index('year', drop=True, inplace=True)

# Join the three tables side by side on the year index
people = pd.concat([p, r, s], axis=1)
```
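The repeated `pd.concat` calls above amount to a pivot from long to wide form. Below is a rough sketch of the same reshaping in plain Python, assuming each node carries the three fields that `get_data` reads (`wds[0].valuecode`, `wds[1].valuecode`, `data.data`). `pivot_nodes` is a hypothetical helper, and the sample values are placeholders, not real statistics.

```python
from collections import defaultdict


def pivot_nodes(nodes):
    """Group datanodes by year: {year: {indicator_code: value}}."""
    wide = defaultdict(dict)
    for node in nodes:
        code = node['wds'][0]['valuecode']   # row code, e.g. 'A030101'
        year = node['wds'][1]['valuecode']   # year as a string, e.g. '2018'
        wide[year][code] = node['data']['data']
    return dict(wide)


# Placeholder payload in the assumed shape (values are NOT real data)
sample = [
    {'wds': [{'valuecode': 'A030101'}, {'valuecode': '2018'}], 'data': {'data': 100.0}},
    {'wds': [{'valuecode': 'A030102'}, {'valuecode': '2018'}], 'data': {'data': 51.0}},
    {'wds': [{'valuecode': 'A030101'}, {'valuecode': '2017'}], 'data': {'data': 99.0}},
]
print(pivot_nodes(sample))
```

The resulting dictionary could be passed to `pd.DataFrame.from_dict(wide, orient='index')` to get one row per year. Keying on year and code this way avoids relying on the order of rows in the response when columns are lined up.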

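At this point the population data has been fetched and processed, but the site does not yet include figures for 2019, so the 2019 values are computed by hand from the official report (http://www.stats.gov.cn/tjsj/zxfb/202001/t20200117_1723383.html)

Total dependency ratio = non-working-age population / working-age population
Child dependency ratio = child population / working-age population
Old-age dependency ratio = elderly population / working-age population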
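As a worked example of these three ratios, here is a minimal sketch using the 2019 figures that appear in this thread (in units of 10,000 persons). `dependency_ratios` is a hypothetical helper name; the ratios are expressed as percentages, matching the `(%)` column names.

```python
def dependency_ratios(young, working, old):
    """Return (total, child, old-age) dependency ratios in percent."""
    if working <= 0:
        raise ValueError('working-age population must be positive')
    child = young / working * 100
    old_age = old / working * 100
    # The non-working-age population is the sum of the young and the old
    return child + old_age, child, old_age


# 2019 figures from this thread, in units of 10,000 persons
total, child, old_age = dependency_ratios(young=32762, working=89640, old=17603)
print(f'total={total:.2f}% child={child:.2f}% old-age={old_age:.2f}%')
```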


data_2019 = pd.DataFrame({'2019': [140005, 71527, 68478, 84843, 55162, 10.48, 7.14, 3.34, 140005, 32762, 89640, 17603, (32762+17603)/89640*100, 32762/89640*100, 17603/89640*100]}).T
data_2019.rename(columns={
0:  '年末总人口(万人)',
1:  '男性人口(万人)',
2:  '女性人口(万人)',
3:  '城镇人口(万人)',
4:  '乡村人口(万人)',
5:  '人口出生率(‰)',
6:  '人口死亡率(‰)',
7:  '人口自然增长率(‰)',
8:  '年末总人口(万人)',
9:  '0-14岁人口(万人)',
10: '15-64岁人口(万人)',
11: '65岁及以上人口(万人)',
12: '总抚养比(%)',
13: '少儿抚养比(%)',
14: '老年抚养比(%)',
}, inplace=True)

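people = pd.concat([people, data_2019], axis=0)
# '年末总人口(万人)' appears in both the population and the structure tables; keep one copy
people = people.loc[:, ~people.columns.duplicated()]
# 'year' was moved into the index earlier, so rebuild it as an integer index and column
people.index = people.index.astype(int)
people.sort_index(ascending=False, inplace=True)
people['year'] = people.index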

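# Save the merged population data to Excel
# The context manager saves and closes the file; newer pandas versions no longer provide writer.save()
with pd.ExcelWriter('./population.xlsx') as writer:
    people.to_excel(writer, sheet_name='中国70年人口数据', index=True)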



## The processed Excel file
[population.xlsx](https://github.com/Hyhyhyhyhyhyh/data-analysis/files/4176801/population.xlsx)

Hyhyhyhyhyhyh commented 4 years ago

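Dimensions of the analysis

Total population

Growth trend
Maximum and minimum values
Impact of family planning

Sex ratio

Male and female shares in 2019
Male share of total population by year
Trend in male and female population by year
Trend in the male-female population gap by year

Urbanization

Urbanization ratio in 2019
Trend in urban population by year
Trend in urbanization rate by year

Urbanization rate = urban population / total population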
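A short worked example of this formula, using the 2019 urban and total population figures from this thread (in units of 10,000 persons). `urbanization_rate` is a hypothetical helper name, and the result is expressed as a percentage.

```python
def urbanization_rate(urban, total):
    """Urban population as a percentage of total population."""
    if total <= 0:
        raise ValueError('total population must be positive')
    return urban / total * 100


# 2019 figures from this thread, in units of 10,000 persons
rate_2019 = urbanization_rate(urban=84843, total=140005)
print(f'{rate_2019:.2f}%')
```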

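Population growth rate

Trends in population growth rate, birth rate, and death rate by year

Age structure

Population age structure is an indicator of population aging and of the demographic dividend.

Demographic dividend: an economics term for a situation in which the working-age population makes up a large share of a country's total population and the dependency ratio is low. This creates favorable demographic conditions for economic development, and the economy as a whole shows high savings, high investment, and high growth.

Age structure shares in 2019
Trend in age structure by year
Trend in dependency ratios by year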

Hyhyhyhyhyhyh commented 4 years ago

Analysis of total population change

# Population growth curve by year
sns.set(style='darkgrid')
plt.rcParams['font.family'] = 'SimHei'

# Convert total population to units of 100 million
year = people['year'][::-1]
total_population = people['年末总人口(万人)'][::-1] / 10000

fig = plt.figure(figsize=(12,6))
plt.plot(year, total_population, c='#5677fc')
plt.xticks(range(1949,2020,5))
plt.xlabel('年份')
plt.ylabel('年末人口数（亿人）')
plt.title('中国70年人口增长趋势图', fontsize=21)

# Add data labels
# Lowest point (the `s` argument of annotate was renamed to `text` in Matplotlib 3.3)
plt.annotate(
    xy=(1949, total_population[1949]),
    xytext=(1949, 6),
    text='1949年\n{}'.format(total_population[1949]),
    ha='center',
    va='bottom',
    color='white',
    arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='#009688'),
    bbox=dict(fc='#009688', alpha=0.5, boxstyle='round,pad=0.5')
)
# Highest point
plt.annotate(
    xy=(2019, total_population[2019]),
    xytext=(2019, 13),
    text='2019年\n{}'.format(total_population[2019]),
    ha='center',
    va='bottom',
    color='white',
    arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='#009688'),
    bbox=dict(fc='#009688', alpha=0.5, boxstyle='round,pad=0.5')
)
# Population decline in 1960 and 1961
plt.annotate(
    xy=(1960, total_population[1960]),
    xytext=(1960, 8),
    text='1960年\n{}'.format(total_population[1960]),
    ha='center',
    va='bottom',
    color='white',
    arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='r'),
    bbox=dict(fc='r', alpha=0.5, boxstyle='round,pad=0.5')
)
plt.annotate(
    xy=(1961, total_population[1961]),
    xytext=(1961, 10),
    text='1961年\n{}'.format(total_population[1961]),
    ha='center',
    va='bottom',
    color='white',
    arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='r'),
    bbox=dict(fc='r', alpha=0.5, boxstyle='round,pad=0.5')
)
plt.vlines(1979, 4, 14, color='r', label='计划生育', ls='--')
plt.legend()

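Comparing total population growth in the 30 years before and after family planning

fig = plt.figure(figsize=(7,6))
# Note: 1949-1979 spans 30 years but 1979-2019 spans 40, so the two bars cover windows of different length
before = round(total_population[1979] - total_population[1949], 2)
after = round(total_population[2019] - total_population[1979], 2)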
plt.bar(0, before, label='计划生育前30年')
plt.bar(1, after, label='计划生育后30年')
plt.xticks([0,1], labels=['计划生育前30年', '计划生育后30年'])
plt.yticks(range(0, 7))
plt.ylabel('人口增长总数（亿人）')
plt.text(x=0, y=before+0.1, s=before, ha='center')
plt.text(x=1, y=after+0.1, s=after, ha='center')
plt.title('计划生育前后30年总增长人口总数对比', fontsize=21)
plt.legend()
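The two windows in this comparison differ in length: 1949 to 1979 is 30 years, while 1979 to 2019 is 40. One way to make them comparable is to look at average growth per year as well as the absolute change. The sketch below is only an illustration; `growth` is a hypothetical helper and the sample values are placeholders, not real statistics.

```python
def growth(population_by_year, start, end):
    """Return (absolute change, average change per year) between two years."""
    years = end - start
    if years <= 0:
        raise ValueError('end must be later than start')
    change = population_by_year[end] - population_by_year[start]
    return change, change / years


# Placeholder values (NOT real data), keyed by year
demo = {1949: 1.0, 1979: 4.0, 2009: 5.5, 2019: 6.0}

before_change, before_per_year = growth(demo, 1949, 1979)
after_change, after_per_year = growth(demo, 1979, 2019)
print(before_change, before_per_year)
print(after_change, after_per_year)
```

The same function works on the `total_population` series from this thread, since a pandas Series indexed by integer year supports the same `[]` lookup.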

Hyhyhyhyhyhyh commented 4 years ago

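Gender analysis

# Male and female shares in 2019
# .values[0] extracts a scalar so that plt.pie receives a flat list of two numbers
male_2019 = people[people.year == 2019]['男性人口(万人)'].values[0]
female_2019 = people[people.year == 2019]['女性人口(万人)'].values[0]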

fig = plt.figure(figsize=(16,10))

ax1 = fig.add_subplot(2,2,1)
plt.pie(
    [male_2019, female_2019],
    labels=['男性', '女性'],
    startangle=265,
    textprops={'fontsize': 14},
    autopct='%1.1f%%',
    explode=(0, 0.05),
    shadow=True
)
plt.title('2019年男女总人口占比', fontsize=21)
plt.legend()

# Male share of total population
male = people['男性人口(万人)'][::-1] / 10000
female = people['女性人口(万人)'][::-1] / 10000

male_per = round((male / total_population) * 100, 2)
ax2 = fig.add_subplot(2,2,2)
plt.plot(year, male_per)
plt.xticks(range(1949, 2020, 5))
plt.xlabel('年份')
plt.ylabel('男性占总人口比例(%)')
plt.title('历年男性占总人口比例曲线', fontsize=21)
plt.annotate(xy=(1949, 52), text='1949年\n52%', ha='center', va='bottom')
plt.annotate(xy=(1996, 50.8), text='1996年\n50.8%', ha='center', va='top')
plt.yticks(np.arange(50.5, 53.0, 0.5))

# Male and female population curves
ax3 = fig.add_subplot(2,2,3)
plt.plot(year, male, label='男性', c='#5677fc')
plt.plot(year, female, label='女性', c='#e51c23')
plt.xticks(range(1949, 2020, 5))
plt.xlabel('年份')
plt.ylabel('人口总数（亿人）')
plt.title('历年男女人口增长曲线', fontsize=21)
plt.legend()

# Male-female population gap
ax4 = fig.add_subplot(2,2,4)
plt.plot(year, male - female)
plt.xticks(range(1949, 2020, 5))
plt.xlabel('年份')
plt.ylabel('男女人口差值（亿人）')
plt.title('历年男女人口差值曲线', fontsize=21)
plt.annotate(xy=(1965, 0.1718), text='1965年\n0.1718', ha='center', va='top')
plt.annotate(xy=(2000, 0.4131), text='2000年\n0.4131', ha='center', va='bottom')
plt.yticks(np.arange(0.10, 0.55, 0.05))
plt.show()

Hyhyhyhyhyhyh commented 4 years ago

Urbanization analysis

urban = people['城镇人口(万人)'][::-1] / 10000
country = people['乡村人口(万人)'][::-1] / 10000

fig = plt.figure(figsize=(14,8))

# Share of urban population in 2019
ax1 = fig.add_subplot(2,2,1)
plt.pie(
    [urban[2019], country[2019]],
    labels=['城镇总人口', '乡村总人口'],
    explode=(0, 0.05),
    shadow=True,
    autopct='%1.1f%%',
    startangle=231
)
plt.title('2019年城镇人口占比图', fontsize=21)
plt.legend()

# Urban and rural population trend by year
ax2 = fig.add_subplot(2,2,3)
plt.plot(year, urban, color='#5677fc', label='城镇人口')
plt.plot(year, country, color='#e51c23', label='乡村人口')
plt.xticks(range(1949, 2020, 5), rotation=45)
plt.xlabel('年份')
plt.ylabel('总人口（亿人）')
plt.yticks(range(0,12,1))
plt.title('历年城镇人口变化趋势图', fontsize=21)
plt.axvline(x=2010, ls='--', color='g', label='2010年')
plt.annotate(
    xy=(1995, 8.5947),
    text='1995年乡村人口\n8.6亿',
    ha='center',
    va='bottom',
)
plt.annotate(
    xy=(1995, 3.5174),
    text='1995年城镇人口\n3.52亿',
    ha='center',
    va='top',
)
plt.legend()

# Urbanization rate trend by year
urbanization = round(urban / total_population * 100, 2)

ax3 = fig.add_subplot(2,2,4)
plt.plot(year, urbanization, label='城镇化率')
plt.title('历年城镇化率变化趋势图', fontsize=21)
plt.ylabel('城镇化率(%)')
plt.xlabel('年份')
plt.xticks(range(1949, 2020, 5), rotation=45)
plt.legend()

Hyhyhyhyhyhyh commented 4 years ago

Analysis of population growth rate changes

fig = plt.figure(figsize=(12,5))
plt.plot(year, people['人口出生率(‰)'][::-1], label='人口出生率(‰)', c='r', ls='--')
plt.plot(year, people['人口死亡率(‰)'][::-1], label='人口死亡率(‰)', c='g', ls='--')
plt.plot(year, people['人口自然增长率(‰)'][::-1], label='人口自然增长率(‰)', c='b')
plt.xlabel('年份')
plt.xticks(range(1949, 2020, 5), rotation=45)
plt.title('历年人口自然增长率变化趋势图', fontsize=21)
plt.axhline(y=0, ls='-', color='gray')
plt.legend()

Hyhyhyhyhyhyh commented 4 years ago

Analysis of age structure changes

Complete, continuous age structure data is only available from 1990 onward
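A cutoff like this can be found programmatically rather than by eye. The following is a rough sketch, assuming missing observations are represented as `None` and years are integers. `first_complete_year` is a hypothetical helper and the sample values are made up.

```python
def first_complete_year(values_by_year):
    """Return the earliest year from which the series has no gaps up to its end."""
    start = None
    previous = None
    for year in sorted(values_by_year):
        missing = values_by_year[year] is None
        skipped = previous is not None and year != previous + 1
        if missing:
            start = None      # a missing value breaks the run
        elif start is None or skipped:
            start = year      # begin a new run at this year
        previous = year
    return start


# Made-up values: sparse early observations, then a continuous run
demo_series = {1982: 1.0, 1987: 2.0, 1989: None, 1990: 3.0, 1991: 3.1, 1992: 3.2}
print(first_complete_year(demo_series))
```

For a pandas column, missing cells are usually `NaN` rather than `None`, so the data would need converting first (for example with `dropna()` before building the dictionary).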


# Age structure analysis
fig = plt.figure(figsize=(16,9))

ax1 = fig.add_subplot(2,2,1)
# .iloc[0] turns the one-row frame into a 1-D Series, which plt.pie requires
age_structure_2019 = people[people.year==2019][['0-14岁人口(万人)', '15-64岁人口(万人)', '65岁及以上人口(万人)']].iloc[0]
plt.pie(
    age_structure_2019,
    labels=age_structure_2019.index,
    autopct='%1.1f%%',
    shadow=True,
    startangle=135
)
plt.title('2019年年龄结构占比图', fontsize=21)
plt.legend()

ax2 = fig.add_subplot(2,2,2)
age_structure_1990 = people[people.year==1990][['0-14岁人口(万人)', '15-64岁人口(万人)', '65岁及以上人口(万人)']].iloc[0]
plt.pie(
    age_structure_1990,
    labels=age_structure_1990.index,
    autopct='%1.1f%%',
    shadow=True,
    startangle=110
)
plt.title('1990年年龄结构占比图', fontsize=21)
plt.legend()

# Age structure trend by year
ax3 = fig.add_subplot(2,2,3)
plt.plot(year[year>=1990], people[people.year>=1990]['0-14岁人口(万人)'][::-1], c='r', label='0-14岁人口(万人)', ls='--')
plt.plot(year[year>=1990], people[people.year>=1990]['15-64岁人口(万人)'][::-1], c='g', label='15-64岁人口(万人)')
plt.plot(year[year>=1990], people[people.year>=1990]['65岁及以上人口(万人)'][::-1], c='b', label='65岁及以上人口(万人)', ls='--')
plt.xticks(range(1990, 2020, 3), rotation=45)
plt.title('历年人口结构变化趋势图', fontsize=21)
plt.xlabel('年份')
plt.legend()

# Dependency ratio trend by year
ax4 = fig.add_subplot(2,2,4)
plt.plot(year[year>=1990], people[people.year>=1990]['总抚养比(%)'][::-1], c='r', label='总抚养比(%)')
plt.plot(year[year>=1990], people[people.year>=1990]['少儿抚养比(%)'][::-1], c='g', label='少儿抚养比(%)')
plt.plot(year[year>=1990], people[people.year>=1990]['老年抚养比(%)'][::-1], c='b', label='老年抚养比(%)')
plt.xticks(range(1990, 2020, 3), rotation=45)
plt.title('历年抚养率变化趋势图', fontsize=21)
plt.xlabel('年份')
plt.legend()


![2](https://user-images.githubusercontent.com/44884619/74103468-21535900-4b87-11ea-8cc3-193fe1aa2e8f.png)

Hyhyhyhyhyhyh / data-analysis

China population data analysis #7
