cheerup-ehime / cheerup-ehime.github.io

Jekyll site for this organization

MIT License


Add a script that saves volunteer counts to CSV #62

Closed: sassy closed this 6 years ago

sassy commented 6 years ago

This addresses issue 54.

The script saves the volunteer counts from https://ehimesvc.jp/?p=70 to CSV. It requires ChromeDriver, so please install that before running it.

(I tried to plot the data with pandas, but it looked like it would take a while, so I stopped at CSV output.)

kkd commented 6 years ago

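I tried the same thing with ChromeDriver early on but threw that code away, so this is much appreciated. I already have the pandas plotting working, so I'll combine it with this and automate the whole thing. Thank you!!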

imabari commented 6 years ago

You can fetch it normally without ChromeDriver. The TablePress table is only hidden from view; its tags appear to be present in the HTML as-is.

import csv

import requests
from bs4 import BeautifulSoup

url = 'https://ehimesvc.jp/?p=70'

headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'
}

r = requests.get(url, headers=headers)

if r.status_code == requests.codes.ok:

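    soup = BeautifulSoup(r.content, 'html.parser')

    # Collect the text of every header and data cell, one list per table row
    table = [[td.get_text(strip=True) for td in tr.find_all(['th', 'td'])]
             for tr in soup.select_one('#tablepress-7').select('tr')]

    # newline='' leaves line endings to csv.writer; utf-8 keeps the Japanese text intact
    with open('volunteer.tsv', 'w', newline='', encoding='utf-8') as fw:

        writer = csv.writer(fw, dialect='excel-tab', lineterminator='\n')
        writer.writerows(table)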

imabari commented 6 years ago

pandas can read it in directly.

import pandas as pd

url = 'https://ehimesvc.jp/?p=70'
dfs = pd.read_html(url, index_col=0)
df = dfs[0]

kkd commented 6 years ago

@imabari Seriously!! pandas can read it in directly? pandas is frighteningly capable....

imabari commented 6 years ago

For the volunteer-count scraping, I've taken it as far as building DataFrames for the day-over-day and week-over-week changes.

import pandas as pd

url = 'https://ehimesvc.jp/?p=70'
dfs = pd.read_html(url, index_col=0, na_values=['活動中止', '終了', '休止'])
df = dfs[0]

# Drop rows where every value is missing
df.dropna(how='all', inplace=True)

# Drop the total row (labeled '合計')
df.drop('合計', inplace=True)

df

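# Day-over-day change (difference from the previous row)
df_pre_day = df.diff().fillna(0)
df_pre_day

# Week-over-week change (difference from 7 rows earlier)
df_pre_week = df.diff(7).fillna(0)
df_pre_week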

sassy commented 6 years ago

Wow. pandas is amazing.

kkd commented 6 years ago

@imabari I see, with df.diff() there's no need to compute the date differences with timedelta. The code got so much shorter... I clearly need more practice with pandas....
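As a minimal sketch of what `df.diff()` does here, the snippet below compares it with the manual "subtract the previous row" form. The column name and the numbers are made up for illustration and are not taken from the actual table.

```python
import pandas as pd

# Hypothetical sample data; the column name and values are invented
df = pd.DataFrame(
    {'count': [100, 120, 90, 150]},
    index=pd.to_datetime(['2018-07-14', '2018-07-15',
                          '2018-07-16', '2018-07-17']),
)

# diff() subtracts the previous row from each row; the first row has no
# predecessor, so it is NaN until fillna(0) replaces it
day_over_day = df.diff().fillna(0)

# Equivalent manual form: subtract the frame shifted down by one row
manual = (df - df.shift(1)).fillna(0)

print(day_over_day)
```

Both forms work on row position, so no date arithmetic is involved at all.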

sassy commented 6 years ago

@imabari You're welcome to submit that as a Pull Request, you know.

imabari commented 6 years ago

@sassy I'm not used to GitHub, so I don't know how.

sassy commented 6 years ago

@imabari Understood! I'll use your information as a reference and make the changes when I have some free time. Thank you!

imabari commented 6 years ago

It would be nice to convert the dates to datetime, but pd.to_datetime(df['日付'], format='%m月%d日') sets the year to 1900 and I don't know how to correct it.
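One possible way to correct the year, sketched under the assumption that every row falls in 2018: parse the month and day first, then overwrite the year on each parsed Timestamp. The sample strings below are hypothetical stand-ins for the `日付` column.

```python
import pandas as pd

# Hypothetical month/day strings in the same format as the table
dates = pd.Series(['7月14日', '7月15日', '7月16日'])

# With no year in the format, the parsed year defaults to 1900
parsed = pd.to_datetime(dates, format='%m月%d日')

# Assumption: all rows belong to 2018, so replace the year on each Timestamp
fixed = parsed.map(lambda ts: ts.replace(year=2018))

print(parsed.dt.year.unique())
print(fixed)
```

A hard-coded year only holds while the data stays within a single calendar year.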

sassy commented 6 years ago

I've updated the PR!

imabari commented 6 years ago

I managed to convert the dates to datetime as well, so please use this version.

import pandas as pd

url = 'https://ehimesvc.jp/?p=70'
dfs = pd.read_html(url, index_col=0, na_values=['活動中止', '終了', '休止'])
df = dfs[0]

# Drop rows where every value is missing
df.dropna(how='all', inplace=True)

# Drop the total row (labeled '合計')
df.drop('合計', inplace=True)

# Reset the index so the date becomes a regular column
df.reset_index(inplace=True)

# Prepend the year 2018, then parse the full date
df['日付'] = '2018年' + df['日付']
df['日付'] = pd.to_datetime(df['日付'], format='%Y年%m月%d日')

df.set_index('日付', inplace=True)

df

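# Day-over-day change (difference from the previous row)
df_pre_day = df.diff().fillna(0)
df_pre_day

# Week-over-week change (difference from 7 rows earlier)
df_pre_week = df.diff(7).fillna(0)
df_pre_week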

# Write the results out as CSV
df.to_csv('volunteer.csv')
df_pre_day.to_csv('volunteer_day.csv')
df_pre_week.to_csv('volunteer_week.csv')
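One caveat worth illustrating: `df.diff(7)` compares rows that are 7 positions apart, so once `dropna` has removed some days, that gap may no longer be exactly one week. With a datetime index, a calendar-based comparison is possible. The following is a rough sketch on made-up data (the column name, dates, and values are hypothetical), not a change to the script above.

```python
import pandas as pd

# Hypothetical daily counts with one day (2018-07-18) missing
idx = pd.to_datetime(['2018-07-14', '2018-07-15', '2018-07-16', '2018-07-17',
                      '2018-07-19', '2018-07-20', '2018-07-21', '2018-07-22'])
df = pd.DataFrame({'count': [10, 20, 30, 40, 50, 60, 70, 80]}, index=idx)

# Row-based: each row minus the row 7 positions earlier
row_based = df.diff(7)

# Calendar-based: move every value forward 7 days on the index, then
# line it up with the original dates and subtract
week_ago = df.shift(7, freq='D').reindex(df.index)
calendar_based = df - week_ago

print(row_based)
print(calendar_based)
```

In this sample the last row of `row_based` is compared with a date 8 days earlier, while `calendar_based` compares it with the date exactly 7 days earlier and leaves NaN where no such date exists.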