pdfplumber cannot extract the table from the January 26 edition of the in-prefecture cases (January) PDF

amay077 commented 3 years ago

For this PDF (365019.pdf), page.extract_table() on page 1 appears to return None, which causes the subsequent processing to fail.

        with pdfplumber.open(path_pdf) as pdf:
            print(len(pdf.pages))
            for index, page in enumerate(pdf.pages):
                print(index)
                table = page.extract_table()
                print(table)  # None on page 1 of 365019.pdf
                # Raises TypeError when table is None
                df_tmp = pd.DataFrame(table[1:], columns=table[0])
                dfs.append(df_tmp)

Do we need one of the options described at https://github.com/jsvine/pdfplumber/blob/stable/README.md#extracting-tables ?

takainou commented 3 years ago

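Looking at the data, the character-encoding inconsistencies differ from those in the PDF posted on 1/25, so perhaps a different person prepared it.

The PDF posted on 1/27 (365219.pdf) seems to have gone back to the same character-encoding inconsistencies as the 1/25 PDF, but the details for the 49 Toyohashi City cases are not filled in properly. I expect the "new positive cases" (新規陽性者数) count will still include them, but the Toyohashi City figure under "infection status by municipality" (市町村別感染状況) will come out too low.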

amay077 commented 3 years ago

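For now, today's (1/27) import appears to have succeeded (though I would like the Toyohashi City data to be fixed...).

I'll watch it for a few days and close this issue if it turns out to be a one-off.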

imabari commented 3 years ago

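I have the impression that PDFs created with "Microsoft Print To PDF" often fail.

The rows are now single-line, so text-based extraction looks feasible. When extraction fails, please try this: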

table_settings = { "vertical_strategy": "text", "horizontal_strategy": "text", }

page.extract_table(table_settings)

It also seems to pick up the date above the table as text, so it may be better to check whether it is page 1, or to inspect the header row.

if page.page_number == 1:
    df_tmp = pd.DataFrame(table[2:], columns=table[1])
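A minimal sketch of the fallback and header check suggested above, written against plain lists so it does not depend on pandas. It assumes `page` behaves like a pdfplumber page, whose extract_table() returns a list of rows or None. The names `extract_table_with_fallback` and `split_header` are hypothetical, and finding the header by a last cell of "備考" is an assumption to adjust to the real PDF.

```python
# Settings suggested in this thread: infer cell boundaries from text positions.
TEXT_TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_table_with_fallback(page):
    """Return (rows, used_text_strategy) for one pdfplumber-like page.

    Tries the default extraction first and retries with the text
    strategies only when nothing was found.
    """
    table = page.extract_table()
    if table:
        return table, False
    table = page.extract_table(TEXT_TABLE_SETTINGS)
    return (table or []), True


def split_header(rows, header_marker="備考"):
    """Split rows into (header, body), skipping anything above the header.

    The header is assumed to be the first row whose last cell equals
    header_marker (a hypothetical rule, not confirmed by the source PDF).
    Returns (None, rows) when no such row exists.
    """
    for i, row in enumerate(rows):
        if row and row[-1] == header_marker:
            return row, rows[i + 1:]
    return None, rows
```

Searching for the header row instead of hard-coding `table[1]` for page 1 keeps the code working whether or not the date line above the table is picked up.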

imabari commented 3 years ago

With the text strategy, the nationality (国籍) column is empty, so the columns end up shifted.

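dfs = []

for page in pdf.pages:
    # Extract with the text-based settings shown above
    table = page.extract_table(table_settings)
    df_tmp = pd.DataFrame(table)
    dfs.append(df_tmp)

df = pd.concat(dfs)

# Drop header rows (those whose last column is "備考"),
# then drop the first remaining row
df1 = df[~(df.iloc[:, -1] == "備考")].iloc[1:]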
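One way to notice the shift early is to compare the number of cells per row on each page with the expected column count before building a DataFrame. The sketch below is only an illustration under that assumption. `column_count_report` and `pages_with_unexpected_width` are hypothetical helpers, and `tables` is assumed to be a list with one list of rows per page.

```python
from collections import Counter


def column_count_report(tables):
    """Map page number (1-based) to a Counter of row widths on that page."""
    return {
        page_no: Counter(len(row) for row in rows)
        for page_no, rows in enumerate(tables, start=1)
    }


def pages_with_unexpected_width(tables, expected_width):
    """Return page numbers where at least one row is not expected_width wide."""
    report = column_count_report(tables)
    return [
        page_no
        for page_no, widths in report.items()
        if any(width != expected_width for width in widths)
    ]
```

Pages flagged this way could be re-extracted with different settings or inspected by hand, rather than silently producing shifted columns.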

imabari commented 3 years ago

Could this tool be used to deal with the garbled Kangxi Radicals / CJK Radicals Supplement characters? https://github.com/trueroad/pdf-fix-tuc

amay077 commented 3 years ago

I tried the pdf-fix-tuc installation steps

$ git clone https://github.com/trueroad/pdf-fix-tuc.git
$ cd pdf-fix-tuc
$ ./autogen.sh
$ mkdir build
$ cd build
$ ../configure
$ make
$ make install

in the Docker container we currently use, but either configure or make fails with an error, so it looks a bit troublesome (I'm not very familiar with gcc...).

imabari commented 3 years ago

apt update
apt install build-essential
apt install autoconf automake libtool
apt install libqpdf-dev

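I'm not very familiar with this either, but after installing the packages above I was able to build it. I don't know whether a binary built once can simply be copied over and used, though.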

imabari commented 3 years ago

https://github.com/imabari/covid19-data/blob/master/aichi/aichi_patients_check.ipynb

Made with tabula-py.

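For a PDF that has already been converted, I now check its hash and reuse the DataFrame. This is the first time I've written something like this, so please review it.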
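For readers who do not want to open the notebook, the idea can be sketched roughly as follows. This is not the notebook's code: it caches plain rows as JSON instead of a DataFrame. `load_or_convert` and its `convert` argument are hypothetical stand-ins for the real PDF-to-table step.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_or_convert(pdf_path, cache_dir, convert):
    """Return rows for pdf_path, reusing a cached result when the hash matches.

    convert is a hypothetical callable (path -> list of rows); the rows
    must be JSON-serializable.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_sha256(pdf_path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    rows = convert(pdf_path)
    cache_file.write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")
    return rows
```

Keying the cache on the file contents rather than the file name means a PDF that is re-published under the same name with corrections is converted again.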

Some entries had a space in the gender field and were not being counted, so please fix this as well:

df["年代・性別"] = df["年代・性別"].str.normalize("NFKC").str.replace(" ", "")
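The same cleanup can be illustrated on a single string with the standard library. `clean_cell` is a hypothetical helper name. NFKC also maps Kangxi Radicals to the ordinary ideographs, but as far as I know most CJK Radicals Supplement characters have no compatibility mapping, so those may still need an explicit table or a tool such as pdf-fix-tuc.

```python
import unicodedata


def clean_cell(value):
    """NFKC-normalize a cell and remove ASCII spaces.

    NFKC turns the ideographic space (U+3000) into an ASCII space first,
    so the replace() afterwards removes both kinds.
    """
    if value is None:
        return None
    return unicodedata.normalize("NFKC", value).replace(" ", "")


# Ideographic space inside the gender part
print(clean_cell("40代男\u3000性"))
# U+2F25 (Kangxi radical "woman") becomes the ordinary ideograph U+5973
print(clean_cell("\u2f25性") == "\u5973性")
```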

imabari commented 3 years ago

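With camelot, processing in 50-page chunks takes about 20 minutes.

If the DataFrame is reused, this would only run about once a month, so the question is whether that is acceptable.

With tabula it finishes in about 44 seconds.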

imabari commented 3 years ago

Sample of camelot with 50-page chunks

!apt install python3-tk ghostscript
!pip install camelot-py[cv]
!pip install more-itertools

import camelot
import pandas as pd
from more_itertools import chunked

# Get the list of page numbers (note: _get_pages is a private camelot method)
handler = camelot.handlers.PDFHandler("data.pdf")
pages = handler._get_pages("data.pdf", pages="all")

# Build the list of page ranges, 50 pages per chunk
pages_list = [str(i[0]) if i[0] == i[-1] else f"{i[0]}-{i[-1]}" for i in chunked(pages, 50)]
pages_list

dfs = []

for page in pages_list:

    tables = camelot.read_pdf(
        "data.pdf",
        pages=page,
        split_text=True,
        strip_text=" \n",
        line_scale=40,
    )

    for table in tables:
        dfs.append(pd.DataFrame(table.data[1:], columns=table.data[0]))

df = pd.concat(dfs)

df.shape
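Since `_get_pages` is a private method whose signature may differ between camelot versions, the page-range strings could also be built from a page count alone. This is a minimal sketch. `page_ranges` is a hypothetical helper, and the total page count is assumed to come from elsewhere, for example len(pdf.pages) with pdfplumber as in the first comment.

```python
def page_ranges(total_pages, chunk_size=50):
    """Return page-range strings such as "1-50" covering pages 1..total_pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A chunk holding a single page is written as "101", not "101-101"
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(page_ranges(120))
```

Each returned string can be passed as the `pages` argument in the loop above.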

amay077 commented 3 years ago

https://github.com/code4nagoya/covid19-aichi-tools/issues/81#issuecomment-771558222

For now, the import on the morning of 2/3 ran normally, and the case PDFs up to 2/2 appear to have been imported. https://raw.githubusercontent.com/code4nagoya/covid19/master/data/patients.csv

https://github.com/code4nagoya/covid19-aichi-tools/issues/90#issuecomment-771625050

So camelot's PDF page splitting is slow... In that case tabula-py seems to be the way to go. I'll try it, thank you.

takainou commented 3 years ago

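This is not about the table-extraction failure, but in the February case list PDF as of the morning of 2/7 (366620.pdf), the details (age group and gender / place of residence / contact status) for 7 Toyohashi City cases are missing again. https://github.com/code4nagoya/covid19-aichi-tools/issues/90#issuecomment-767945068 I expect the "new positive cases" (新規陽性者数) count will still include them, but the Toyohashi City figure under "infection status by municipality" (市町村別感染状況) will come out too low. The numeric part of the "aged 70 and over" (70歳以上) indicator lags by one day (the number of people tested is only complete a day later), so if the PDF is fixed tomorrow we can avoid displaying an undercount.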

imabari commented 3 years ago

Converting the PDF with Excel's "Get Data" feature may work better. It also converts the 1/26 data successfully.

code4nagoya / covid19-aichi-tools

pdfplumber cannot extract the table from the January 26 edition of the in-prefecture cases (January) PDF #90