djay / covidthailand

Thailand Covid testing and case data gathered and combined from various sources for others to download or view

Solution Proposal for Vac Report Table #46

Closed porames closed 3 years ago

porames commented 3 years ago

The order of the columns remains the same, so we can just keep the rows that contain a number and add the column names afterward.

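import os

import camelot
import pandas as pd
import requests

def parse_raw(url):
  # Download the report PDF to a temporary file.
  os.makedirs("tmp", exist_ok=True)
  response = requests.get(url)
  with open("tmp/daily_report.pdf", "wb") as file:
    file.write(response.content)
  # Read the tables on pages 2 and 3.
  tables = camelot.read_pdf('tmp/daily_report.pdf', pages='2,3', split_text=True)
  frames = []
  for i in range(2):
    df = tables[i].df
    # Keep only the rows whose second column is a number (this drops header rows).
    df = df[df[1].str.isdigit()]
    df = df.drop([2], axis=1)
    frames.append(df)
  # DataFrame.append was removed in pandas 2.0, so combine with pd.concat.
  raw_table = pd.concat(frames, ignore_index=True)
  table_dict = raw_table.transpose().to_dict()
  rows = []
  for row_num in table_dict:
    cleaned_row = []
    for (key, value) in table_dict[row_num].items():
      # A cell may hold several values separated by newlines; split them into separate columns.
      for col in value.replace(" ", "").split('\n'):
        if col:
          cleaned_row.append(col)
    rows.append(cleaned_row)
  cleaned_table = pd.DataFrame(rows)
  return cleaned_table

df = parse_raw("https://ddc.moph.go.th/vaccine-covid19/getFiles/9/1628485849393.pdf")
# Copy the slice so that renaming its columns does not touch df.
test = df.iloc[:, 0:12].copy()
test.columns = ["Health Area", "Population", "Vac Allocated AstraZeneca", "Vac Allocated Sinovac", "Vac Allocated Pfizer", "Vac Allocated Total", "Vac Given 1 Cum", "Vac Given 1 %", "Vac Given 2 Cum", "Vac Given 2 %", "Vac Given 3 Cum", "Vac Given 3 %"]
# display() is available in Jupyter/IPython; use print(test) elsewhere.
display(test)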
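
To try the row-filtering idea without downloading a PDF, here is a minimal sketch on a hand-made DataFrame. The sample values and the short column names are made up for illustration. The frame only imitates the shape of a camelot table (integer column labels, string cells, some cells packing several values with newlines) and is not taken from a real report.

```python
import pandas as pd

# Hypothetical stand-in for a camelot table: the header row holds text,
# data rows hold digits, and some cells pack several values with "\n".
raw = pd.DataFrame([
    ["Area", "Population", "unused", "Doses"],
    ["1", "5000", "x", "100\n2.0"],
    ["2", "7000", "y", "300\n4.3"],
])

def clean_table(df):
    # Keep only rows whose second column is purely digits (drops the header row).
    data = df[df[1].str.isdigit()]
    # Drop column 2, mirroring the code above.
    data = data.drop(columns=[2])
    rows = []
    for _, row in data.iterrows():
        cells = []
        for value in row:
            # Remove spaces, then split newline-packed values into separate cells.
            cells.extend(part for part in value.replace(" ", "").split("\n") if part)
        rows.append(cells)
    return pd.DataFrame(rows)

cleaned = clean_table(raw)
# Column names are added afterward; these four are illustrative only.
cleaned.columns = ["Health Area", "Population", "Doses", "Doses %"]
print(cleaned)
```

Note that `str.isdigit()` is false for strings containing commas or decimal points, so this filter assumes the second column holds plain integers.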

djay commented 3 years ago

@porames Thanks for this, but without a working PR it's really hard to determine whether it works. I've now added a test framework that should make it easier to add and test code like this: https://github.com/djay/covidthailand/blob/main/tests/test_scraping.py#L85 The idea is to add JSON files for just the specific scrapes where something changed. The tests pick up those files, download just the appropriate file, and check that your new code comes up with the same result. It still needs to be extended to cover more parts of the code, but I hope it makes it easier for you, @pmdscully, and anyone else to contribute.
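
For readers unfamiliar with this style of testing, the following is a rough sketch of the general pattern of comparing scraper output against stored JSON files. It is not the code in `tests/test_scraping.py`; the function names, the fixture file names, and the fake scraper are all hypothetical, and the download step is replaced by a stub.

```python
import json
import tempfile
from pathlib import Path

def load_fixtures(fixture_dir):
    """Yield (name, expected) pairs for every JSON file in fixture_dir."""
    for path in sorted(Path(fixture_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            yield path.stem, json.load(f)

def check_scraper(scrape, fixture_dir):
    """Run scrape(name) for each fixture and return the names whose output differs."""
    failures = []
    for name, expected in load_fixtures(fixture_dir):
        if scrape(name) != expected:
            failures.append(name)
    return failures

# Hypothetical scraper stub: a real one would download and parse the report
# identified by `name` instead of returning a constant.
def fake_scrape(name):
    return {"report": name, "Vac Given 1 Cum": 100}

with tempfile.TemporaryDirectory() as tmp:
    # One fixture that matches the stub's output and one that does not.
    Path(tmp, "2021-08-09.json").write_text(
        json.dumps({"report": "2021-08-09", "Vac Given 1 Cum": 100}), encoding="utf-8")
    Path(tmp, "2021-08-10.json").write_text(
        json.dumps({"report": "2021-08-10", "Vac Given 1 Cum": 999}), encoding="utf-8")
    failures = check_scraper(fake_scrape, tmp)

print(failures)
```

Keeping one small JSON file per changed report means a new parser only has to reproduce the stored results for those reports, rather than re-running every scrape.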