covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Testing data missing starting 06/08 for Pennsylvania counties #478

Open arorapankaj opened 4 years ago

arorapankaj commented 4 years ago

Hi team, county-level testing data for Pennsylvania is missing from the CSV (tidy format) starting 6/8/2020. Could you please look into this? Thanks, Pankaj

jzohrab commented 4 years ago

Thanks Pankaj. Unfortunately we're not going to be able to look into this for a while, due to our existing backlog. If you or a friend/colleague/contact can look into the source code and data, that would be a great help to expedite the fix. Cheers! jz

arorapankaj commented 4 years ago

@jzohrab: Would you know of any alternate source where this information is available?

jzohrab commented 4 years ago

Hm ... not offhand. I've resorted to Google and then poking around on government sites in order to find good sources. If you find one, please let me know! jz

arorapankaj commented 4 years ago

The Pennsylvania Department of Health has started reporting the data in PDF format from 06/09. Starting 06/10 the PDFs are in a consistent format. All the archived files are available at this link

I wrote a small script to collect all the PDF links (I am sure you'll have a better way to code this, as I am new to Python). I am now trying to convert the PDFs to a dataframe to use this data; a rough sketch of that step follows the script below.

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd  # used later, once the PDFs are converted to a dataframe

BASE_URL = 'https://www.health.pa.gov'

def collect_pdf_links(page_url, exclude=()):
    # Collect links to the county-data PDFs on one archive page.
    resp = urllib.request.urlopen(page_url)
    soup = BeautifulSoup(resp, 'html.parser',
                         from_encoding=resp.info().get_param('charset'))
    links = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if ('PDF' in href.upper()
                and 'COVID-19%20County%20Data' in href
                and not any(skip in href for skip in exclude)):
            links.append(BASE_URL + href)
    return links

output_list = collect_pdf_links(
    BASE_URL + '/topics/disease/coronavirus/Pages/Archives.aspx')
# The 6-9-2020 report uses an inconsistent layout, so skip it.
output_list += collect_pdf_links(
    BASE_URL + '/topics/disease/coronavirus/Pages/June-Archive.aspx',
    exclude=('6-9-2020',))
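
For the PDF-to-dataframe step, here is a rough sketch using pdfplumber (my own assumption about the tooling; it also presumes the county table is the first table on the first page of each report and that its first row holds the column headers):

import urllib.request
import pdfplumber
import pandas as pd

def pdf_to_dataframe(pdf_path):
    # Assumption: the county table is the first table on the first page,
    # with the column headers in its first row.
    with pdfplumber.open(pdf_path) as pdf:
        rows = pdf.pages[0].extract_table()
    return pd.DataFrame(rows[1:], columns=rows[0])

# Try it on the first archived report collected above.
local_path, _ = urllib.request.urlretrieve(output_list[0])
df = pdf_to_dataframe(local_path)
print(df.head())

If the layout shifts between reports, the page index or the extract_table settings would need adjusting per file.
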
jzohrab commented 4 years ago

This project only runs nodejs sources, and extracting from PDFs is tough/annoying. Thanks for the link, I'll keep it in mind for when we port PA to the new system. Cheers! jz