covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Testing Data missing Starting 06/08 for Pennsylvania Counties #468

Closed jzohrab closed 3 years ago

jzohrab commented 3 years ago

Original issue https://github.com/covidatlas/coronadatascraper/issues/1055, transferred here on Monday Jun 22, 2020 at 14:00 GMT


Hi Team, County Level Testing data for Pennsylvania is missing in the CSV (Tidy Format) starting 6/8/2020. Could you please look into this? Thanks, Pankaj

jzohrab commented 3 years ago

(Transferred comment)

Thanks Pankaj. Unfortunately we're not going to be able to look into this for a while, due to our existing backlog. If you or a friend/colleague/contact can look into the source code and data, that would be a great help to expedite the fix. Cheers! jz

jzohrab commented 3 years ago

(Transferred comment)

@jzohrab: Would you know of any alternate source where this information is available?

jzohrab commented 3 years ago

(Transferred comment)

Hm ... not offhand. I've resorted to Google and then poking around on government sites in order to find good sources. If you find one, please let me know! jz

jzohrab commented 3 years ago

(Transferred comment)

The Pennsylvania Department of Health has started reporting the data in PDF format as of 06/09. Starting 06/10 the PDFs are in a consistent format. All the archived files are available at this link

I wrote a small script to collect all the PDF links (I'm sure you'll have a better way to code this, as I'm new to Python). Now I'm trying to convert the PDFs into a dataframe to use this data.

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd  # for the later PDF -> dataframe step

output_list = []
archive_pages = [
    'https://www.health.pa.gov/topics/disease/coronavirus/Pages/Archives.aspx',
    'https://www.health.pa.gov/topics/disease/coronavirus/Pages/June-Archive.aspx',
]

for page in archive_pages:
    resp = urllib.request.urlopen(page)
    soup = BeautifulSoup(resp, 'html.parser',
                         from_encoding=resp.info().get_param('charset'))
    # Collect the daily "COVID-19 County Data" PDF links on each archive page.
    for link in soup.find_all('a', href=True):
        href = link['href']
        if 'PDF' not in href.upper() or 'COVID-19%20County%20Data' not in href:
            continue
        if '6-9-2020' in href:  # the 6/9 report uses an inconsistent layout
            continue
        # urljoin avoids a doubled slash when href already starts with '/'
        output_list.append(urljoin('https://www.health.pa.gov/', href))

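For what it's worth, the href filter in the script above can be pulled into a small helper that is testable without hitting the site. The sample paths below are made up for illustration; `urljoin` avoids the doubled slash you get from concatenating `main_url` with an href that already starts with `/`.

```python
from urllib.parse import urljoin

BASE = 'https://www.health.pa.gov/'

def county_pdf_links(hrefs, skip_dates=('6-9-2020',)):
    """Return absolute URLs for the daily county-data PDFs,
    skipping dates whose report layout is inconsistent."""
    out = []
    for href in hrefs:
        if 'PDF' not in href.upper() or 'COVID-19%20County%20Data' not in href:
            continue
        if any(d in href for d in skip_dates):
            continue
        out.append(urljoin(BASE, href))
    return out

# Hypothetical hrefs, shaped like the archive-page links:
links = county_pdf_links([
    '/topics/Documents/COVID-19%20County%20Data%206-10-2020.pdf',
    '/topics/Documents/COVID-19%20County%20Data%206-9-2020.pdf',
    '/topics/Pages/other.aspx',
])
print(links)  # only the 6-10-2020 PDF survives the filter
```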
jzohrab commented 3 years ago

(Transferred comment)

This project only runs nodejs sources. Extracting from PDFs is tough/annoying. Thanks for the link, I'll keep it in mind for when we port PA to the new system. Cheers! jz

jzohrab commented 3 years ago

Dup of #478.