covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Testing Data missing Starting 06/08 for Pennsylvania Counties #468

Closed jzohrab closed 3 years ago

jzohrab commented 3 years ago

Original issue https://github.com/covidatlas/coronadatascraper/issues/1055, transferred here on Monday Jun 22, 2020 at 14:00 GMT


Hi Team, County Level Testing data for Pennsylvania is missing in the CSV (Tidy Format) starting 6/8/2020. Could you please look into this? Thanks, Pankaj

jzohrab commented 3 years ago

(Transferred comment)

Thanks Pankaj. Unfortunately we're not going to be able to look into this for a while, due to our existing backlog. If you or a friend/colleague/contact can look into the source code and data, that would be a great help to expedite the fix. Cheers! jz

jzohrab commented 3 years ago

(Transferred comment)

@jzohrab: Would you know of any alternate source where this information is available?

jzohrab commented 3 years ago

(Transferred comment)

Hm ... not offhand. I've resorted to Google and then poking around on government sites in order to find good sources. If you find one, please let me know! jz

jzohrab commented 3 years ago

(Transferred comment)

The Pennsylvania Department of Health has started reporting the data in PDF format as of 06/09. Starting 06/10 the PDFs are in a consistent format. All the archived files are available at this link

I wrote a small script to collect all the PDF links (I'm sure you'll have a better way to code this, as I'm new to Python). Now I'm trying to convert the PDFs into a dataframe to use this data.

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd  # for the later PDF -> dataframe step

output_list = []
archive_pages = [
    'https://www.health.pa.gov/topics/disease/coronavirus/Pages/Archives.aspx',
    'https://www.health.pa.gov/topics/disease/coronavirus/Pages/June-Archive.aspx',
]

for page in archive_pages:
    resp = urllib.request.urlopen(page)
    soup = BeautifulSoup(resp, 'html.parser',
                         from_encoding=resp.info().get_param('charset'))
    # Collect the daily "COVID-19 County Data" PDF links on each archive page.
    for link in soup.find_all('a', href=True):
        href = link['href']
        if 'PDF' not in href.upper() or 'COVID-19%20County%20Data' not in href:
            continue
        if '6-9-2020' in href:  # the 6/9 report uses an inconsistent layout
            continue
        # urljoin avoids a doubled slash when href already starts with '/'
        output_list.append(urljoin('https://www.health.pa.gov/', href))

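For what it's worth, the href filter in the script above can be pulled into a small helper that is testable without hitting the site. The sample paths below are made up for illustration; `urljoin` avoids the doubled slash you get from concatenating `main_url` with an href that already starts with `/`.

```python
from urllib.parse import urljoin

BASE = 'https://www.health.pa.gov/'

def county_pdf_links(hrefs, skip_dates=('6-9-2020',)):
    """Return absolute URLs for the daily county-data PDFs,
    skipping dates whose report layout is inconsistent."""
    out = []
    for href in hrefs:
        if 'PDF' not in href.upper() or 'COVID-19%20County%20Data' not in href:
            continue
        if any(d in href for d in skip_dates):
            continue
        out.append(urljoin(BASE, href))
    return out

# Hypothetical hrefs, shaped like the archive-page links:
links = county_pdf_links([
    '/topics/Documents/COVID-19%20County%20Data%206-10-2020.pdf',
    '/topics/Documents/COVID-19%20County%20Data%206-9-2020.pdf',
    '/topics/Pages/other.aspx',
])
print(links)  # only the 6-10-2020 PDF survives the filter
```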
jzohrab commented 3 years ago

(Transferred comment)

This project only runs nodejs sources. Extracting from PDFs is tough/annoying. Thanks for the link, I'll keep it in mind for when we port PA to the new system. Cheers! jz

jzohrab commented 3 years ago

Dup of #478.