covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License

San diego data is wrong #17

Closed raysalem closed 4 years ago

raysalem commented 4 years ago

The website data is below. Note that this is a matrix: the scraper needs to sum all three columns and, to err toward positives, also include the presumptive positives:

| | San Diego County | Federal Quarantine | Non-San Diego County Residents |
|---|---|---|---|
| Positive (confirmed cases) | 0 | 2 | 0 |
| Presumptive Positive | 8 | 1 | 0 |
| Pending Results | 38 | 6 | 4 |
| Negative | 99 | 11 | 8 |
| Total Tested | 145 | 20 | 12 |

URL https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html

Scraper code:

{
    county: 'San Diego County',
    state: 'CA',
    country: 'USA',
    url: 'https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html',
    scraper: async function() {
      let $ = await fetch.page(this.url);

      let cases = parse.number($('td:contains("Positive (confirmed cases)")').next('td').text()) + parse.number($('td:contains("Presumptive Positive")').next('td').text());
      return {
        cases: cases,
        tested: parse.number($('td:contains("Total Tested")').next('td').text())
      };
    }
}

I would fix this, but I don't know JavaScript.
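The arithmetic being asked for can be sketched in Python. This is only an illustration of the intended sums, with the figures from the table above hard-coded; the `table` dictionary and the `row_total` helper are hypothetical names and not part of the scraper:

```python
# Figures copied from the county table above; each list is
# [San Diego County, Federal Quarantine, Non-San Diego County Residents].
table = {
    "Positive (confirmed cases)": [0, 2, 0],
    "Presumptive Positive": [8, 1, 0],
    "Pending Results": [38, 6, 4],
    "Negative": [99, 11, 8],
    "Total Tested": [145, 20, 12],
}


def row_total(label):
    """Sum one row of the table across all three columns."""
    return sum(table[label])


# Count presumptive positives as cases, erring toward positives.
cases = row_total("Positive (confirmed cases)") + row_total("Presumptive Positive")
tested = row_total("Total Tested")

print(cases, tested)  # 11 177
```

By contrast, `.next('td')` in the scraper appears to read only the first data column, which would give 0 + 8 = 8 cases and 145 tested.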

lazd commented 4 years ago

Thanks for the report, fixed!

And hey, this is a great excuse to learn JavaScript.

raysalem commented 4 years ago

The data is still wrong for San Diego, and we might want something like the following.

This is what the page https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html now shows:

Positive Cases in San Diego County Since February 14, 2020
Coronavirus Disease 2019 (COVID-19)
Updated March 17, 2020

| COVID-19 Case Summary | San Diego County Residents | Federal Quarantine | Non-San Diego County Residents | Total |
|---|---|---|---|---|
| Total Positives | 51 | 5 | 4 | 60 |
| Age Groups | | | | |
| 0-17 years | 0 | 0 | 0 | 0 |
| 18-64 years | 43 | 1 | 3 | 47 |
| 65+ years | 8 | 4 | 1 | 13 |
| Age Unknown | 0 | 0 | 0 | 0 |
| Gender | | | | |
| Female | 17 | 2 | 2 | 21 |
| Male | 34 | 3 | 2 | 39 |
| Unknown | 0 | 0 | 0 | 0 |
| Hospitalized | 8 | 1 | 1 | 10 |
| Deaths | 0 | 0 | 0 | 0 |

The scraper is currently reporting zeros because the page layout has changed. A Python solution is:

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup
from IPython.display import HTML, display  # needed for display(HTML(...)) in a notebook

URL = 'https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find("div", {"class": "table parbase section"})
rows = table.find_all('tr')

# handle header
header = [cell.text for cell in rows[1].find_all('td')]
header = [re.sub('[ \t\n]+', ' ', h) for h in header]

tbl = {}
for row in rows[2:]:  # skip the title row and the header row
    data = [r.text for r in row.find_all('td')]
    if data[1] == '\xa0':  # section rows ("Age Groups", "Gender") hold no numbers
        continue
    tbl[data[0]] = [int(d) for d in data[1:]]

df = pd.DataFrame(tbl, index=header[1:])
display(HTML(df.to_html()))
updateDateTime = rows[0].find('td').text.split('\n')[-1].replace("Updated", "")
print("updateDateTime %s" % updateDateTime)

This will generate:

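| | Total Positives | 0-17 years | 18-64 years | 65+ years | Age Unknown | Female | Male | Unknown | Hospitalized | Deaths |
|---|---|---|---|---|---|---|---|---|---|---|
| San Diego County Residents | 51 | 0 | 43 | 8 | 0 | 17 | 34 | 0 | 8 | 0 |
| Federal Quarantine | 5 | 0 | 1 | 4 | 0 | 2 | 3 | 0 | 1 | 0 |
| Non-San Diego County Residents | 4 | 0 | 3 | 1 | 0 | 2 | 2 | 0 | 1 | 0 |
| Total | 60 | 0 | 47 | 13 | 0 | 21 | 39 | 0 | 10 | 0 |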

updateDateTime = March 17, 2020
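As a sanity check on the parsed table, the three residency columns should add up to the Total column in every data row. Here is a minimal standard-library sketch using the March 17 figures quoted above, hard-coded rather than scraped; the `rows` literal and the `check_totals` helper are hypothetical names for illustration:

```python
NBSP = '\xa0'  # placeholder text found in the cells of section rows

# (label, San Diego County Residents, Federal Quarantine,
#  Non-San Diego County Residents, Total), copied from the page text above.
rows = [
    ("Total Positives", "51", "5", "4", "60"),
    ("Age Groups", NBSP, NBSP, NBSP, NBSP),
    ("0-17 years", "0", "0", "0", "0"),
    ("18-64 years", "43", "1", "3", "47"),
    ("65+ years", "8", "4", "1", "13"),
    ("Age Unknown", "0", "0", "0", "0"),
    ("Gender", NBSP, NBSP, NBSP, NBSP),
    ("Female", "17", "2", "2", "21"),
    ("Male", "34", "3", "2", "39"),
    ("Unknown", "0", "0", "0", "0"),
    ("Hospitalized", "8", "1", "1", "10"),
    ("Deaths", "0", "0", "0", "0"),
]


def check_totals(rows):
    """Return {label: total} for data rows; raise if a row does not add up."""
    totals = {}
    for label, *cells in rows:
        if cells[0] == NBSP:  # section heading row, no numbers to check
            continue
        *parts, total = (int(c) for c in cells)
        if sum(parts) != total:
            raise ValueError(f"{label}: {parts} does not sum to {total}")
        totals[label] = total
    return totals


totals = check_totals(rows)
print(totals["Total Positives"])  # 60
```

A check like this would fail loudly if the county changes the layout again, instead of silently reporting zeros.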