CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

County Data - Allow community creation, editing, and verification of this data for #558

Open ciscorucinski opened 4 years ago

ciscorucinski commented 4 years ago

📌 Ongoing Information 📌

Website: Corona Data Scraper. Download data and view sources.

GitHub: Corona Data Scraper. Help write scraping rules; see the README.

Google Doc: COVID-19 Community Data Collection (public + comment access). Comment with information and sources. Help us acquire valid, official data sources at all levels: county, state, and country.

Slack: COVID Atlas. First join, then go to the COVID Atlas Slack.

Background

It is clear that the team at @CSSEGISandData cannot accommodate and scale with the huge influx of new cases within the US. Therefore, it is perfectly reasonable that they abandoned county-level reporting of cases. I think when people look at the decision with an unbiased and open mind, they will see that this was the right balance to remain as helpful as possible. With that said, it is sad to see the county-level information abandoned completely. It was very helpful!

I remember seeing a +3 increase in Wisconsin and wondering exactly where those cases were located; I had to search for and read a few articles to verify. This chart could have provided that detail very fast!

But again, the current processes cannot scale to the number of new cases. So we have to change the processes if we want to bring this back, and the sooner the better.

Suggestion

So, I suggest adding some way for the community, who deeply care about this information, to help @CSSEGISandData gather information that is as accurate as possible. You know what type of information needs to be registered for each new case, and that baton can be passed on to us to find, report, and verify (with verification probably being the biggest aspect of this effort).

I have seen a lot of people report new data as Issues and this new tool would be the preferred method to report those cases. Maybe they would have to provide an article link. A number of people could verify that information along with location information. This verification could go through multiple steps if needed, but @CSSEGISandData would have the final say in including the data after their own review and verification process.

Ideas

The following are some ideas of how the processes could work.

Other Benefits

An extremely positive benefit of this approach is that other countries could start providing their own more-localized data, and @CSSEGISandData could entrust a "country representative" (the CDC, or a respected university in said country) to review and verify that country's more-localized data.

ciscorucinski commented 4 years ago

The key is to get this up as soon as possible, and the planning phase is going to take a while. So, community members (especially those with the needed background): how would you create a system that could handle these needs?

@CSSEGISandData would you support something like this? Would you use it? Could you help the community develop this? Help through the planning stage would be the best as we know you have a lot to do.

Let's not flood this with comments. Show support via emoji...

:heart: - Can help develop this project
:eyes: - Can help retrieve, enter, and verify data
:rocket: - Can help retrieve, enter, and verify data in a different country

zdavatz commented 4 years ago

I am also interested in doing the same for the county level (Kantone) in Switzerland. I am willing to help. Zürich has already put up its county data at https://opendata.swiss/dataset/covid_19-fallzahlen-kanton-zuerich

ciscorucinski commented 4 years ago

...without overloading the curators of the original repo.

@ecam85 Yes, I have made all of my suggestions with this idea in mind, and I've noted several times the constraints on their capacity and the work they already have to do. That is clear.

Could this be done via forks (and maybe pull requests)?

I think this might have to be a new tool. GitHub workflows don't seem like the right tool for this verification process. The tool would help @CSSEGISandData get all the needed data in a way that lets them feel confident in its accuracy without having to put in a lot of time. So it's kind of a gateway for entering data into the CSV files.

One thing that came to mind when thinking through this idea was Stack Overflow's triage and review queues (as an idea to build off of).

ciscorucinski commented 4 years ago

how would you create a system that could handle these needs?

@ecam85 In my 1st comment, I posted this. I have some ideas, but that is it. There are people way smarter than me, so I am asking the community that would participate in this effort. If we can't get this going, then the overall effort would probably fail anyway.

Let's not flood this with minutiae. This is the planning stage. Please re-read if needed, and let's get some ideas going...

zdavatz commented 4 years ago

OK, thanks to https://github.com/daenuprobst we now have all the Swiss data in a CSV file. He is grabbing the data from the BAG Twitter feed and then publishing it on GitHub: https://github.com/daenuprobst/covid19-cases-switzerland

nognkantoor commented 4 years ago

Where/how do you obtain the data at the county level? Especially since there are several instances of the same individual being claimed by two or more county health departments and recorded as being in their county.

saralioness commented 4 years ago

@ciscorucinski If we are triaging and verifying before commits anyway, I would suggest getting a group together in a shared Google Drive/Sheet. It works extremely well for collaboration, even in large enterprises, and it would be the quickest to get up and running. I agree with having ambassadors assigned to specific regions to focus on case monitoring instead of trying to compile data for the entire world. I'm in the DC area; I can take the US East Coast if we go that route. I do think that having the data at the county level is extremely valuable, and I'm willing to pitch in on this effort.

zdavatz commented 4 years ago

What is also really important is the data you collect: age, gender, and date/place of first contact.

longsyntax commented 4 years ago

I agree. The additional insight provided by county-level statistics is invaluable, especially for folks highly vulnerable to COVID-19. I'm based out of the tri-state area, but I'm willing to take up curation of this data for any of the US states.

The map on the CDC website has hyperlinks to each state's Department of Health website, which usually houses these county-level stats (some states, like CA, require you to visit the individual county's website for the stats): https://www.cdc.gov/coronavirus/2019-ncov/cases-in-us.html#reporting-cases

I'm sure once a few of us get this up and running, more people will reach out to collaborate and share responsibility.

becare-rocket commented 4 years ago

With JHU's permission, we could fork the database and have the community maintain it, but that's kind of a waste of time. I suggest contacting JHU to make a donation and asking them to use the donation to maintain whatever level of detail you need. If each user gave, say, $50, and there appear to be at least 300 users, that is $15K. They probably need about 3x that. I think they need 2 to 3 full-time people until the pandemic peaks, spread out over time zones to the extent it is practical (i.e., a 4 AM to noon shift, a noon to 8 PM shift, and an 8 PM to midnight shift). Or another government or non-profit body could agree to collaborate and provide someone in their time zone to maintain the data. It's around a 3-person job per day, full time, including weekends.

becare-rocket commented 4 years ago

Just adding to this: Wikipedia maintains essentially the same data using community maintenance. It is up to date and pretty decent, but it is not in a time-series format. If they can do it, it can be done with a GitHub fork.

lazd commented 4 years ago

@ciscorucinski I think the answer lies in scraping official sources, rather than fielding reports from news articles. What do you think of the following?

  1. Let's begin by compiling a list of sources: a CSV or a Google Doc with each county in a given state (or the state itself, if it has a webpage with all counties listed) and a source webpage.

  2. After that, we'll have to write and maintain scraper rules to pull the data from each of the websites. That's happening here https://github.com/lazd/coronadatascraper

  3. Finally, we can combine these into a single repository that pulls this data on a regular schedule, reports errors to the right person if it fails, and publishes the data when complete. That's also happening here, but it's not automatic yet https://github.com/lazd/coronadatascraper

On the pages I've checked so far, it seems only positive cases and deaths are reported, so we won't be able to get recovered (or, consequently, active) numbers...

I'm going to start on this by trying to get sources for California together.

Edit: I've gathered all the resources I was using. Here's what I've got so far: https://github.com/lazd/coronavirus-data-sources .

ciscorucinski commented 4 years ago

@lazd Doing a county-by-county scraping effort is going to be a lot of work, and there is no guarantee that the layout and format will stay the same for each county. Also, we can't get the county website until they get their 1st case.

I was looking at Wisconsin's Department of Health Services, and they provide a list of news releases. Does the state of California have a similar list in one place? That seems more reasonable for sources, if available, no?

Edit: Forgot to add link... https://www.dhs.wisconsin.gov/outbreaks/index.htm

ciscorucinski commented 4 years ago

@ciscorucinski If we are triaging and verifying before commits anyway, I would suggest getting a group together in a shared google drive/sheet. It works extremely well even for large enterprises to collaborate and it would be the quickest to get up and running.

I agree. A Google Sheet could be created to handle this. With specific roles and data protection in place, it could be opened to many. But what's a good format/layout for the Sheet?

zdavatz commented 4 years ago

This is how murtaman in the UK does it: https://docs.google.com/spreadsheets/d/1eTKeK9vRxgw0KhvKxPCaDrfaHnxQP-n9TsLzsEymviY/edit#gid=0 - personally, I like the layout.

longsyntax commented 4 years ago

I went through and identified source URLs for a few states whose data is all in one place for all their counties. Unfortunately, I don't have the scraping expertise, but I'm more than happy to help with anything else I can.

With regard to identifying sources, let's figure out how best to divide this up so we aren't duplicating efforts.

lazd commented 4 years ago

Doing a county-by-county scraping effort is going to be a lot of work, and there is no guarantee that the layout and format will stay the same for each county. Also, we can't get the county website until they get their 1st case.

@ciscorucinski All true. I started work on a scraper that basically has a custom function that gets run against the body of each website: https://github.com/lazd/coronadatascraper/blob/master/scrapers.js

I've written a few scraper functions already, and it produces something like this:

[
  {
    cases: 21,
    deaths: 0,
    county: 'San Francisco County',
    state: 'CA',
    country: 'USA',
    url: 'https://www.sfdph.org/dph/alerts/coronavirus.asp'
  },
  {
    cases: 20,
    deaths: 0,
    county: 'San Mateo County',
    state: 'CA',
    country: 'USA',
    url: 'https://www.smchealth.org/coronavirus'
  },
  {
    cases: 3,
    county: 'Sonoma County',
    state: 'CA',
    country: 'USA',
    url: 'https://socoemergency.org/emergency/novel-coronavirus/novel-coronavirus-in-sonoma-county/'
  },
  {
    cases: 7,
    county: 'Santa Cruz County',
    state: 'CA',
    country: 'USA',
    url: 'http://www.santacruzhealth.org/HSAHome/HSADivisions/PublicHealth/CommunicableDiseaseControl/Coronavirus.aspx'
  }
]

Like you said, it will not be consistent; it'll have to be done on a case-by-case basis and maintained if the county changes their website. This may not be sustainable, but it's the best shot we have.
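To give a flavor, here's a rough sketch of what one of these custom rules can look like. The URL, selectors, and page structure below are made up; every real rule differs per site:

const cheerio = require('cheerio');
const fetch = require('node-fetch');

// one hypothetical rule: a location plus a scrape() that turns its page into counts
const scraper = {
  county: 'Example County', // made up
  state: 'CA',
  country: 'USA',
  url: 'https://example.gov/coronavirus', // made up source page
  async scrape() {
    const html = await (await fetch(this.url)).text();
    const $ = cheerio.load(html);
    // assumes the page renders counts like <span id="case-count">21</span>
    return {
      cases: parseInt($('#case-count').text(), 10),
      deaths: parseInt($('#death-count').text(), 10),
      county: this.county,
      state: this.state,
      country: this.country,
      url: this.url,
    };
  },
};

scraper.scrape().then(console.log);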

I am going to add headless Chrome for sites that need JavaScript, and will work out a way to capture the states from @longsyntax's list that have aggregate data.

ciscorucinski commented 4 years ago

@longsyntax It seems like state data for counties falls into 3 categories: available on a single webpage, available via links within a single webpage, and not aggregated.

My Wisconsin link above would be the 2nd case, where extra effort would need to be made.

DavidGeeraerts commented 4 years ago

Everyone needs to bug their state health departments to use standard (best-practice) HTML tags, specifically the TABLE tag, so that we don't all have to come up with one-off scrapers for all these sites. I'm bugging WDOH.
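If sites did that, one generic table parser could replace most of the one-off scrapers. A rough sketch, assuming a hypothetical page whose first table has County/Cases/Deaths header cells:

const cheerio = require('cheerio');

// parse the first <table> on the page into one record per row, keyed by header text
function parseCountyTable(html) {
  const $ = cheerio.load(html);
  const headers = $('table th')
    .map((i, el) => $(el).text().trim().toLowerCase())
    .get();
  return $('table tbody tr')
    .map((i, row) => {
      const cells = $(row).find('td').map((j, el) => $(el).text().trim()).get();
      const record = {};
      headers.forEach((h, j) => { record[h] = cells[j]; });
      return record;
    })
    .get();
}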

DavidGeeraerts commented 4 years ago

@ciscorucinski Seems a Slack instance would be super helpful if there's a coordinated effort to get County level data.

lazd commented 4 years ago

@DavidGeeraerts yes, let's get one up and running... Or maybe Discord, since it'll keep our chat history (unless someone at Slack wants to give us a free instance?)

ciscorucinski commented 4 years ago

It's late here. Here is a Google Sheet that can be expanded on. It's public and editable for now.

https://docs.google.com/spreadsheets/d/1T2cSvWvUvurnOuNFj2AMPGLpuR2yVs3-jdd_urfWU4c/edit?usp=sharing

DavidGeeraerts commented 4 years ago

Slack instance has been created, see ticket 658

lazd commented 4 years ago

@ciscorucinski nice work. I think this will be much easier than trying to work in Git, especially with people making contributions from all over.

I created one for county sources and added the data I have so far: https://docs.google.com/spreadsheets/d/1T2cSvWvUvurnOuNFj2AMPGLpuR2yVs3-jdd_urfWU4c/edit#gid=1477768381

ciscorucinski commented 4 years ago

Feel free to modify it as you see fit. This was just a quick setup with the data above.

Right now, the doc is freely open and anyone can edit. Should I put some restrictions in place and add people via email?

lazd commented 4 years ago

@ciscorucinski I think it's fine to be open for now. I've run out of data sources to scrape and need more web resources for counties across America. I currently have data scraped for 51 counties (see coronadatascraper).

ciscorucinski commented 4 years ago

@CSSEGISandData Can you pin this issue?

You are able to pin 3 important issues in the Issues tab. This community-driven effort might be a good candidate for pinning. I say this because it is already buried several pages into the results (page 4), so new people will have a hard time finding it.

CC @saralioness

https://help.github.com/en/github/managing-your-work-on-github/pinning-an-issue-to-your-repository

ssljivar commented 4 years ago

Today an e-mail showed up in my inbox with a link to this site:

https://www.worldometers.info/coronavirus/

Their numbers at the moment seem much more up to date than the numbers provided by the Johns Hopkins team. Let's hope that the Johns Hopkins team is busy resolving the reporting delay and the recent geographical data inconsistencies (changing county > state granularity, changing country names for politically correct reasons and then back).

shamilovtim commented 4 years ago

I am here if you guys need a maintainer for Michigan, USA

shamilovtim commented 4 years ago

Michigan data: https://www.michigan.gov/coronavirus

enrichman commented 4 years ago

Hi guys, I've seen some glitches in the Italian data and found this issue. Our government has published a repo with the official data, also divided by regions (like the states in the US or the regions in China).

Is this the right PR/issue for contributing to this? 🇮🇹

I have a repository where I scrape this and the other repo to create some JSON files. Maybe I can do something similar to merge them here or in the Google Doc.

raysalem commented 4 years ago

For the US, you might want to scrape this page: https://www.cdc.gov/coronavirus/2019-ncov/cases-in-us.html#investigation

From issue 653, a response from https://github.com/heavenly-star: https://github.com/CSSEGISandData/COVID-19/issues/653#issuecomment-598869502

klartext commented 4 years ago

Local data could go into its own git repos. @zdavatz mentioned the Kanton Zürich page; if you look there, it links to https://github.com/openZH/covid_19

So it would be a huge step if all the data were available in its own repos. These could then be used as git submodules by @CSSEGISandData to collect the data, as sketched below.
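For example, pulling the openZH repo in as a submodule would look something like this (the destination path is just illustrative):

git submodule add https://github.com/openZH/covid_19 data/che-zh
git submodule update --remote   # refresh submodules to their latest upstream commits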

I think the biggest problem is that official data sources need to provide their data as CSV or, better, as git repos (hosted on GitHub or elsewhere). But it looks like the git universe is something most health care officials have never heard of.

aske-cph commented 4 years ago

Why are people trying to separate US and world data? We are one world. We need one nice big data source, not fifteen projects and sheets. I think it's relatively simple:

1) Make a new repo with all of the historical data from here, cleaned and ordered (no more random mangling each day).
2) Create a simple data crawler that runs each hour and updates all repo data from a list of each country's official pages (countries without official pages can be added by hundreds of people eager to participate), with extra loops for US state resolution or local resolution in other countries.
3) Let people quickly edit errors and add higher resolution in the issue tracker (if designs change, higher resolution is added, or a site has stopped working).
4) Crawling is relatively easy and can be done in all kinds of languages (Python, PHP, whatever); we just need a list of all countries and links to each government/institutional site (most countries have pages that are relatively static in design).

So basically the valuable part of the project is threefold:

1) Non-mangled historical data in a standardised format.
2) A list of links to all countries' official counting pages, with pointers to where in the HTML the data is located for the crawler.
3) A community effort to clean and update daily.

Is this unfeasible? It seems simple, and the only really important thing is to choose a format and never change it, because all subscribers will be in production.
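To make point 2 concrete, here's a minimal sketch of the hourly crawler loop, assuming each entry in the community-maintained list carries a URL and a CSS selector (every entry below is hypothetical):

const cheerio = require('cheerio');
const fetch = require('node-fetch');

// hypothetical source list: one official page and selector per country
const sources = [
  { country: 'Denmark', url: 'https://example.dk/corona', casesSelector: '#total-cases' },
];

async function crawl() {
  for (const { country, url, casesSelector } of sources) {
    const html = await (await fetch(url)).text();
    const $ = cheerio.load(html);
    // strip non-digits to survive formats like "1.234" or "1,234"
    const cases = parseInt($(casesSelector).text().replace(/\D/g, ''), 10);
    console.log(country, cases);
  }
}

crawl();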

lazd commented 4 years ago

@aske-cph I'm working on a modular scraper that could do just that. I started with US counties because I live in the US, but I've left a country field in everything for exactly the reason you said -- we are one world

Meanwhile, all of the sources are being gathered by the community here: https://docs.google.com/spreadsheets/d/1T2cSvWvUvurnOuNFj2AMPGLpuR2yVs3-jdd_urfWU4c/edit#gid=0

Next steps are:

  1. Continue to add county data from official sources
  2. Begin adding country data from official sources
  3. Continue to add official sources to the sheet above
  4. Perform releases on github
  5. Set up CI to run automatically
  6. Add more robust testing and validation
  7. Combine with GeoJSON and population data from https://github.com/lazd/coronavirus-data-sources

You're absolutely welcome to contribute; we need all the help we can get. What can you do?

snewpit commented 4 years ago

I found this thread accidentally while trying to look for solid data sources on precise case locations. Everything I find is country-level or lacking a lot of detail.

The reason is that I built a platform over the past year for situations like this: essentially a crowd-sourced news-sharing platform. It is map-based so you can easily see what is relevant to you. The platform has other small features, like sending notifications to users within a 5 km radius of posts (if turned on) and the ability to add POIs so you get notifications for set locations, not just your current location.

My team and I have started crawling the net for location-specific data, but there are only 3 of us and we can't find the data. If someone can point me in the right direction for the data, we're happy to populate this ourselves. Or, if we can get a community effort together to map the data (the platform allows users to post themselves), then people can comment with any updates and also validate with upvotes/downvotes.

Platform is available on web, iOS and Android apps - www.snewpit.com

Let me know your thoughts

lazd commented 4 years ago

Can someone figure out how to get data out of these sites that have proper ArcGIS/ESRI maps? ArcGIS: https://www.nj.gov/health/cd/topics/covid2019_dashboard.shtml ESRI: https://www.ncdhhs.gov/covid-19-case-count-nc

I'm sure we can somehow go straight to the data source; I just haven't had time to look into it. Please reply with insights!

Edit: @mark-otaris figured it out here https://github.com/lazd/coronadatascraper/issues/1#issuecomment-599014225!
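For anyone hitting similar dashboards: these maps are typically backed by an ArcGIS Feature Service whose REST endpoint can be queried directly. A sketch, with a made-up service URL but standard ArcGIS REST query parameters:

const fetch = require('node-fetch');

// query every feature's attributes from an ArcGIS Feature Service layer
async function queryArcgis(layerUrl) {
  const url = `${layerUrl}/query?where=1%3D1&outFields=*&f=json`;
  const { features } = await (await fetch(url)).json();
  return features.map((f) => f.attributes);
}

// hypothetical service URL; the real one shows up in the dashboard's network requests
queryArcgis('https://services.arcgis.com/XXXX/arcgis/rest/services/COVID19/FeatureServer/0')
  .then((rows) => console.log(rows.length, 'rows'));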

lazd commented 4 years ago

@zdavatz I added the Zurich CSV to coronadatascraper.

@enrichman I've added the Italian data to coronadatascraper.

lazd commented 4 years ago

With the help of a handful of super awesome folks, after about 15 straight hours of work, I've released a dataset that includes county information scraped from government websites: http://blog.lazd.net/coronadatascraper/

The initial release contains:

  • 565 total regions
  • 117 countries
  • 113 states
  • 335 counties
  • GeoJSON features for 531 out of 565 regions
  • Population data for 527 out of 565 regions
  • CSV and JSON files
  • Data scraped from government or official sources, cited within the data
  • The existing JHU data for countries, Australia, China, and a few other random island territories

All the scraper code is on Github and can accept pull requests and new scrapers: https://github.com/lazd/coronadatascraper

The data has the following fields (some empty for certain locations):

  • city - The city name
  • county - The county or parish
  • state - The state, province, or region
  • country - The country name (currently mixed; will be normalized to ISO 3166-1 alpha-3 country codes, https://github.com/lazd/coronadatascraper/issues/8)
  • cases - Total number of cases
  • deaths - Total number of deaths
  • recovered - Total number recovered
  • tested - Total number tested
  • population - The estimated population of the location
  • lat - Latitude (in CSV only)
  • long - Longitude (in CSV only)
  • coordinates - Array of coordinates [longitude, latitude] (in JSON only)
  • featureId - The index of the location in the features.json GeoJSON FeatureCollection array
  • url - The source of the data

There is no time series data -- yet https://github.com/lazd/coronadatascraper/issues/9

Scrapers can pull JSON, CSV, or good ol' HTML down and are written in a sort of modular way, with a handful of helpers available to clean up the data. Scrapers can pull in data for anything -- cities, counties, states, countries, or collections thereof. See the existing scrapers for ideas on how to deal with different ways of data being presented, and see the scraper contributing information to get started writing your own.

Your help is needed!

Many sites can't be scraped yet because they're behind an Incapsula CDN or require JavaScript. Hopefully someone can get in there and add headless Chromium so we can scrape them: https://github.com/lazd/coronadatascraper/issues/10

Yes, some of these scrapers will break as sites are updated. Validation for missing/broken data is in place, and there's a plan to diff against previous output to warn if data was lost: https://github.com/lazd/coronadatascraper/issues/6.

It doesn't deploy automatically yet; maybe someone can set up CI and cron jobs (https://github.com/lazd/coronadatascraper/issues/5) and make it deploy automatically to Github releases (https://github.com/lazd/coronadatascraper/issues/3)?

Of course, it's not complete -- yet. With your help writing scrapers, fixing bugs, and adding features, we can make the most complete dataset available. There are plenty of open issues (https://github.com/lazd/coronadatascraper/issues) and plenty of places that need scrapers written for them, so hop right in.

Thank you, JHU!

This would not have been possible without this project, @CSSEGISandData's COVID-19 repository; in fact, some of its data is still used. JHU, YOU ROCK!

tmeacham commented 4 years ago

With the help of a handful of super awesome folks, after about 15 straight hours of work, I've released a dataset that includes county information scraped from government websites: http://blog.lazd.net/coronadatascraper/ [...]

This is cool for folks who don't need a timeseries.

ssljivar commented 4 years ago

@lazd This is excellent work, thanks to you and the team that worked on this!

Regarding the time series data, may I suggest that you add a column to your dataset indicating the date and time your scrapers detected a change on the source website, and then insert a new row into your dataset every time such a change is detected, rather than overwriting the previous row for that geography. With that, your dataset could readily be converted into a time series dataset via a simple post-processing step that converts cumulative values into incremental ones.
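As a sketch of that post-processing step (field names hypothetical), differencing consecutive cumulative values per geography gives the increments:

// rows: change-detection output, sorted by detection time, one row per detected change
function toIncremental(rows) {
  const lastSeen = new Map(); // geography key -> last cumulative case count
  return rows.map((row) => {
    const key = `${row.country}|${row.state}|${row.county}`;
    const previous = lastSeen.get(key) || 0;
    lastSeen.set(key, row.cases);
    return { ...row, newCases: row.cases - previous };
  });
}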

I have implemented similar logic against @CSSEGISandData's datasets using a proprietary data prep tool called Alteryx. I would be happy to share my Alteryx files and the supporting geographical "recode" files that I found necessary for isolating my analysis from the various changes to geographical attributes. My Alteryx workflows process the entire @CSSEGISandData dataset nightly and produce a fresh version of the time series dataset. I am attaching a current sample of this time series dataset in CSV, JSON, and AVRO formats, aggregated to the country and state/province geographical granularity:

Covid-19 Time Series Dataset as of 2020-03-19.zip

I would be happy to assist with blending the @CSSEGISandData dataset to date with your dataset to produce a time series dataset. I could easily do this in Alteryx, and once the logic of the Alteryx workflows is stable, we could convert them to something open/non-proprietary like Python with Pandas.

You can check out my Tableau Public profile for the dashboards that I built solely using this version of the time series dataset: https://public.tableau.com/profile/slaven.sljivar#!/

greg-minshall commented 4 years ago

I agree that dates are very important, as is keeping previous rows with earlier dates. Thanks! (So far I'm sticking with JHU data, but...)

lazd commented 4 years ago

Thanks @greg-minshall. I just pushed a feature that can go back in time to re-generate the dataset. It works by using a cache from a repository, if available, or, if the scraped page has a timeseries, by pulling from that. I will be able to populate JHU data from past days, but it's going to take some massaging/normalizing to get their county data in line, so it will take some time.

Luckily, all of this can be done retroactively, so I'll soon be able to get parity.

lazd commented 4 years ago

@ssljivar I believe we can do this by pulling timeseries data and making scrapers date-aware. I already did this for the JHU data, though it's skipping everything but countries (and Australia, China). See https://github.com/lazd/coronadatascraper#re-generating-old-data

As mentioned in my previous comment, with a little normalization, we should be able to generate a consistent dataset retroactively.

greg-minshall commented 4 years ago

@lazd Thanks. A problem with a yarn or git or ... approach is that if past data changes, one may miss it unless one does the yarn dance over and over again. It's nice to just do "git pull" and get all the dailies. (I mean, beggars shouldn't be choosers, definitely!)

I think it might be nice for scrapers to report two dates: the date/time of the scrape, and the date/time (if available from the source) that the source claims.
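Concretely, each record might carry both, something like this (field names hypothetical):

const record = {
  county: 'San Francisco County',
  state: 'CA',
  country: 'USA',
  cases: 21,
  scrapedAt: '2020-03-19T04:12:00Z', // when the scraper actually fetched the page
  sourceDate: '2020-03-18',          // the "as of" date the source itself claims, if any
};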

lazd commented 4 years ago

@greg-minshall Yes, I'm releasing the data on GitHub Pages and GitHub releases, so you'll always be able to get the latest data just by hitting http://blog.lazd.net/coronadatascraper/data.csv or data.json. Once fully generated and normalized, I will include the timeseries data in releases going forward as data-2020-3-12.csv, etc. To get old timeseries data, you'll simply pull the latest release (you won't have to pull old releases to get old data; it will be republished every time).
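So consuming the latest data could be as simple as, say:

const fetch = require('node-fetch');

fetch('http://blog.lazd.net/coronadatascraper/data.json')
  .then((res) => res.json())
  .then((locations) => console.log(locations.length, 'locations'));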

Getting the date/time from the source is non-trivial and very inconsistent... though I agree it would be nice to store the date/time of the scrape. Caching complicates that a bit, but it is possible... See https://github.com/lazd/coronadatascraper/issues/18

ciscorucinski commented 4 years ago

What information should be highlighted at the top of this post for newcomers and people trying to parse this comment thread?

Or I can just highlight the Google Doc, and we can keep a list of important things people can do in that Google Doc.

lazd commented 4 years ago

@ciscorucinski if you would point folks to the coronadatascraper project, that would be awesome.

We've had scrapers contributed by 3 people besides myself, and others are actively contributing data validation and verification. I'm currently working on merging timeseries data with normalized JHU data and just now cracked it (producing data in Tidy, JSON, and simple CSV formats); I will be verifying and publishing soon. We're up to 666 individual regions, and more are coming constantly. It's a thing now.

pgyefax commented 4 years ago

@lazd I found this while looking for US county data. I don't have the technical skills to assist, but I wanted to offer the following sites and data sources in case they're useful. Thank you for working on this. https://hgis.uw.edu/virus/ https://coronavirus.1point3acres.com/en

greg-minshall commented 4 years ago

@lazd Thanks. If the source provides anything, however bogus, in the way of an "as of" date or "published" date, I think it would be good to include it in the record. It's easier to throw things away than to cons them up. (Though, admittedly, there's a danger of people, like me, using them even when they are very bogus.) Thanks for thinking of at least having the scrapers include their scraping time in each record -- that's a win. Cheers.