CEIDatUGA / COVID-19-DATA

GNU General Public License v3.0

State fatalities #4

Closed jdrakephd closed 4 years ago

jdrakephd commented 4 years ago

Do we have a dataset for state fatalities? If not, can we scrape this from Wikipedia too?

lsalvador commented 4 years ago

The US state cases file has number of daily deaths per state: https://github.com/CEIDatUGA/COVID-19-DATA/blob/master/UScases_by_state_wikipedia.csv

renikaul commented 4 years ago

Looking at the US state cases file, we have daily deaths for the whole country, not by state. @lsalvador am I missing something? @jdrakephd do you want (i) a total for state fatalities, (ii) state fatalities by date, or (iii) a running total of state fatalities by date? I can get you (i) within 30 min, but the others will take longer.
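To make the three options concrete, here is a minimal Python sketch. It is not the repo's code; the record layout `(date, state, new_deaths)` and all the numbers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical daily records: (date, state, new_deaths). Values are invented.
daily = [
    ("2020-03-14", "GA", 1),
    ("2020-03-15", "GA", 0),
    ("2020-03-16", "GA", 2),
    ("2020-03-15", "WA", 3),
    ("2020-03-16", "WA", 5),
]

# (ii) state fatalities by date: {state: {date: deaths reported that day}}
by_date = defaultdict(dict)
for date, state, n in daily:
    by_date[state][date] = by_date[state].get(date, 0) + n

# (i) one total per state
totals = {state: sum(days.values()) for state, days in by_date.items()}

# (iii) running total per state, accumulated in date order
running = {}
for state, days in by_date.items():
    cumulative = 0
    running[state] = {}
    for date in sorted(days):
        cumulative += days[date]
        running[state][date] = cumulative
```

Option (ii) is the most useful one to store, since (i) and (iii) can both be derived from it.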

lsalvador commented 4 years ago

@renikaul you are absolutely right, we only have the totals per day in that table. I got confused with the other table on the wiki page.

renikaul commented 4 years ago

I'm looking at your code right now for future daily scraping... I think I have it figured out so we can get fatalities by state. Can you see if you can modify the Wayback Machine part of the world wiki code to get the fatalities by state by date breakdown?

jdrakephd commented 4 years ago

I was thinking fatalities by day by state. I don't need it for anything I'm doing right at the moment (cumulative fatalities are already in the table we have) but I do look at it frequently to ballpark things like underreporting rate or date of first cases... because I think deaths are better reported than anything else. I think it makes sense for us to start working toward a proper analysis of such data.

renikaul commented 4 years ago

ok. I will dig into this more tomorrow afternoon.

@lsalvador I'm comparing the numbers in the table output by the script with what is displayed on the webpage. For some states, the total in the table output and the "Cases" column of the wiki page don't match. I think these numbers should match? I also can't figure out how the HTML source is capturing fatalities by state: there is a column name, but no data. Any ideas?

This is my first experience with html scraping so I'm a bit slow.
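A check like this one can be automated. The sketch below is only an illustration in Python: `find_mismatches` is a hypothetical helper, and the counts are invented rather than taken from the script output or the wiki page.

```python
def find_mismatches(stored, displayed):
    """Return {state: (stored_value, displayed_value)} for every state whose
    count differs, or which appears in only one of the two tables."""
    mismatches = {}
    for state in sorted(set(stored) | set(displayed)):
        s = stored.get(state)      # None if the state is missing from our table
        d = displayed.get(state)   # None if the state is missing from the page
        if s != d:
            mismatches[state] = (s, d)
    return mismatches


# Invented example counts, keyed by state.
stored = {"GA": 56, "AL": 39, "AK": 3}
displayed = {"GA": 25, "AL": 39, "AZ": 13}

mismatches = find_mismatches(stored, displayed)
for state, (ours, theirs) in mismatches.items():
    print(f"{state}: stored={ours} displayed={theirs}")
```

States missing on one side show up with `None`, which helps separate scraping gaps from genuine disagreements in the numbers.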

rlrichards commented 4 years ago

@lsalvador if you are going to build something for states off my world Wayback Machine scraping code, let me know, as there are a few kinks I had to work out that, while commented, may not make sense to anyone but me.

lsalvador commented 4 years ago

@renikaul I see what you mean, we had more cases in our stored table than in the wiki. As an example, yesterday at 21:21 there were 56 new cases in GA and today the number is 25. Could have been a typo. I have just updated the table and the numbers are matching. However, let me know if you find anything else that is not quite right.

Regarding the fatalities table by state, I can quickly extend the code to incorporate it and output it on a daily basis. The HTML scraping involves a few steps that are sometimes not clear (it took a bit of trial and error), but once that is done, it is quite straightforward. Let me know if you want to go through the code together.
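For anyone new to HTML scraping, the core step can be sketched with Python's standard library alone. This is not the code in this repo; `TableParser` and the sample HTML are hypothetical, and the real wiki markup is considerably messier. The empty cell shows how a column can have a name but no data.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample; the last cell is deliberately empty.
sample_html = """
<table>
  <tr><th>state</th><th>cases</th><th>deaths</th></tr>
  <tr><td>Alabama</td><td>39</td><td>0</td></tr>
  <tr><td>Alaska</td><td>3</td><td></td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
```

Each entry in `records` maps column names to cell text, so an empty string flags a cell the page left blank.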

lsalvador commented 4 years ago

The other table on the wiki only has total deaths per state. The only information we can get daily from the wiki is the number of cases. However, the COVID Tracking Project has daily information on deaths, but as Robbie mentioned, it only starts reporting on March 4th. Maybe by combining this file with the US line list info collected by David and Paige, we will be able to build a complete dataset.
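If the COVID Tracking Project `death` column is a running total per state (an assumption here), daily deaths can be derived by differencing. The Python sketch below is illustrative only: `daily_from_cumulative` is a hypothetical helper, the rows are invented, and treating a blank cell as "no change reported" is a modelling choice, not something the data source guarantees.

```python
def daily_from_cumulative(rows):
    """rows: iterable of (date, state, cumulative_deaths or None).

    Returns {state: {date: new_deaths}}. A missing value (None) is treated as
    "no change since the last report" -- an assumption, not a documented rule.
    """
    daily = {}
    previous = {}  # last known cumulative count per state
    for date, state, cumulative in sorted(rows, key=lambda r: (r[0], r[1])):
        if cumulative is None:
            cumulative = previous.get(state, 0)
        daily.setdefault(state, {})[date] = cumulative - previous.get(state, 0)
        previous[state] = cumulative
    return daily


# Invented rows in (date, state, death) form; None stands for a blank cell.
rows = [
    ("20200315", "AL", None),
    ("20200316", "AL", 0),
    ("20200317", "AL", 1),
    ("20200318", "AL", 3),
    ("20200316", "AK", None),
    ("20200317", "AK", None),
]

daily_deaths = daily_from_cumulative(rows)
```

A negative difference would indicate a downward revision in the source and is worth flagging rather than silently keeping.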

Example of COVID Tracking Project US daily info:

| date | state | positive | negative | pending | death | total | dateChecked |
|---|---|---|---|---|---|---|---|
| 20200316 | AK | 1 | 143 | | | 144 | 2020-03-16T20:00:00Z |
| 20200316 | AL | 28 | 28 | 40 | 0 | 96 | 2020-03-16T20:00:00Z |
| 20200316 | AR | 22 | 132 | 14 | | 168 | 2020-03-16T20:00:00Z |
| 20200316 | AZ | 18 | 182 | 63 | 0 | 263 | 2020-03-16T20:00:00Z |

lsalvador commented 4 years ago

In case it is still useful, I have the code to extract the totals from the wiki page in the format below. Let me know if this table and code should be pushed to GitHub.

| state | cases | recovered | deaths | remaining |
|---|---|---|---|---|
| Alabama | 39 | 0 | 0 | 39 |
| Alaska | 3 | 0 | 0 | 3 |
| American Samoa | 0 | 0 | 0 | 0 |
| Arizona | 13 | 1 | 0 | 12 |

jdrakephd commented 4 years ago

Liliana,

@lsalvador I think you are saying that Wikipedia doesn't have deaths by state by day, but that's not correct. When I view the Wikipedia page, there is a table for fatalities just like the one for cases.

lsalvador commented 4 years ago

@jdrakephd The only table I see is the one with total information. Can you still see the daily one in your browser?

jdrakephd commented 4 years ago

It disappeared when I refreshed my browser, but it had historical data back to the beginning. It has to be somewhere on the Internet, perhaps via the Wayback Machine.

lsalvador commented 4 years ago

Found it. Version from March 16 @ 13:21
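If the Wayback Machine route is used to pin a version like this, the snapshot address can be built from a timestamp. A minimal Python sketch follows; the page URL is a placeholder (the thread does not give the actual wiki URL), and reading 13:21 as UTC is an assumption.

```python
from datetime import datetime
from urllib.parse import urlencode


def wayback_snapshot_url(page_url, when):
    """Wayback Machine address for `page_url` at `when`; the archive serves
    the capture closest to the requested 14-digit timestamp."""
    return f"https://web.archive.org/web/{when:%Y%m%d%H%M%S}/{page_url}"


def wayback_availability_query(page_url, when):
    """Query URL for the archive's availability endpoint, which reports the
    closest capture as JSON. Nothing is fetched here."""
    params = {"url": page_url, "timestamp": f"{when:%Y%m%d%H%M%S}"}
    return "https://archive.org/wayback/available?" + urlencode(params)


# Placeholder page; the time zone of "13:21" is assumed to be UTC.
page = "https://en.wikipedia.org/wiki/Example"
when = datetime(2020, 3, 16, 13, 21)

snapshot = wayback_snapshot_url(page, when)
query = wayback_availability_query(page, when)
print(snapshot)
print(query)
```

Storing the exact snapshot URL alongside the scraped table makes it possible to re-run the scrape against the same page version later.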

anabento commented 4 years ago

This page has tracked all deaths

https://www.google.com/amp/s/amp.cnn.com/cnn/2020/03/17/health/coronavirus-united-states-deaths/index.html

And it gets updated; it just needs a good scrape.

Hope this helps. Ana

Sorry for being terse: Fat fingers ∪ small keys

lsalvador commented 4 years ago

@jdrakephd, I added a daily fatalities by state table to the GitHub data repo. It still pulls data from the old version of the wiki. I will keep an eye out for an updated version, and the README file will be updated shortly.

@anabento, that information is very useful to have - thank you

renikaul commented 4 years ago

@jdrakephd I think @lsalvador posted the data you were looking for. I will close the issue. Please re-open if needed.