CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

[SOLUTION] - Pandas + Python - Pulls daily data for Recoveries @worldmeters.info and transforms in @JHU standards #1642

Open r-lomba opened 4 years ago

r-lomba commented 4 years ago

For Pandas + Python users, I propose the following code snippet to seamlessly integrate daily recoveries data from worldometers.info into your legacy code that consumes the recoveries stats @JHU (as we all know, that data stream is now discontinued and we are left on our own, so to speak)


#############################################################
##
## PRELIMINARY OPERATIONS
##
#############################################################

# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import dateutil
import time
import copy
import requests
from datetime import datetime, timedelta, date

#############################################################
##
## HARDCODED DICTIONARY - FIXES FOR COUNTRIES THAT ARE NAMED
## DIFFERENTLY IN WORLDOMETER DATA THAN IN JHU DATA.
## HERE WE RENAME COUNTRIES IN THE RECOVERED DATAFRAME FROM WORLDOMETER
## (ON THE LEFT HAND SIDE) TO MAKE THE COUNTRY NAMES MATCH
## THOSE USED AT JHU (ON THE RIGHT HAND SIDE)
##
## PLEASE NOTE THAT FOREIGN TERRITORIES OF MAIN COUNTRIES ARE
## RENAMED AS THE MAIN COUNTRY ITSELF: THIS IS BECAUSE WE ASSUME STATS
## ARE AGGREGATED AT COUNTRY LEVEL. YMMV. ADJUST ACCORDINGLY IF NEEDED
##
#############################################################

# DEFINES COUNTRY NAMES FIXES DICTIONARY
countries_fixes_dict = {'CAR': 'Central African Republic',
                        'Congo': 'Congo (Brazzaville)',
                        'DRC': 'Congo (Kinshasa)',
                        'Diamond Princess': 'Cruise Ship',
                        'Ivory Coast': 'Cote d\'Ivoire',
                        'S. Korea': 'Korea, South',
                        'St. Vincent Grenadines': 'Saint Vincent and the Grenadines',
                        'Taiwan': 'Taiwan*',
                        'UAE': 'United Arab Emirates',
                        'UK': 'United Kingdom',
                        'USA': 'US',
                        'Vatican City': 'Holy See',
                        'Aruba': 'Netherlands',
                        'Bermuda': 'United Kingdom',
                        'Cayman Islands': 'United Kingdom',
                        'Channel Islands': 'United Kingdom',
                        'Curaçao': 'Netherlands',
                        'Faeroe Islands': 'Denmark',
                        'French Polynesia': 'France',
                        'French Guiana': 'France',
                        'Gibraltar': 'United Kingdom',
                        'Greenland': 'Denmark',
                        'Guadeloupe': 'France',
                        'Guam': 'US',
                        'Hong Kong': 'China',
                        'Isle of Man': 'United Kingdom',
                        'Macao': 'China',
                        'Martinique': 'France',
                        'Mayotte': 'France',
                        'Montserrat': 'United Kingdom',
                        'New Caledonia': 'France',
                        'Puerto Rico': 'US',
                        'Réunion': 'France',
                        'Saint Martin': 'France',
                        'Sint Maarten': 'Netherlands',
                        'St. Barth': 'France',
                        'Turks and Caicos': 'United Kingdom',
                        'U.S. Virgin Islands': 'US',
                        'Bahamas': 'Bahamas, The',
                        'Gambia': 'Gambia, The'
                        }

#############################################################
##
## READS WORLDOMETER STATS
##
## WEB PAGES ARE COMPLEX AND CAN CONTAIN SEVERAL TABLES. THIS IS WHY READ_HTML RETURNS
## A LIST OF TABLES. IN THIS CASE THE TABLES WE NEED ARE THE FIRST IN THE LIST
## (HENCE THE INDEXING [0]) CONTAINING THE WORLDOMETER STATS FROM TODAY, AND THE
## SECOND ONE (HENCE THE INDEXING [1]) CONTAINING THE WORLDOMETER STATS FROM YESTERDAY.
## THE OPTION "header=[0]" MEANS THAT JUST THE FIRST ROW OF
## THE TABLE MAKES UP THE HEADER; OTHERWISE IT COULD BE E.G. "header=[0,1]", IN
## WHICH CASE THE HEADER WOULD BE MADE UP FROM THE FIRST 2 LINES OF
## THE SCRAPED TABLE. PLEASE NOTE THAT THE LAST CALL IN THE FOLLOWING CODE LINES
## IS ".iloc[:-1]", SO AS TO DISCARD THE LAST LINE OF THE SCRAPED TABLE: THAT WOULD
## CONTAIN GRAND TOTALS THAT WE'RE NOT INTERESTED IN
##
#############################################################

# PULLS TODAY'S AND YESTERDAY'S DATA FROM WORLDOMETER SITE
url = 'https://www.worldometers.info/coronavirus/'

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

r = requests.get(url, headers=header)

worldometer_stats_today_orig_DF = pd.read_html(r.text, header=[0], index_col=0)[0].iloc[:-1]
worldometer_stats_yesterday_orig_DF = pd.read_html(r.text, header=[0], index_col=0)[1].iloc[:-1]
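To see the mechanics described in the comment above without hitting the live page, here is a self-contained toy illustration (made-up HTML, not the real worldometer markup) of why read_html returns a list and what ".iloc[:-1]" discards:

```python
from io import StringIO

import pandas as pd

# Toy page with two tables; the first one ends with a grand-total row,
# just like the worldometer table does.
html = """
<table>
  <tr><th>Country</th><th>Recovered</th></tr>
  <tr><td>Italy</td><td>10</td></tr>
  <tr><td>Total:</td><td>10</td></tr>
</table>
<table>
  <tr><th>A</th></tr>
  <tr><td>1</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> found on the page
tables = pd.read_html(StringIO(html), header=[0], index_col=0)
first = tables[0].iloc[:-1]  # drop the grand-total row

print(len(tables))
print(first)
```

Note that read_html needs an HTML parser backend (lxml, as mentioned later in this thread, or html5lib/BeautifulSoup).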

#############################################################
##
## A FEW VARIABLE DEFINITIONS
##
#############################################################

# CURRENT AND YESTERDAY DATE AND TIME
now = datetime.now()
yesterday = date.today() - timedelta(days=1)

# BUILDS TRAILER STRINGS TO APPEND TO THE FILE NAMES WE WILL USE AND TO USE TO FIX COLUMN NAMES
# IN THE NEW RECOVERIES DATAFRAME THAT WE'LL INTEGRATE IN JHU DATA
date_time_trailer = now.strftime("%Y_%m_%d")

# IF YOU USE WINDOWS, SWAP THE COMMENTS ON THE NEXT TWO LINES OF CODE:
# THERE IS A MISMATCH IN DATE FORMATTING DIRECTIVES, AS DOCUMENTED IN:
# https://stackoverflow.com/questions/10807164/python-time-formatting-different-in-windows
date_time_columnname = yesterday.strftime("%-m/%-d/%Y")
#date_time_columnname = yesterday.strftime("%m/%d/%Y")
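The Windows/Linux strftime mismatch can also be sidestepped entirely by building the unpadded column name from the date components; this f-string approach is my suggestion, not part of the original snippet:

```python
from datetime import date, timedelta

yesterday = date.today() - timedelta(days=1)

# Portable alternative sketch: no platform-specific "%-m"/"%-d" flags,
# still yields e.g. "3/25/2020" instead of "03/25/2020"
date_time_columnname = f"{yesterday.month}/{yesterday.day}/{yesterday.year}"
```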

#############################################################
##
## PUTTING TOGETHER ALL THE PIECES
##
#############################################################

# RENAMES NEW DATAFRAMES INDEX TO MATCH JHU DATAFRAME STANDARDS
worldometer_stats_today_orig_DF.index.names = ['Country/Region']
worldometer_stats_yesterday_orig_DF.index.names = ['Country/Region']

# FIXES DATATYPES AND COLUMN NAMES ON THE NEW RECOVERED STATS DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
# HERE, WE JUST USE "TODAY" DATAFRAME BECAUSE WE SUPPOSE WE ARE GENERATING OUR STATS ON THE "NEW DAY"
# SO WE HAVE TO LOOK AT NEW ACTIVE VALUES SINCE YESTERDAY
worldometer_recovered_orig_DF = worldometer_stats_today_orig_DF['TotalRecovered'].copy().to_frame()
worldometer_recovered_orig_DF['TotalRecovered'] = worldometer_recovered_orig_DF['TotalRecovered'].fillna(0).astype(np.int64)

# RENAMES THE RECOVERED COLUMN IN THE NEW DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
# USING THE "date_time_columnname" VARIABLE WE PREVIOUSLY INITIALIZED
# REMEMBER WE WANT TO HSTACK THIS NEW COLUMN ONTO THE RECOVERED INCREMENTAL CSV
# UNDER YESTERDAY'S DATE AS THE COLUMN NAME!
worldometer_recovered_orig_DF = worldometer_recovered_orig_DF.rename(columns={'TotalRecovered': date_time_columnname})

# RENAMES COUNTRY NAMES INSIDE THE INDEX OF OUR NEW DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
# USING THE FIX INDEX DICTIONARY "countries_fixes_dict" THAT WE PREVIOUSLY INITIALIZED
worldometer_recovered_orig_DF = worldometer_recovered_orig_DF.rename(index = countries_fixes_dict)

# AS NOW WE'LL HAVE MULTIPLE RECORDS FOR MANY COUNTRIES HAVING FOREIGN TERRITORIES (WE RENAMED
# ALL THE TERRITORIES AS THE MAIN COUNTRIES) WE GROUP AND SUM THE DATAFRAME AT COUNTRY LEVEL
# WE ALSO GIVE THIS DATAFRAME A NEW NAME BECAUSE IT IS THE GROUPED-BY-COUNTRY VERSION OF
# THE ORIGINAL WORLDOMETER DATAFRAME
worldometer_recovered_DF = worldometer_recovered_orig_DF.groupby(['Country/Region']).sum()
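To make the territory aggregation concrete, here is a toy sketch (made-up numbers) of what the rename + groupby/sum does: three worldometer rows collapse into a single country row.

```python
import pandas as pd

# Toy frame standing in for the worldometer recovered data
df = pd.DataFrame(
    {'3/25/2020': [5, 2, 3]},
    index=pd.Index(['UK', 'Gibraltar', 'Bermuda'], name='Country/Region'))

# Rename the territories to the main country, as the fixes dictionary does
df = df.rename(index={'UK': 'United Kingdom',
                      'Gibraltar': 'United Kingdom',
                      'Bermuda': 'United Kingdom'})

# Collapse the duplicate index labels into one row per country
grouped = df.groupby(['Country/Region']).sum()
print(grouped)
```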

# SAVES THE "worldometer_recovered_DF" DATAFRAME WITH A PROPER NAME ENRICHED WITH DATE AND TIME
# SO THAT WE HAVE A VERSION OF THE ORIGINAL DATA
worldometer_recovered_DF.to_csv(r'./worldometer_recovered_' + date_time_trailer + '.csv', index = True)

# LET'S PEEK INSIDE THE GENERATED DATAFRAME. YOU CAN NOW JOIN/HSTACK IT WITH THE DATAFRAME
# CONTAINING THE LATEST RECOVERIES DATA AVAILABLE @JHU! WATCH OUT: HERE WE
# AGGREGATED EVERYTHING AT COUNTRY LEVEL. YMMV. ADJUST ACCORDINGLY IF NEEDED
worldometer_recovered_DF.head()

(Output: screenshot of the resulting dataframe)
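For the final join/hstack step, here is a toy sketch (made-up numbers; both frames here are stand-ins, not the real JHU CSV or worldometer data):

```python
import pandas as pd

# Stand-in for the JHU recoveries time series, already grouped by country
jhu_df = pd.DataFrame(
    {'3/24/2020': [100, 7]},
    index=pd.Index(['Italy', 'US'], name='Country/Region'))

# Stand-in for the new worldometer column built by the snippet above
new_col = pd.DataFrame(
    {'3/25/2020': [120, 12]},
    index=pd.Index(['Italy', 'US'], name='Country/Region'))

# Left join on the index keeps every country already in the JHU frame
# and appends the new date column on the right
merged = jhu_df.join(new_col, how='left')
print(merged)
```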

mainadwitiya commented 4 years ago

noiceeeeeeeeeeeeeeeeeeee

ablanch5 commented 4 years ago

it seems you're missing import numpy as np.

Also line 110: date_time_columnname = yesterday.strftime("%-m/%-d/%Y") was throwing me an error. I took away the dashes and it worked: date_time_columnname = yesterday.strftime("%m/%d/%Y")

I'm now correctly getting worldometer_recovered_2020_03_26.csv in my folder

P.S. I also had to pip install lxml

r-lomba commented 4 years ago

@ablanch5 sorry, I forgot a few imports. I modified the code above, can you try now?

r-lomba commented 4 years ago

@ablanch5 the dashes are there to avoid zero-padding in the date, e.g. "3/25/2020" instead of "03/25/2020"; I used them to be compliant with the JHU column names... Maybe your Python build does not support that format?

ablanch5 commented 4 years ago

I'm using python 3.7.6. When I look at the csv file I'm getting 3/25/2020 like you wanted. Could it be a Linux/Windows thing? I don't know

r-lomba commented 4 years ago

@ablanch5 I don't know. I use Linux. Glad it worked, anyway! Enjoy :)

ablanch5 commented 4 years ago

I think its a windows/linux thing: https://stackoverflow.com/questions/10807164/python-time-formatting-different-in-windows

Nice work!

efra-mx commented 4 years ago

Something like this would fix the date format issue:

import os
...
...
# MISMATCH IN DATE FORMAT AS DOCUMENTED IN:
# https://stackoverflow.com/questions/10807164/python-time-formatting-different-in-windows
if os.name == 'nt':
    date_time_columnname = yesterday.strftime("%m/%d/%Y")
else:
    date_time_columnname = yesterday.strftime("%-m/%-d/%Y")
r-lomba commented 4 years ago

@efra-mx have you tested it? If you have and you're sure it works, I'll update the code snippet. Otherwise, let's just leave your comment here as inspiration for future readers.

In the meantime I've updated the code, because as of today, if you don't "pretend to be a browser", they give you a 403 error. That problem is now fixed (I don't get why they try to protect a public HTML table on a public page in times of pandemic... Nevertheless, for now it still works with the updated code)

efra-mx commented 4 years ago

@r-lomba I don't have a windows machine myself. But you can check: https://stackoverflow.com/questions/1325581/how-do-i-check-if-im-running-on-windows-in-python https://stackoverflow.com/questions/1854/what-os-am-i-running-on

Good job by the way.

efra-mx commented 4 years ago

I wanted to share a refactored version of the code done by @r-lomba. It is wrapped in a class, so it will be easier to use:

r = CovidReader()
r.fetch()
print(r.recovered_by_country())

I hope you find it helpful

import os
import pandas as pd
import numpy as np
import dateutil
import time
import copy
from datetime import datetime, timedelta, date
import requests

#############################################################
##
## HARDCODED DICTIONARY - FIXES FOR COUNTRIES THAT ARE NAMED
## DIFFERENTLY IN WORLDOMETER DATA THAN IN JHU DATA.
## HERE WE RENAME COUNTRIES IN THE RECOVERED DATAFRAME FROM WORLDOMETER
## (ON THE LEFT HAND SIDE) TO MAKE THE COUNTRY NAMES MATCH
## THOSE USED AT JHU (ON THE RIGHT HAND SIDE)
##
## PLEASE NOTE THAT FOREIGN TERRITORIES OF MAIN COUNTRIES ARE
## RENAMED AS THE MAIN COUNTRY ITSELF: THIS IS BECAUSE WE ASSUME STATS
## ARE AGGREGATED AT COUNTRY LEVEL. YMMV. ADJUST ACCORDINGLY IF NEEDED
##
#############################################################

# DEFINES COUNTRY NAMES FIXES DICTIONARY
countries_fixes_dict = {'CAR': 'Central African Republic',
                        'Congo': 'Congo (Brazzaville)',
                        'DRC': 'Congo (Kinshasa)',
                        'Diamond Princess': 'Cruise Ship',
                        'Ivory Coast': 'Cote d\'Ivoire',
                        'S. Korea': 'Korea, South',
                        'St. Vincent Grenadines': 'Saint Vincent and the Grenadines',
                        'Taiwan': 'Taiwan*',
                        'UAE': 'United Arab Emirates',
                        'UK': 'United Kingdom',
                        'USA': 'US',
                        'Vatican City': 'Holy See',
                        'Aruba': 'Netherlands',
                        'Bermuda': 'United Kingdom',
                        'Cayman Islands': 'United Kingdom',
                        'Channel Islands': 'United Kingdom',
                        'Curaçao': 'Netherlands',
                        'Faeroe Islands': 'Denmark',
                        'French Polynesia': 'France',
                        'French Guiana': 'France',
                        'Gibraltar': 'United Kingdom',
                        'Greenland': 'Denmark',
                        'Guadeloupe': 'France',
                        'Guam': 'US',
                        'Hong Kong': 'China',
                        'Isle of Man': 'United Kingdom',
                        'Macao': 'China',
                        'Martinique': 'France',
                        'Mayotte': 'France',
                        'Montserrat': 'United Kingdom',
                        'New Caledonia': 'France',
                        'Puerto Rico': 'US',
                        'Réunion': 'France',
                        'Saint Martin': 'France',
                        'Sint Maarten': 'Netherlands',
                        'St. Barth': 'France',
                        'Turks and Caicos': 'United Kingdom',
                        'U.S. Virgin Islands': 'US',
                        'Bahamas': 'Bahamas, The',
                        'Gambia': 'Gambia, The'
                        }

#############################################################
##
## READS WORLDOMETER STATS
##
## WEB PAGES ARE COMPLEX AND CAN CONTAIN SEVERAL TABLES. THIS IS WHY READ_HTML RETURNS
## A LIST OF TABLES. IN THIS CASE THE TABLES WE NEED ARE THE FIRST IN THE LIST
## (HENCE THE INDEXING [0]) CONTAINING THE WORLDOMETER STATS FROM TODAY, AND THE
## SECOND ONE (HENCE THE INDEXING [1]) CONTAINING THE WORLDOMETER STATS FROM YESTERDAY.
## THE OPTION "header=[0]" MEANS THAT JUST THE FIRST ROW OF
## THE TABLE MAKES UP THE HEADER; OTHERWISE IT COULD BE E.G. "header=[0,1]", IN
## WHICH CASE THE HEADER WOULD BE MADE UP FROM THE FIRST 2 LINES OF
## THE SCRAPED TABLE. PLEASE NOTE THAT THE LAST CALL IN THE FOLLOWING CODE LINES
## IS ".iloc[:-1]", SO AS TO DISCARD THE LAST LINE OF THE SCRAPED TABLE: THAT WOULD
## CONTAIN GRAND TOTALS THAT WE'RE NOT INTERESTED IN
##
#############################################################
class CovidReader:
    url = "https://www.worldometers.info/coronavirus/"
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }

    def __init__(self):
        self._now = datetime.now()
        self._yesterday = date.today() - timedelta(days=1)
        self._today_df = None
        self._yesterday_df = None
        self._recovered_df = None

    def now(self):
        return self._now

    def yesterday(self):
        return self._yesterday

    def today_df(self):
        return self._today_df

    def yesterday_df(self):
        return self._yesterday_df

    def recovered_df(self):
        return self._recovered_df

    def fetch(self):
        self._fetch_data()
        self._fix_data()

    # PULLS TODAY'S AND YESTERDAY'S DATA FROM WORLDOMETER SITE
    def _fetch_data(self):
        self._now = datetime.now()
        self._yesterday = date.today() - timedelta(days=1)

        r = requests.get(CovidReader.url, headers=CovidReader.header)
        self._today_df = pd.read_html(r.text, header=[0], index_col=0)[0].iloc[:-1]
        self._yesterday_df = pd.read_html(r.text, header=[0], index_col=0)[1].iloc[:-1]

    def _fix_data(self):

        # RENAMES NEW DATAFRAMES INDEX TO MATCH JHU DATAFRAME STANDARDS
        self._today_df.index.names = ['Country/Region']
        self._yesterday_df.index.names = ['Country/Region']

        # FIXES DATATYPES AND COLUMN NAMES ON THE NEW RECOVERED STATS DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
        # HERE, WE JUST USE "TODAY" DATAFRAME BECAUSE WE SUPPOSE WE ARE GENERATING OUR STATS ON THE "NEW DAY"
        # SO WE HAVE TO LOOK AT NEW ACTIVE VALUES SINCE YESTERDAY
        self._recovered_df = self._today_df['TotalRecovered'].copy().to_frame()
        self._recovered_df['TotalRecovered'] = self._recovered_df['TotalRecovered'].fillna(0).astype(np.int64)

        # RENAMES THE RECOVERED COLUMN IN THE NEW DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
        # USING THE LOCAL "date_time_columnname" VARIABLE BUILT BELOW
        # REMEMBER WE WANT TO HSTACK THIS NEW COLUMN ONTO THE RECOVERED INCREMENTAL CSV
        # UNDER YESTERDAY'S DATE AS THE COLUMN NAME!
        #
        # MISMATCH IN DATE FORMAT AS DOCUMENTED IN:
        # https://stackoverflow.com/questions/10807164/python-time-formatting-different-in-windows
        if os.name == 'nt':
            date_time_columnname = self._yesterday.strftime("%m/%d/%Y")
        else:
            date_time_columnname = self._yesterday.strftime("%-m/%-d/%Y")
        self._recovered_df = self._recovered_df.rename(columns={'TotalRecovered': date_time_columnname})

        # RENAMES COUNTRY NAMES INSIDE THE INDEX OF OUR NEW DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
        # USING THE FIX INDEX DICTIONARY "countries_fixes_dict" THAT WE PREVIOUSLY INITIALIZED
        self._recovered_df = self._recovered_df.rename(index = countries_fixes_dict)

    # AS NOW WE'LL HAVE MULTIPLE RECORDS FOR MANY COUNTRIES HAVING FOREIGN TERRITORIES (WE RENAMED
    # ALL THE TERRITORIES AS THE MAIN COUNTRIES) WE GROUP AND SUM THE DATAFRAME AT COUNTRY LEVEL
    # THIS METHOD RETURNS THE GROUPED-BY-COUNTRY VERSION OF THE
    # ORIGINAL WORLDOMETER DATAFRAME
    def recovered_by_country(self):
        df = self._recovered_df.groupby(['Country/Region']).sum()
        return df

    # SAVES THE "_recovered_df" DATAFRAME WITH A PROPER NAME ENRICHED WITH DATE AND TIME
    # SO THAT WE HAVE A VERSION OF THE ORIGINAL DATA
    def save_recovered(self, basename=r'./covid_recovered_', **kwargs):
        # BUILDS TRAILER STRINGS TO APPEND TO THE FILE NAMES WE WILL USE AND TO USE TO FIX COLUMN NAMES
        # IN THE NEW RECOVERIES DATAFRAME THAT WE'LL INTEGRATE IN JHU DATA
        date_time_trailer = self._now.strftime("%Y_%m_%d")
        self._recovered_df.to_csv(basename + date_time_trailer + '.csv', **kwargs)
r-lomba commented 4 years ago

Brilliant, @efra-mx !! Thanks a lot for this :)

marco-cheung commented 4 years ago

Since 'Diamond Princess' cases are separated from 'Japan' in the Worldometers table, may I know how to group the Diamond Princess cases into Japan? Thanks


efra-mx commented 4 years ago

Something like this would help :

df.groupby(['country/other']).sum()


marco-cheung commented 4 years ago

@efra-mx Thanks for the follow-up. I can group them using this function after converting the object data fields into float. However, the original 'Continent' column is missing after grouping. May I know how to add it back into the dataframe? Thanks!


efra-mx commented 4 years ago

the easiest way would be, if you also use the version of my script:

import worldmeter_stats as wms

reader = wms.CovidReader()
reader.fetch()
df = reader.today_df()
print(df)

# sum
japan = df.loc[['Japan', 'Diamond Princess']].sum(numeric_only=True)

# update
df.loc['Japan'] = japan
df = df.drop('Diamond Princess')

print(df.loc[['Japan']])
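The same loc/sum/drop fold can be tried offline with a self-contained toy frame (made-up numbers, no network access needed):

```python
import pandas as pd

# Toy frame standing in for the worldometer "today" dataframe
df = pd.DataFrame(
    {'TotalRecovered': [1000, 600]},
    index=pd.Index(['Japan', 'Diamond Princess'], name='Country/Region'))

# Sum the two rows, write the total back to 'Japan', drop the ship row
japan = df.loc[['Japan', 'Diamond Princess']].sum(numeric_only=True)
df.loc['Japan'] = japan
df = df.drop('Diamond Princess')
print(df)
```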


marco-cheung commented 4 years ago

Thanks @efra-mx! That works for me and sounds straightforward to understand your code with comments!

Seelova commented 3 years ago

I get this error running the code: ValueError: No tables found matching pattern '.+'

How to fix that?