como-ph / oxcovid19

An R API to the Oxford COVID-19 Database
https://como-ph.github.io/oxcovid19
GNU General Public License v3.0

oxcovid19 database website has changed structure #28

Closed ernestguevarra closed 4 years ago

ernestguevarra commented 4 years ago

Need to update the sources list and structures list, but the oxcovid19 database website has changed structure, so the data scraping functions in data-raw most likely no longer work.

aezarebski commented 4 years ago

@ernestguevarra can you provide a link to the affected part of the code please? The relevant data is here when you click the "source" tab, which reveals it (it's already in the HTML anyway). Perhaps we could make the data available in a convenient form on the page and then just fetch it when requested.

ernestguevarra commented 4 years ago

> @ernestguevarra can you provide a link to the affected part of the code please? The relevant data is here when you click the "source" tab, which reveals it (it's already in the HTML anyway). Perhaps we could make the data available in a convenient form on the page and then just fetch it when requested.

Yes, I understand that the information is still there and that the actual web link hasn't changed. However, the tables in the HTML for sources have changed and cannot be read in the same way as they were in the original website format. The schema tables work fine, however, so they are not affected.

This doesn't affect any of the functions in the package. It affects the code in the data-raw folder that scrapes this information from the website and creates the corresponding data_sources and data_structures datasets included in the package. I don't think the data structure information has changed (but I will check and update the data for that), but I do think that the list of data sources has expanded, so I want to update that. Worst-case scenario is that I will just copy and paste the data sources tables and re-create them in the data_sources dataset.

Thanks.

ernestguevarra commented 4 years ago

The issue seems to be that the data sources page (https://covid19.eng.ox.ac.uk/data_sources.html) now uses a JavaScript function to generate the sources table within the new site structure, using HTML from the old website structure that is saved in the OxCOVID19 Project's GitHub account (see https://raw.githubusercontent.com/covid19db/web-page-data/master/html/data_sources.html). This means that the text of the sources table is not in the page's HTML itself but is generated on the fly when the page is visited. The fix is to scrape the information from the original HTML URL above. This has now been implemented and is included in the latest PR.
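
For anyone following along, the idea of reading the tables from the static raw HTML rather than from the JavaScript-rendered page can be sketched roughly as below. This is only an illustration in Python using the standard library, not the code in data-raw (which is R). The names `TableParser`, `parse_tables`, `fetch_source_tables` and the `SAMPLE` table are hypothetical, and the sketch assumes the raw file contains plain, non-nested `<table>` elements with explicitly closed cells.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Raw HTML file mentioned in the comment above; its table layout is assumed here.
RAW_URL = (
    "https://raw.githubusercontent.com/covid19db/web-page-data/"
    "master/html/data_sources.html"
)


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings.

    Hypothetical helper: handles simple, non-nested tables only.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def parse_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def fetch_source_tables(url=RAW_URL):
    """Download the static HTML and parse its tables (needs network access)."""
    with urlopen(url) as response:
        return parse_tables(response.read().decode("utf-8"))


# Made-up table, only to show the shape of the result.
SAMPLE = """
<table>
  <tr><th>Country</th><th>Source</th></tr>
  <tr><td>Examplia</td><td><a href="https://example.org">Example  Health Ministry</a></td></tr>
</table>
"""

if __name__ == "__main__":
    for row in parse_tables(SAMPLE)[0]:
        print(row)
```

Because the raw file is static HTML, no browser or JavaScript engine is needed. Calling `fetch_source_tables()` would apply the same parsing to the real file, provided its tables match the assumptions above.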