CEIDatUGA / COVID-19-DATA

GNU General Public License v3.0

read_UScases_wikipedia.R pulling incorrect table structure #27

Closed allopole closed 4 years ago

allopole commented 4 years ago

As of this morning, the script is returning a table with some data misaligned. I'm having the same issue with different code accessing the same Wikipedia table. Something might have changed in the wiki table structure that's causing the rvest/xml2 code to return wrong results.

lsalvador commented 4 years ago

Hi, thanks for letting me know. I will take a look at it now and try to fix it.

allopole commented 4 years ago

The problem lies with the html_table(fill = TRUE) function returning an extra column before ID.

lsalvador commented 4 years ago

A new column 'Date' was added at the end of all state columns. I guess they did that for visualization purposes. I will address that in the code and commit shortly.
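One way to make the script less sensitive to this kind of layout change is to select columns by name rather than by position. The following is only a minimal sketch, not the fix that was committed: the toy data frame stands in for the output of html_table(), and the column names (`Extra`, `WA`, `NY`) are hypothetical.

```r
# Toy stand-in for the data frame returned by html_table();
# the column names are hypothetical, not the real Wikipedia headers.
scraped <- data.frame(
  c(NA, NA),
  c("Mar 1", "Mar 2"),
  c(2, 4),
  c(1, 0),
  c("Mar 1", "Mar 2"),
  stringsAsFactors = FALSE
)
names(scraped) <- c("Extra", "Date", "WA", "NY", "Date")

# Columns the downstream code expects (an assumption for this example).
expected <- c("Date", "WA", "NY")

# Fail loudly if the layout changed in a way we cannot handle.
missing_cols <- setdiff(expected, names(scraped))
if (length(missing_cols) > 0) {
  stop("Table layout changed; missing columns: ",
       paste(missing_cols, collapse = ", "))
}

# match() returns the first occurrence of each name, so the extra leading
# column and the repeated trailing 'Date' column are both ignored.
cleaned <- scraped[, match(expected, names(scraped)), drop = FALSE]
```

With this approach an added or repeated column is ignored, and a renamed or removed column stops the script instead of silently misaligning the data.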

lsalvador commented 4 years ago

Changed the code to accept the new format and pushed it to the repo. I also have two questions: 1) The number of first recovered cases is missing from the wiki table on Feb 15. I changed the code to fill these in from the cumulative recovered cases column, but I think it would be better if this were edited in the wiki itself. Any recommendations on how to do this? 2) The table totals are output with a plus sign. I have been deleting these and correcting the total amounts (which are sometimes incorrect). Shall I continue to do that, or shall I upload the table exactly as it is in the wiki?
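For the plus signs, the cleanup could be done in code rather than by hand. This is a rough sketch only; the example strings are made up, and the real table may need different handling.

```r
# Hypothetical raw values as they might come out of the scraped table.
raw_counts <- c("12", "+1,234", "", "7+")

# Keep digits only, which drops plus signs and thousands separators.
digits_only <- gsub("[^0-9]", "", raw_counts)

# Treat empty cells as missing rather than as zero.
counts <- as.integer(ifelse(digits_only == "", NA, digits_only))
```

Empty cells are kept as NA here so that a missing value is not mistaken for a count of zero.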

allopole commented 4 years ago

I've never edited Wikipedia, so I can't advise you there. Agreed that would be better. I'm not using the cumulative counts or totals myself. My inclination is to throw out the totals row altogether and only download the dated rows. The totals can be calculated directly from the data anyway, and a table without a totals row is easier to use right away in R. We link back to Wikipedia anyway, so people can go there to see the totals if they want.
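Roughly what I have in mind, as a sketch only: the toy table, the column names, and the ISO date format are all assumptions, and the format string would need to match whatever the wiki table actually uses.

```r
# Toy table; column names and values are hypothetical.
cases <- data.frame(
  Date = c("2020-03-01", "2020-03-02", "Total"),
  WA   = c(2, 4, 6),
  NY   = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Rows whose Date does not parse (e.g. "Total") come back as NA.
parsed <- as.Date(cases$Date, format = "%Y-%m-%d")

# Keep only the dated rows.
dated <- cases[!is.na(parsed), ]

# Totals can then be recomputed from the dated rows when needed.
totals <- colSums(dated[, c("WA", "NY")])
```

Filtering on whether the date parses, rather than on the literal label "Total", also drops any other non-date rows such as repeated header rows.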

lsalvador commented 4 years ago

I like that option - much cleaner. Shall I let people know that this change will be made, in case they are using the totals row? I am not sure who is using these files. If yes, what is the best way? Slack?

mvevans89 commented 4 years ago

There are no projects listed as using the data in the metadata on GitHub, and no one has added any to the project spreadsheet. This may mean no one is using it, but more likely means people just haven't added their projects to the sheet.

Slack is probably the best channel to use, and I can reply with a gentle reminder for people to update which data their projects are using if they want a notification when a dataset is changed in a significant way.

lsalvador commented 4 years ago

Sounds good. Will do that on the next update ~7pm. Thank you both!

metasj commented 4 years ago

Just discovering this project -- fyi you can leave edit requests on the talk page for the template or article (or ping me about it) to update the underlying WP table, if you don't want to edit it directly.

mvevans89 commented 4 years ago

Thanks! I'll keep this in mind if we run into any more formatting issues on our end.