GeoDaCenter / covid

COVID Atlas alpha code
https://geodacenter.github.io/covid/
GNU General Public License v3.0

Prepare (Secondary) USAFacts Data Stream #35

Closed Makosak closed 4 years ago

Makosak commented 4 years ago

Following findings from the county validation team, we need to switch to a data source with fewer merging issues while retaining accuracy and CDC standards. Chats with @linqinyu and @SteveGoldstein coalesced in sanctioning this move to USAFacts, at least until we flip to a validated multi-source dataset down the road. Validation efforts will continue.

Another idea down the road -- include a drop down of data source so we could include multiple that way.

For now -- switch to USAFacts with the easy FIPS merge? @lixun910 interested in this one?

qinyun-lin commented 4 years ago

Xun and I chatted this afternoon. 1P3A does have a lot of merging problems. Their county names keep changing (every day!) and new county names keep popping up that cannot be directly merged to the geojson file. But Xun also mentioned that we should keep our 1P3A data going since we have been using it so far (which I agree with).

(Xukun and Xun are working on the merging problem for today right now. Hopefully, we can set up a procedure so that Xukun and I can clean the merging problem every day (others can help later) and then Xun can more easily update the website.)

The only concern with providing multiple datasets is: if the results are different (say hotspots are different in USAFacts and 1P3A), people would get confused.

So here is what I am proposing (totally open to comments/suggestions): I can work on some code to merge USAFacts with the geojson file tonight. I guess we can aim for switching to USAFacts over the weekend if that works. At the same time, we keep downloading 1P3A data every day and validate it against other sources (but keep it in the back end). I think we should only display one source on the website to avoid confusion. See what you think!
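A minimal sketch of what a FIPS-based merge could look like, not the project's actual code. The column names `countyFIPS` and `cases`, the GeoJSON property `GEOID`, and the sample rows are all assumptions made for illustration. The key step is zero-padding the FIPS code to five characters on both sides before joining, and reporting source rows that match no county.

```python
import csv
import io


def normalize_fips(value):
    """Return a 5-character, zero-padded county FIPS string."""
    return str(value).strip().zfill(5)


def merge_counts_into_geojson(geojson, csv_text, fips_column="countyFIPS",
                              count_column="cases", geoid_property="GEOID"):
    """Attach case counts to GeoJSON features by FIPS (modifies geojson in place).

    Returns the GeoJSON and a sorted list of FIPS codes from the CSV
    that matched no feature. Column and property names are hypothetical.
    """
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[normalize_fips(row[fips_column])] = int(row[count_column])

    unmatched = set(counts)
    for feature in geojson["features"]:
        props = feature["properties"]
        fips = normalize_fips(props[geoid_property])
        props["cases"] = counts.get(fips)  # None when the source has no row
        unmatched.discard(fips)
    return geojson, sorted(unmatched)


# Illustrative data only: two counties plus one row with no county FIPS.
sample_geojson = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"GEOID": "01001"}, "geometry": None},
    {"type": "Feature", "properties": {"GEOID": "49023"}, "geometry": None},
]}
sample_csv = "countyFIPS,cases\n1001,12\n49023,3\n0,7\n"

merged, unmatched = merge_counts_into_geojson(sample_geojson, sample_csv)
```

Returning the unmatched codes makes merge failures visible each day instead of silently dropping rows.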


(Sent by email in reply to Marynia's original post, "Switch to USAFacts Data Stream", of Friday, April 3, 2020, 3:57 PM.)

lanselin commented 4 years ago

Alternatively, if you collect two sources, there could be an option to select the one to be mapped. I don’t think it’s a good idea to switch sources in mid-stream since that makes the maps historically no longer comparable. Consistency is important!

It may make sense to start showing maps using a different source in addition to the current map, but switching sources in mid-stream is a no-go as far as I am concerned.

(Sent by email in reply to Qinyun Lin's comment of Apr 3, 2020, 4:56 PM.)

Makosak commented 4 years ago

Okay, so it sounds like it's best to keep parallel efforts to 1) clean 1P3A further and continue its use and 2) have USAFacts ready to go as a (secondary) option? If yes, I can update the task here to "prepare USAFacts Data Stream."

I've grown increasingly nervous with 1P3A as their team hasn't responded to any questions for a while, and the county merge issue seems a bit sloppy so that flags a potential validation issue. I think that because it's crowdsourced, some areas have more care taken to validate than others. We mainly chose 1P3A to start as it was the only dataset available at the beginning.

If the results differ between the 2 datasets, we could make that available to groups so they have more information to make the best choice? Even if, especially if, they differ. @SteveGoldstein had a great idea in creating a metric to capture how much sources disagree for a single county, as a measure of uncertainty -- that could be a future idea to help with this, too.
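One possible form such a disagreement metric could take is sketched below. The formula (relative spread, `(max - min) / max`) is an assumption chosen for illustration; the thread does not specify the metric, and the counts in the example are made up.

```python
def disagreement(counts_by_source):
    """Relative spread of case counts across sources for one county.

    Returns (max - min) / max, a value in [0, 1]; 0.0 when every source
    reports zero. This formula is a hypothetical choice, not the team's.
    """
    values = list(counts_by_source.values())
    highest = max(values)
    if highest == 0:
        return 0.0
    return (highest - min(values)) / highest


# Illustrative counts only.
full_disagreement = disagreement({"1P3A": 2, "USAFacts": 0})
mild_disagreement = disagreement({"1P3A": 100, "USAFacts": 90})
no_cases = disagreement({"1P3A": 0, "USAFacts": 0})
```

Because it only uses the maximum and minimum, the same function works unchanged if a third source is added later.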

lanselin commented 4 years ago

I think the comparison might be a useful map in its own right, i.e., highlight where the discrepancies are the greatest. This would be of interest in and of itself in terms of testing the “representativeness” of the crowdsourcing. For example, there could be systematic regional biases (or not).
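A rough sketch of how systematic regional bias could be checked: average the signed difference between two sources per state, using the first two digits of the county FIPS as the state code. The function name and the sample counts are hypothetical.

```python
from collections import defaultdict


def mean_signed_difference_by_state(source_a, source_b):
    """Average (a - b) per state over counties present in both sources.

    Inputs map 5-digit county FIPS strings to case counts. A mean far
    from zero suggests one source runs consistently higher in that state.
    """
    totals = defaultdict(lambda: [0, 0])  # state -> [sum of differences, county count]
    for fips in source_a.keys() & source_b.keys():
        state = fips[:2]
        totals[state][0] += source_a[fips] - source_b[fips]
        totals[state][1] += 1
    return {state: diff / n for state, (diff, n) in totals.items()}


# Illustrative counts only.
crowdsourced = {"49023": 1, "49039": 1, "17031": 100}
official = {"49023": 0, "49039": 0, "17031": 104}
bias = mean_signed_difference_by_state(crowdsourced, official)
```

Joining the per-state values back onto a state layer would give the kind of discrepancy map described above.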

The importance of sticking with 1P3A is to keep consistency with the earlier maps. One simply cannot build a timeline switching data sources in mid-stream, especially if they may differ in important aspects. No problem adding an extra data set and comparing them, but we need to keep a comparable historical timeline.

(Sent by email in reply to Marynia's comment of Apr 3, 2020, 5:07 PM.)

qinyun-lin commented 4 years ago

Makes sense. The plan has changed: continue cleaning 1P3A and add USAFacts as another option. We will continue the validation and check for systematic discrepancies.

Just found a difference between USAFacts and 1P3A: for some counties in UT, the state reports cases for a combination of several counties, rather than separately for each county. For example, we only know there are 2 cases for all 6 counties in the so-called "Central UT" region (Juab, Sanpete, Millard, Sevier, Piute, Wayne), but we don't know which exact county these 2 cases are from. 1P3A captured this and reported these 2 cases. But USAFacts just ignored these cases. For 1P3A, shall we combine/merge these 6 counties into one polygon then?
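Whichever way the polygons are handled, the regional rows first need to be told apart from county rows. A minimal sketch, assuming each record carries a `fips` field that is empty for regional aggregates; the field names and the Salt Lake count are made up for illustration.

```python
def split_county_and_regional(records, county_fips):
    """Separate records matching a known county FIPS from regional aggregates.

    records: list of dicts with hypothetical keys "name", "fips", "cases".
    county_fips: set of valid 5-digit county FIPS strings.
    """
    county_rows, regional_rows = [], []
    for record in records:
        if record.get("fips") in county_fips:
            county_rows.append(record)
        else:
            regional_rows.append(record)
    return county_rows, regional_rows


# Illustrative records: one county row and the "Central UT" regional row.
records = [
    {"name": "Salt Lake", "fips": "49035", "cases": 10},
    {"name": "Central UT", "fips": None, "cases": 2},
]
county_rows, regional_rows = split_county_and_regional(records, {"49035"})
unallocated_cases = sum(row["cases"] for row in regional_rows)
```

Keeping the regional rows in their own list means those cases can still be counted in state totals even though they are not mapped to a county.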

Makosak commented 4 years ago

Keep them as county only, to match the master county shapefile... Alaska also only reports regions (at least last week they did), so it could have similar issues. While health departments are the gold standard for cases reported, we should keep a higher "county file" standard that matches the Census, so as to maintain county geographies, etc.


-- Marynia A. Kolak, PhD, MFA, MS Assistant Director of Health Informatics Assistant Instructional Professor in Geographic Information Science Center for Spatial Data Science at the University of Chicago

lanselin commented 4 years ago

Agreed, stick with counties consistently.

(Sent by email in reply to Marynia's comment of Apr 3, 2020, 9:15 PM.)

qinyun-lin commented 4 years ago

Got it. Documented this in our Google Doc; it will be shared later in a methods/README file.


(Sent by email in reply to Luc Anselin's comment of Saturday, April 4, 2020, 5:43 AM.)

Makosak commented 4 years ago

A user found a major discrepancy in the data for a county in PA: the 1P3A website has the accurate number, but their API gave us a number that was orders of magnitude off. +1 for having an alternate dataset as the primary default (okay to keep 1P3A as a secondary), as the errors are getting more extreme.

Makosak commented 4 years ago

Looks great!!