Thanks here as well. I am looking at bringing it into the data explorer that is currently being built. Time series would be great, but some of this could be achieved by regularly pulling the data, I imagine. There is a project working on the time series here: https://github.com/datasciencecampus/mobility-report-data-extractor. It does not have location information attached to the exports, however, which makes it hard to join.
I didn't test this code, but the author claims that his script allows you to collect data from graphs.
I'll try it out and let you know. Thanks so much!
The only downside is that it requires manual download. Also, the first group is now looking to add place names as well, but those are manual downloads too. I am using this data in my dashboard currently but would love to have time series eventually.
@ladew222 followed you here from the other issue. Would you like a big ol' csv of all US county data? I got as far as extracting the county names and order, which lets me associate them to the correct `./output/US-*/csv/n.csv`. If I looped through and merged them all, I could just share that .csv directly...
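Roughly what I have in mind, as a sketch; the `county_index` mapping here is a hypothetical stand-in for the county names/order I extracted, and I'm assuming the extractor's csvs are headerless `value,date,plot_id` rows:

```python
import pandas as pd
from pathlib import Path

# Hypothetical mapping derived from the extracted county names/order:
# (state, county, segment) -> plot index n for ./output/US-<state>/csv/n.csv
county_index = {
    ("Ohio", "Lake", "Parks"): 261,  # example entry
}

frames = []
for (state, county, seg), n in county_index.items():
    path = Path(f"output/US-{state}/csv/{n}.csv")
    # extractor csvs look like: value,date,plot_id (no header)
    df = pd.read_csv(path, header=None, names=["value", "date", "plot_id"])
    df["state"], df["county"], df["seg"] = state, county, seg
    frames.append(df)

df_ts = pd.concat(frames, ignore_index=True)
df_ts.to_csv("mobility-data-ts.csv", index=False)
```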
That would be amazing @jwhendy. It would be great to have time series as my tool uses correlations based on COVID data from https://github.com/tomquisel/covid19-data as well as census data--and users would be able to see how this behavior is correlating on particular dates at the state and county level.
I didn't wait :) Working on it now, though I had a glitch I caught in multi-word place names (US-New, US-South).
Interesting to hear what you're doing... my brother and I just started putzing around with some analysis as well. I had very similar ideas, wondering if there were trends that could be identified. Our basic aim was to create an "anomaly" or "outlier" detection system. That way, counties with similar characteristics could be compared, and perhaps local officials could be alerted when things looked "much worse" than in similar counties.
To do that, I realized, one needs at least some understanding of what a county should look like, which was tough. I've been bringing in census data as well. So far I have age, income, population, and land area (the latter two allowing us to calculate population density). % mobility seemed to be the next logical thing.
Honestly, if that fails, I think I'll concede that our data/testing situation is so bad that nothing can be said at all. As a test case: if we can't differentiate between similar counties in SF (very early stay-at-home) and, say, FL (April 3)... then either staying at home has zero effect, the model is so complex that even this intuitively dominant variable isn't predictive, or the data is utter garbage.
Thoughts? And if we're kind of doing the same thing (overall visualizer/explorer)... interest in working on that together?
Heads up that this includes the summary rows (`df.county=='summary'`) and the states themselves (`df.area=='United States'`), so there's redundancy in that set. Let me know if you find any errors. Currently not verified, but the plots are cool :) Look at the dual modes of residential and workplace, which might indicate those are the most robust indicators of what people are doing?
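If you only want county rows, a filter along these lines should do it (a sketch against the column names mentioned above):

```python
import pandas as pd

df = pd.read_csv("mobility-data-ts.csv")

# drop the per-state summary rows and the state-level rows,
# leaving one row per county/segment/date
df_counties = df[(df.county != "summary") & (df.area != "United States")]
```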
Cool. Just read your message. We have similar goals, although your work is more statistically grounded. I did some early analysis in R, but then moved in the direction of visualization for general consumption, which led me to use D3. It allows users to view the strength of correlation using a Pearson JS library, but it is not comprehensive in that regard. If the stats got more intense I would most likely need a Python backend server, which I use in other cases. Basically this visual is a way for people to look at a choropleth and visualize a bunch of variables that we are all thinking about. But I think your work is a complement, as it is designed for different purposes and a different audience. Great idea. I will take a look. Thanks.
I think I'm doing something similar, basically trying to visualize various relationships to see what matters. I think I'm just doing that via scatter and line plots vs. your maps. Overall, my "statistical" approach is rather blind hunting for obvious correlations.
So far, they aren't really jumping out at me. I think if any visualizer (or statistical analysis) is useful (vs. just being interesting to look at), it needs to provide some explanation for, say, CA vs. FL. Or NY vs. others.
If the goal is just to see a dashboard, the above is probably irrelevant. If the goal is some sort of insight, however, I thought it best to start at the extremes (low-hanging fruit) and see what variables help make sense of these things. Admittedly the data might not exist... like which data set contains "number of huge COVID-infested cruise ships docking per state"?!
Cool dashboards, btw! I hadn't actually looked, but just did.
Thanks. And let me know if you see anything that does not make sense or if something is missing. BTW, do you have a non-summarized version of the data you extracted, where the values for each county on a given date can be discerned? I would need this to integrate it into the county-level map. If not, I can go back and work on that one.
Hmmm. So you mean one entry per county per category per date? If so, does this seem like what you want? (I admit I had a filter on the wrong column name which didn't remove state-level data, but the county-level stuff was still there. Edit: that's fixed now; summary rows and overall US/state data are removed.)
```
df_ts.head(10)
     state   county                  seg        date  value
0  Alabama  Autauga  Retail & recreation  2020-02-16  0.173
1  Alabama  Autauga  Retail & recreation  2020-02-17  7.732
2  Alabama  Autauga  Retail & recreation  2020-02-18 -1.675
3  Alabama  Autauga  Retail & recreation  2020-02-19 -1.663
4  Alabama  Autauga  Retail & recreation  2020-02-20 -7.739
5  Alabama  Autauga  Retail & recreation  2020-02-21 -2.761
6  Alabama  Autauga  Retail & recreation  2020-02-22  0.982
7  Alabama  Autauga  Retail & recreation  2020-02-23 -0.172
8  Alabama  Autauga  Retail & recreation  2020-02-24 -2.062
9  Alabama  Autauga  Retail & recreation  2020-02-25  9.883
```
```
df_ts.tail(10)
      state  county          seg        date  value
33  Wyoming  Weston  Residential  2020-03-20    NaN
34  Wyoming  Weston  Residential  2020-03-21    NaN
35  Wyoming  Weston  Residential  2020-03-22    NaN
36  Wyoming  Weston  Residential  2020-03-23    NaN
37  Wyoming  Weston  Residential  2020-03-24    NaN
38  Wyoming  Weston  Residential  2020-03-25    NaN
39  Wyoming  Weston  Residential  2020-03-26    NaN
40  Wyoming  Weston  Residential  2020-03-27    NaN
41  Wyoming  Weston  Residential  2020-03-28    NaN
42  Wyoming  Weston  Residential  2020-03-29    NaN
```
(Missing values because this county is one with little data.)
Let me know if I misunderstand.
Yes. That would be perfect.
File is linked above! It's available for you :)
Thanks for the contribution! I will let you know when I have it in the dashboard.
@jwhendy thanks so much for posting! Do you plan to keep this up to date? Sounds like google updated the mobility reports. Thanks!
Thanks for the work, mates! But I recently explored the "mobility-data-ts" file and found some serious errors. Here is an example chunk:

| State | County | seg | date | Value | Value from report |
|---|---|---|---|---|---|
| Ohio | Lake | Parks | 29.03.2020 | -42.665 | 146 |
| Ohio | Stark | Parks | 29.03.2020 | -17.958 | 167 |
| Ohio | Summit | Parks | 29.03.2020 | -42.459 | 139 |
| Ohio | Summit | Retail & recreation | 29.03.2020 | 139.002 | -42 |
| Ohio | Lawrence | Workplace | 29.03.2020 | 145.589 | -25 |
And, unfortunately, these are far from all the mistakes. Please try to fix these issues, because in its current state it's impossible to use this data for analysis. Thanks!
Thanks for the discovery @ActiveConclusion. We do need to fix the errors. I will make a note on my dashboard as well. I will also re-create my archive, as it is broken out by day for easier processing by D3 on the dashboard. I also wonder, @jwhendy, if the issue was on your end, coming in from https://github.com/datasciencecampus/mobility-report-data-extractor/. If so, we should bring them into the discussion; it could be a bug that has been fixed on their end as well. I know they have been making updates. I can post on their repo. I wonder if re-running now with the original svgs might fix it. Do you have those @jwhendy?
It's probably me, but I haven't tracked it down yet. Just started looking. I checked Alabama and some counties from Wyoming, but obviously checking every state/county is tedious. It is indeed helpful to have an example of where to look.
I do still have the svgs all downloaded.
Great, thanks. Let me know if I can help. In response to @dafriedman97: I noted that datasciencecampus's tool now directly downloads and cuts the reports into csvs. I think there is new data as well, and I am hoping it still has the original data too. I may give that a try, but I need to figure out a way to best merge new with old while preserving both.
Ok, I think I know why this is happening, but please continue to check. It's this known issue, which was open when I did this; now it's closed, so it might be fixed (just asked).
So, first note that the PDF value for a county should be the last point in the raw data (2020-03-29, as you've shown). Looking at my aggregate data, which pulls those numbers directly from the PDFs, I have the correct value:
```
print(df_save[(df_save.state=='Ohio') & (df_save.county=='Lake')])
       state county                  seg  conf  value    i                        path
11274   Ohio   Lake  Retail & recreation   1.0  -47.0  259  output/US-Ohio/csv/259.csv
11275   Ohio   Lake   Grocery & pharmacy   1.0  -22.0  260  output/US-Ohio/csv/260.csv
11276   Ohio   Lake                Parks   1.0  146.0  261  output/US-Ohio/csv/261.csv
11277   Ohio   Lake     Transit stations   1.0  -43.0  262  output/US-Ohio/csv/262.csv
11278   Ohio   Lake            Workplace   1.0  -32.0  263  output/US-Ohio/csv/263.csv
11279   Ohio   Lake          Residential   1.0   10.0  264  output/US-Ohio/csv/264.csv
```
Pulling out Parks specifically, this is `US-Ohio/csv/261.csv`, which has:

```
-24.095,2020-03-25,261
-24.69,2020-03-26,261
-37.077,2020-03-27,261
-52.746,2020-03-28,261
-42.665,2020-03-29,261
```
It's not always this simple, as Summit County somehow switches retail and parks, but I suspect that a delta of >80% is always involved in these. I'll fetch/merge and re-run to see if I get better results on these edge cases now that the linked bug is closed.
So, I'm matching up the file correctly and pulling out that value; it's just that the value is incorrect due to the bug.
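For anyone who wants to reproduce the check, this is roughly the comparison (a sketch; `df_save` is the aggregate frame printed above, and the per-plot csvs are assumed to be headerless `value,date,plot_id` rows):

```python
import pandas as pd

# df_save holds one row per plot: state, county, seg, conf,
# value (the PDF headline), and path to the extracted ts csv
for row in df_save.itertuples():
    ts = pd.read_csv(row.path, header=None, names=["value", "date", "plot_id"])
    vals = ts["value"].dropna()
    if vals.empty:
        continue  # some low-confidence plots are all NaN
    # headlines are rounded integers, so allow a small tolerance
    if abs(vals.iloc[-1] - row.value) > 1:
        print(row.state, row.county, row.seg, vals.iloc[-1], row.value)
```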
Fetched/merged. Re-downloading/generating everything. Going to stick with 2020-03-29 just to see if this fixes the issues above, then update to include what looks to be data through April 05.
Ok, still broken. The test case works, but I think it's because, as of April 05, it's no longer over the 80% value that gives their svg extractor problems:
```
       state county                  seg  conf  value    i                        path
10812   Ohio   Lake  Retail & recreation   1.0  -48.0  259  output/US-Ohio/csv/259.csv
10813   Ohio   Lake   Grocery & pharmacy   1.0  -19.0  260  output/US-Ohio/csv/260.csv
10814   Ohio   Lake                Parks   1.0   54.0  261  output/US-Ohio/csv/261.csv
10815   Ohio   Lake     Transit stations   1.0  -45.0  262  output/US-Ohio/csv/262.csv
10816   Ohio   Lake            Workplace   1.0  -34.0  263  output/US-Ohio/csv/263.csv
10817   Ohio   Lake          Residential   1.0   11.0  264  output/US-Ohio/csv/264.csv
```
End of the time series for Ohio, Lake County, Parks (so the 54% overall change is correct):

```
39  Ohio  Lake  Parks  02/04/2020   90.298
40  Ohio  Lake  Parks  03/04/2020   92.537
41  Ohio  Lake  Parks  04/04/2020  153.896
42  Ohio  Lake  Parks  05/04/2020   53.676
```
Those are now < 80%, so Lake is no longer a good test case. Ohio's Lucas County is one:
```
       state county                  seg  conf  value    i                        path
10842   Ohio  Lucas  Retail & recreation   1.0  -49.0  289  output/US-Ohio/csv/289.csv
10843   Ohio  Lucas   Grocery & pharmacy   1.0  -18.0  290  output/US-Ohio/csv/290.csv
10844   Ohio  Lucas                Parks   1.0  111.0  291  output/US-Ohio/csv/291.csv
10845   Ohio  Lucas     Transit stations   1.0  -19.0  292  output/US-Ohio/csv/292.csv
10846   Ohio  Lucas            Workplace   1.0  -36.0  293  output/US-Ohio/csv/293.csv
10847   Ohio  Lucas          Residential   1.0   10.0  294  output/US-Ohio/csv/294.csv
```
And the time series for Ohio, Lucas, Parks:

```
38  Ohio  Lucas  Parks  01/04/2020   19.385
39  Ohio  Lucas  Parks  02/04/2020  161.953
40  Ohio  Lucas  Parks  03/04/2020  151.219
41  Ohio  Lucas  Parks  04/04/2020   89.579
42  Ohio  Lucas  Parks  05/04/2020  110.794
```
So, I think we're good. Keep in mind that combos with `conf==0` are still not trustworthy. Look at Ohio, Mahoning as an example: it has 116% in the PDF, but the last values are all NaN in the csv, and most are nowhere close to 116, so I'm not sure how they calculated that. Or they're still being extracted badly by the code.
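If you want to be conservative, something like this would drop those combos (a sketch, reusing `df_save` and `df_ts` from above; I'm assuming the two frames share the state/county/seg keys):

```python
# conf lives in the aggregate table; join it onto the time series,
# then keep only combos extracted with full confidence
flags = df_save[["state", "county", "seg", "conf"]]
df_trusted = df_ts.merge(flags, on=["state", "county", "seg"])
df_trusted = df_trusted[df_trusted["conf"] == 1.0]
```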
Also, we now have some missing states. I created issues at the extractor github for these.
Will upload updated .csvs shortly and let you know when they're live to pull from.
Data is uploaded to my repo. If @ActiveConclusion and @ladew222 could spot check to spread out the effort, that would be awesome.
Thanks, I will take a look, hopefully today. I decided to append the FIPS ID info in R before uploading. Once that's verified, I will re-run it, upload the files, and take away the note in the about section.
Do you have that somewhere? I was just bitten by the NY Times COVID data vs. Census data county naming. For instance, I found that one of them had "Virginia Beach" and the other had "Virginia Beach city". Do you have a robust way of handling that? I did find fips to be more reliable, but not all datasets I'm working with have it... That said, state+county -> fips is non-trivial if county naming is ambiguous.
I have some code in R and some in javascript that does the work. I use javascript because I get my covid data from Tom Quisel, who sources from CBS, which does not have fips. But for the google data, I had to split it up into multiple files in R, so I added the fips in there as well to save resources in JS. I am not sure I have the perfect way to do this fuzzy merge. What I am currently doing is using a file that has county name, state, and fips, with the county name stored without "county" in it. I then match on state and search for the short name string in the county. But I am interested in improving accuracy here, and the Times may give you that. It looks like the Times uses the fips for the county and then the state name; you would need a lookup table to convert. I can send you the one I have. You may also have to convert some numbers into 3-digit versions if your fips codes work that way. I use nhgis, which uses a geoid that is fips plus some extra that needs to be stripped out.
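In Python terms, the same fuzzy merge might look roughly like this (the lookup file name and its columns are hypothetical stand-ins for the county/state/fips file described above):

```python
import pandas as pd

data = pd.read_csv("mobility-data-ts.csv")
# hypothetical lookup file with columns: state, county_short, fips
lookup = pd.read_csv("county_fips_lookup.csv", dtype={"fips": str})

def normalize(name: str) -> str:
    # strip suffixes so "Lake County" and "Lake" compare equal
    return (name.lower()
                .replace(" county", "")
                .replace(" parish", "")
                .strip())

data["county_key"] = data["county"].map(normalize)
lookup["county_key"] = lookup["county_short"].map(normalize)

merged = data.merge(lookup[["state", "county_key", "fips"]],
                    on=["state", "county_key"], how="left")
```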
Update: it didn't initially dawn on me that the release of new data (through 2020-04-05) would bring about a split in coverage periods. I've re-run my extraction code and now have date-based snapshots in my repo.
@ladew222 this is a breaking change to the naming convention if your dashboard is pulling from a URL!
New structure:
- `mobility-data-agg.csv` and `mobility-data-ts.csv` are removed now
- `mobility-data-agg_{date}`: snapshot of the latest mobility delta as of {date}
- `mobility-data-ts_{date}`: time series mobility data for the ~6 weeks prior to {date}
- `mobility-data-ts_all`: time series data combining both periods. The first release covered 2020-02-16 to 2020-03-29; the next covered 2020-02-23 to 2020-04-05. Since these overlap, I took only 2020-02-16 to 2020-02-22 from the first set, in the event that the overlapping period had been updated/made more accurate in the more recent data. I tried both ways, and `df1.equals(df2)` was false... so there were changes, but I didn't dig into what was different. Just being transparent about how these were made; the raw ts data is there if you'd like to dig further. A sketch of the splice follows this list.
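A sketch of how that splice could be done, assuming the `{date}` file naming above and that both files share the same columns:

```python
import pandas as pd

old = pd.read_csv("mobility-data-ts_2020-03-29.csv", parse_dates=["date"])
new = pd.read_csv("mobility-data-ts_2020-04-05.csv", parse_dates=["date"])

# keep only the dates the newer release doesn't cover (2020-02-16..22),
# then take the newer release in full for everything else
head = old[old["date"] < new["date"].min()]
df_all = (pd.concat([head, new], ignore_index=True)
            .sort_values(["state", "county", "seg", "date"]))
df_all.to_csv("mobility-data-ts_all.csv", index=False)
```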
Thanks. I do not pull directly, as I split it into files based on date so the JS can handle it better. I will go through a process to split the new stuff. I was thinking this would happen. Thanks for documenting the change. I will start importing knowing this is the case.
@jwhendy I checked the file "mobility-data-ts_2020-03-29.csv"; below is a small chunk of errors:

| State | County | seg | Date | Value | Value from report |
|---|---|---|---|---|---|
| Minnesota | Sherburne | Residential | 29.03.2020 | -48.562 | 20 |
| Wisconsin | Marathon | Parks | 29.03.2020 | 10.639 | -50 |
| Wisconsin | Jefferson | Transit stations | 29.03.2020 | -12.283 | -62 |
| Wisconsin | Ozaukee | Parks | 29.03.2020 | -40.903 | 11 |
| California | Solano | Transit stations | 29.03.2020 | 9.116 | -37 |
@ActiveConclusion I'm not sure how to tackle these. Following up on two of them:
```
16454  Wisconsin  Marathon  Parks  1.0  -50.0  219  output/US-Wisconsin_2020-03-29/csv/219.csv

$ tail output/US-Wisconsin_2020-03-29/csv/219.csv
10.639,2020-03-29,219
```

```
1257  California  Solano  Transit stations  1.0  -37.0  274  output/US-California_2020-03-29/csv/274.csv

$ tail output/US-California_2020-03-29/csv/274.csv
9.116,2020-03-29,274
```
So, the data is correct, in a sense. It's a question of whether the csvs are accurate and/or whether the county/segment ordering I derive is somehow off.
I was going to post an issue at the mobility report extractor, but I'm not sure it's worth it. They have a `full` option now, which basically does this. It's quite slow vs. mine, so I avoided it, but it automatically lists any differences between the ts data and the aggregate number, and CA only features 1 (with Solano being correct at ~-37). I'm just going to abandon my homebrew data and re-run using their `full` script. Will post back when that's done.
Thanks for looking into this. Once you have the data, I will work on pulling it in. I have the second pull in the system now. You can now display scatterplots on top of the map, as there was a request for that feature, and it makes some sense when looking at the mobility data, since the real goal of isolation is to slow the curve's growth. As such, you can now look at the curve and a log of the curve at the state and national level.
Ok, fixed this up. tl;dr: everything is based on the output of `python mobius.py full` now. Files are updated in the same place.
While I was at it, I checked the deltas myself (checking the headline value against the last time series value for each state/county/segment). I had two findings:
- As a fluke, I originally checked the first headline vs. the last headline value, and some were mismatched. I found out that these were due to some cities in the data having the same name as a county (Baltimore and Baltimore County both have entries). I tracked this down to `['Baltimore', 'St. Louis', 'Fairfax', 'Franklin', 'Richmond', 'Roanoke']` and removed all of those. Top-level state data is also removed.
- I also found genuine mismatches, but they are reduced (~100 for each date), and I think upstream is aware, as they print out the following when running `mobius.py full`:
```
Extracting plot summaries: 300it [00:22, 13.53it/s]
Extracting data from SVG plots: 100%|█████████████████| 300/300 [00:00<00:00, 328.97it/s]
Plots with data: 164
Plots where last point doesn't match headline: 5
|                                                                         |   value |   headline |
|:------------------------------------------------------------------------|--------:|-----------:|
| ('West Virginia April 5, 2020', 'Boone County', 'Workplace')            | -22.503 |        -22 |
| ('West Virginia April 5, 2020', 'Jackson County', 'Workplace')          | -20.454 |        -29 |
| ('West Virginia April 5, 2020', 'Logan County', 'Grocery & pharmacy')   |   2.504 |          2 |
| ('West Virginia April 5, 2020', 'Ohio County', 'Transit stations')      |   -12.5 |        -13 |
| ('West Virginia April 5, 2020', 'Wayne County', 'Retail & recreation')  |  -2.497 |         -3 |
Plots where last point is more than 5 away: 1
|                                                                 |   value |   headline |
|:----------------------------------------------------------------|--------:|-----------:|
| ('West Virginia April 5, 2020', 'Jackson County', 'Workplace')  |     -20 |        -29 |
Saved full results to ./output/US-West_Virginia_2020-04-05.csv
```
My homebrew solution had 600+ of these cases, so this is an improvement. For reference, I included two "errata" files with the last time series value per combo (`value`), the report value (`headline`), and the absolute value of the difference (`abs_delta`). If you wanted, you could match these up with the corresponding time series file and filter out any state/county/seg combos that have more than a certain error threshold, as sketched below.
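For example (a sketch; the errata file name here is a hypothetical stand-in, and I'm assuming state/county/seg are the join keys):

```python
import pandas as pd

ts = pd.read_csv("mobility-data-ts_2020-04-05.csv")
# hypothetical errata file name; columns per the description above
errata = pd.read_csv("mobility-data-errata_2020-04-05.csv")

# drop any state/county/seg combo whose last ts point is >5 off the headline
bad = errata.loc[errata["abs_delta"] > 5, ["state", "county", "seg"]]
clean = (ts.merge(bad, on=["state", "county", "seg"],
                  how="left", indicator=True)
           .query("_merge == 'left_only'")
           .drop(columns="_merge"))
```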
Thanks. I will download tonight and build a process to merge the old and new timeframes. I will probably start with the newest, then visit older files and grab what the newer files do not have. The interface has also gone through changes. There is now an option to overlay a scatterplot for timeframes, and improved help documents in response to some user feedback.
The `mobility-data-ts_all.csv` file merges both sets (it spans 2020-02-16 through 2020-04-05). It opts for all of the 2020-04-05 dataset (2020-02-23 through 2020-04-05) and prepends the 2020-03-29 set for the earlier dates. Feel free to grab that directly if it's easier.
Great. Perfect. Thanks. I got the file. I will let you know when I have balefire.info updated. Should not take too long.
The files have been split and uploaded. I am going to change the date limits to match the new data. Currently, I can only go back to 3-14, as that is when Tom Quisel's data starts. They are discussing backfilling their repository with NY Times data; when that's done, I will be able to go back further.
@ladew222 which dashboard has this?

> There is now an option to overlay a scatterplot for timeframes and improved help documents in response to some user feedback.

Is yours the balefire one?
@jwhendy, thanks so much for doing this! One quick question: it seems that Washington DC is missing from the ts files. Is there any way to add it?
Absolutely. As an FYI, yesterday the ability to do multiple lines on one plot was also added. You can look at multiple states or counties, although that gets messy, as it does all counties in a state. Thanks for the catch. I will look through the merge, locate the problem, and fix it.
Well, since the issue is open in my repository, it's time for me to get it over with. There will be no time series parser in this repository, because I am absolutely dissatisfied with the accuracy of my parser and of other such parsers on GitHub. I think we will soon see this data from Google in a proper form, since this is currently only an early release of the reports. As for the Apple reports, they are already time series, so there is no issue there.
@grigopop I had a look at this. It's because of this comment:

> Time series data for each county. This does not have the US country data (country total, per-state levels).

The way the data is fetched/stored, these state-level summaries occur when `state==county` (or, in the column names of the raw data, when `country==region`). Because DC is treated like this, my filter to remove state summaries also removed DC. I didn't have an obvious way to deal with this.
I've opted for the second route. To get around this, I uploaded `-states` versions of each of my existing csv files; DC is included in those. If you want to bring DC in like a county, just open both and append `df_states_ts[df_states_ts['state'] == 'District of Columbia']` to the main ts data.
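In pandas terms, roughly (the `-states` file name below is my guess at the pattern; adjust to whichever pair of files you grabbed):

```python
import pandas as pd

df_ts = pd.read_csv("mobility-data-ts_2020-04-05.csv")
df_states_ts = pd.read_csv("mobility-data-ts_2020-04-05-states.csv")

# DC sits in the states file because state == county for it;
# pull it out and append it like any other county
dc = df_states_ts[df_states_ts["state"] == "District of Columbia"]
df_ts = pd.concat([df_ts, dc], ignore_index=True)
```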
Edit: @ladew222 and @grigopop, if you can think of a better way to merge all of this data into one, or to deal with the state summary vs. county value issue, I'm open to suggestions!
@ActiveConclusion I wouldn't feel bad about this. In checking after you found errors in my data, there were ~700 state/county/seg combinations that didn't agree. Using the upstream parser (`mobius.py full`), this is down to ~120 (published in the errata files I mentioned above). That amounts to 120 errors in almost 17k entries (~0.7%). It's a frustrating issue, but I think/hope that overall this data is still very usable.
I agree. I am running into some problems of my own with the data, which I believe are on my end. For some reason the choropleth is losing variation on the newer data (confirmed) when NY is included. Something with high relative numbers, I am guessing. I may put a note up for current users of the site. State by state is fine.
Finally, Google published a CSV file with time series!
Bummer, no asterisk column, so you can't remove locations with low confidence.
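For reference, pulling county-level rows out of Google's published file looks roughly like this (URL and column names as published at the time; both may change):

```python
import pandas as pd

url = "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv"
gm = pd.read_csv(url, low_memory=False)

# US county rows carry a value in sub_region_2; state-level rows leave it blank
us_counties = gm[(gm["country_region_code"] == "US")
                 & gm["sub_region_2"].notna()]
```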
Thanks so much for posting this! I'm hoping to track mobility by date... the graphs Google released clearly show trends by county by date. Do you know of any way to scrape or otherwise obtain that information? Thanks