Thanks here as well. I am looking at bringing it into the data explorer that is currently being built. Time series would be great, but some of this could be achieved by regularly pulling the data, I imagine. There is a project working on the time series here: https://github.com/datasciencecampus/mobility-report-data-extractor. It does not have location information attached to the exports, however, which makes it hard to join.
I didn't test this code, but the author claims that his script allows you to collect data from graphs.
I'll try it out and let you know. Thanks so much!
The only downside is that it requires manual download. Also, the first group is now looking to add place names as well, but those are manual downloads too. I am using this data in my dashboard currently but would love to have time series eventually.
@ladew222 followed you here from the other issue. Would you like a big ol' csv of all US county data? I got as far as extracting the county names and order, which lets me associate them to the correct `./output/US-*/csv/n.csv`. If I looped through and merged them all, I could just share that .csv directly...
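Roughly what I have in mind, as a sketch; the `county_index` mapping here is a hypothetical stand-in for the county names/order I extracted, and I'm assuming the extractor's csvs are headerless `value,date,plot_id` rows:

```python
import pandas as pd
from pathlib import Path

# Hypothetical mapping derived from the extracted county names/order:
# (state, county, segment) -> plot index n for ./output/US-<state>/csv/n.csv
county_index = {
    ("Ohio", "Lake", "Parks"): 261,  # example entry
}

frames = []
for (state, county, seg), n in county_index.items():
    path = Path(f"output/US-{state}/csv/{n}.csv")
    # extractor csvs look like: value,date,plot_id (no header)
    df = pd.read_csv(path, header=None, names=["value", "date", "plot_id"])
    df["state"], df["county"], df["seg"] = state, county, seg
    frames.append(df)

df_ts = pd.concat(frames, ignore_index=True)
df_ts.to_csv("mobility-data-ts.csv", index=False)
```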
That would be amazing @jwhendy. It would be great to have time series as my tool uses correlations based on COVID data from https://github.com/tomquisel/covid19-data as well as census data--and users would be able to see how this behavior is correlating on particular dates at the state and county level.
I didn't wait :) Working on it now, though I had a glitch I caught in multi-word place names (US-New, US-South).
Interesting to hear what you're doing... my brother and I just started putzing around with some analysis as well. I had very similar ideas, wondering if there were trends that could be identified. Our basic aim was to create an "anomaly" or "outlier" detection system. That way, counties with similar characteristics could be compared, and perhaps local officials could be alerted when things looked "much worse" than in similar counties.
To do that, I realized, one needs at least some understanding of what a county should look like, which was tough. I've been bringing in census data as well. So far I have age, income, population, and land area (the latter two allowing us to calculate population density). % mobility seemed to be the next logical thing.
Honestly, if that fails, I think I'll concede that our data/testing situation is so bad that nothing can be said at all. As a test case: if we can't differentiate between similar counties in SF (very early stay-at-home) and, say, FL (April 3)... then either staying at home has zero effect, the model is so complex that even this intuitively dominant variable isn't predictive, or the data is utter garbage.
Thoughts? And if we're kind of doing the same thing (overall visualizer/explorer)... interest in working on that together?
Heads up that this includes the summary rows (`df.county=='summary'`) and the states themselves (`df.area=='United States'`), so there's redundancy in that set. Let me know if you find any errors. Currently not verified, but the plots are cool :) Look at the dual modes of residential and workplace, which might indicate those are the most robust indicators of what people are doing?
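If you only want county rows, a filter along these lines should do it (a sketch against the column names mentioned above):

```python
import pandas as pd

df = pd.read_csv("mobility-data-ts.csv")

# drop the per-state summary rows and the state-level rows,
# leaving one row per county/segment/date
df_counties = df[(df.county != "summary") & (df.area != "United States")]
```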
Cool. Just read your message. We have similar goals, although your work is more statistically grounded. I did some early analysis in R, but then moved in the direction of visualization for general consumption, which led me to use D3. It allows users to view the strength of correlation using a Pearson JS library, but it is not comprehensive in that regard. If the stats got more intense I would most likely need a Python backend server, which I use in other cases. Basically this visual is a way for people to look at a choropleth and visualize a bunch of variables that we are all thinking about. But I think your work is a complement, as it is designed for different purposes and a different audience. Great idea. I will take a look. Thanks.
I think I'm doing something similar, basically trying to visualize various relationships to see what matters. I think I'm just doing that via scatter and line plots vs. your maps. Overall, my "statistical" approach is rather blind hunting for obvious correlations.
So far, they aren't really jumping out at me. I think if any visualizer (or statistical analysis) is useful (vs. just being interesting to look at), it needs to provide some explanation for, say, CA vs. FL. Or NY vs. others.
If the goal is just to see a dashboard, the above is probably irrelevant. If the goal is some sort of insight, however, I thought it best to start at the extremes (low-hanging fruit) and see what variables help make sense of these things. Admittedly the data might not exist... like which data set contains "number of huge COVID-infested cruise ships docking per state"?!
Cool dashboards, btw! I hadn't actually looked, but just did.
Thanks. And let me know if you see anything that does not make sense or if something is missing. BTW, do you have a non-summarized version of the data you extracted, where the values for each county on a given date can be discerned? I would need this to integrate it into the county-level map. If not, I can go back and work on that one.
Hmmm. So you mean one entry per county per category per date? If so, does this seem like what you want? (I admit I had a filter on the wrong column name which didn't remove state-level data, but the county-level stuff was still there. Edit: that's fixed now; summary rows and overall US/state data are removed.)
```
df_ts.head(10)
     state   county                  seg        date  value
0  Alabama  Autauga  Retail & recreation  2020-02-16  0.173
1  Alabama  Autauga  Retail & recreation  2020-02-17  7.732
2  Alabama  Autauga  Retail & recreation  2020-02-18 -1.675
3  Alabama  Autauga  Retail & recreation  2020-02-19 -1.663
4  Alabama  Autauga  Retail & recreation  2020-02-20 -7.739
5  Alabama  Autauga  Retail & recreation  2020-02-21 -2.761
6  Alabama  Autauga  Retail & recreation  2020-02-22  0.982
7  Alabama  Autauga  Retail & recreation  2020-02-23 -0.172
8  Alabama  Autauga  Retail & recreation  2020-02-24 -2.062
9  Alabama  Autauga  Retail & recreation  2020-02-25  9.883
```
```
df_ts.tail(10)
      state  county          seg        date  value
33  Wyoming  Weston  Residential  2020-03-20    NaN
34  Wyoming  Weston  Residential  2020-03-21    NaN
35  Wyoming  Weston  Residential  2020-03-22    NaN
36  Wyoming  Weston  Residential  2020-03-23    NaN
37  Wyoming  Weston  Residential  2020-03-24    NaN
38  Wyoming  Weston  Residential  2020-03-25    NaN
39  Wyoming  Weston  Residential  2020-03-26    NaN
40  Wyoming  Weston  Residential  2020-03-27    NaN
41  Wyoming  Weston  Residential  2020-03-28    NaN
42  Wyoming  Weston  Residential  2020-03-29    NaN
```
(Missing values because this county is one with little data.)
Let me know if I misunderstand.
Yes. That would be perfect.
File is linked above! It's available for you :)
Thanks for the contribution! I will let you know when I have it in the dashboard.
@jwhendy thanks so much for posting! Do you plan to keep this up to date? Sounds like google updated the mobility reports. Thanks!
Thanks for the work, mates! But I recently explored the "mobility-data-ts" file and found some serious errors. Here is an example chunk:

| State | County | seg | date | Value | Value from report |
|---|---|---|---|---|---|
| Ohio | Lake | Parks | 29.03.2020 | -42.665 | 146 |
| Ohio | Stark | Parks | 29.03.2020 | -17.958 | 167 |
| Ohio | Summit | Parks | 29.03.2020 | -42.459 | 139 |
| Ohio | Summit | Retail & recreation | 29.03.2020 | 139.002 | -42 |
| Ohio | Lawrence | Workplace | 29.03.2020 | 145.589 | -25 |
And, unfortunately, these are far from all the mistakes. Please try to fix these issues, because in its current state it's impossible to use this data for analysis. Thanks!
Thanks for the discovery @ActiveConclusion. We do need to fix the errors. I will make a note on my dashboard as well. I will also re-create my archive, as it is broken out by day for easier processing by D3 on the dashboard. I also wonder, @jwhendy, if the issue was on your end, coming in from https://github.com/datasciencecampus/mobility-report-data-extractor/. If so, we should bring them into the discussion; it could be a bug that has been fixed on their end as well. I know they have been making updates. I can post on their repo. I wonder if re-running now with the original svgs might fix it. Do you have those @jwhendy?
It's probably me, but I haven't tracked it down yet. Just started looking. I checked Alabama and some counties from Wyoming, but obviously checking every state/county is tedious. It is indeed helpful to have an example of where to look.
I do still have the svgs all downloaded.
Great, thanks. Let me know if I can help. In response to @dafriedman97: I noted that datasciencecampus's tool now directly downloads and cuts the reports into csvs. I think there is new data as well, and I am hoping it still has the original data too. I may give that a try, but I need to figure out a way to best merge new with old while preserving both.
Ok, I think I know why this is happening, but please continue to check. It's this known issue, which was open when I did this; now it's closed, so it might be fixed (just asked).
So, first note that the PDF value for a county should be the last point in the raw data (2020-03-29, as you've shown). Looking at my aggregate data, which pulls those numbers directly from the PDFs, I have the correct value:
```
print(df_save[(df_save.state=='Ohio') & (df_save.county=='Lake')])
       state county                  seg  conf  value    i                        path
11274   Ohio   Lake  Retail & recreation   1.0  -47.0  259  output/US-Ohio/csv/259.csv
11275   Ohio   Lake   Grocery & pharmacy   1.0  -22.0  260  output/US-Ohio/csv/260.csv
11276   Ohio   Lake                Parks   1.0  146.0  261  output/US-Ohio/csv/261.csv
11277   Ohio   Lake     Transit stations   1.0  -43.0  262  output/US-Ohio/csv/262.csv
11278   Ohio   Lake            Workplace   1.0  -32.0  263  output/US-Ohio/csv/263.csv
11279   Ohio   Lake          Residential   1.0   10.0  264  output/US-Ohio/csv/264.csv
```
Pulling out Parks specifically, this is `US-Ohio/csv/261.csv`, which has:

```
-24.095,2020-03-25,261
-24.69,2020-03-26,261
-37.077,2020-03-27,261
-52.746,2020-03-28,261
-42.665,2020-03-29,261
```
It's not always this simple, as Summit County somehow switches retail and parks, but I suspect that a delta of >80% is always involved in these. I'll fetch/merge and re-run to see if I get better results on these edge cases now that the linked bug is closed.
So, I'm matching up the file correctly and pulling out that value; it's just that the value is incorrect due to the bug.
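For anyone who wants to reproduce the check, this is roughly the comparison (a sketch; `df_save` is the aggregate frame printed above, and the per-plot csvs are assumed to be headerless `value,date,plot_id` rows):

```python
import pandas as pd

# df_save holds one row per plot: state, county, seg, conf,
# value (the PDF headline), and path to the extracted ts csv
for row in df_save.itertuples():
    ts = pd.read_csv(row.path, header=None, names=["value", "date", "plot_id"])
    vals = ts["value"].dropna()
    if vals.empty:
        continue  # some low-confidence plots are all NaN
    # headlines are rounded integers, so allow a small tolerance
    if abs(vals.iloc[-1] - row.value) > 1:
        print(row.state, row.county, row.seg, vals.iloc[-1], row.value)
```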
Fetched/merged. Re-downloading/generating everything. Going to stick with 2020-03-29 just to see if this fixes the issues above, then update to include what looks to be data through April 05.
Ok, still broken. The test case works, but I think it's because, as of April 05, it's no longer over the 80% value that gives their svg extractor problems:
```
       state county                  seg  conf  value    i                        path
10812   Ohio   Lake  Retail & recreation   1.0  -48.0  259  output/US-Ohio/csv/259.csv
10813   Ohio   Lake   Grocery & pharmacy   1.0  -19.0  260  output/US-Ohio/csv/260.csv
10814   Ohio   Lake                Parks   1.0   54.0  261  output/US-Ohio/csv/261.csv
10815   Ohio   Lake     Transit stations   1.0  -45.0  262  output/US-Ohio/csv/262.csv
10816   Ohio   Lake            Workplace   1.0  -34.0  263  output/US-Ohio/csv/263.csv
10817   Ohio   Lake          Residential   1.0   11.0  264  output/US-Ohio/csv/264.csv
```
End of the time series for Ohio, Lake County, Parks (so the 54% overall change is correct):

```
39  Ohio  Lake  Parks  02/04/2020   90.298
40  Ohio  Lake  Parks  03/04/2020   92.537
41  Ohio  Lake  Parks  04/04/2020  153.896
42  Ohio  Lake  Parks  05/04/2020   53.676
```
Those are now < 80%, so Lake is no longer a good test case. Ohio's Lucas County is one:
```
       state county                  seg  conf  value    i                        path
10842   Ohio  Lucas  Retail & recreation   1.0  -49.0  289  output/US-Ohio/csv/289.csv
10843   Ohio  Lucas   Grocery & pharmacy   1.0  -18.0  290  output/US-Ohio/csv/290.csv
10844   Ohio  Lucas                Parks   1.0  111.0  291  output/US-Ohio/csv/291.csv
10845   Ohio  Lucas     Transit stations   1.0  -19.0  292  output/US-Ohio/csv/292.csv
10846   Ohio  Lucas            Workplace   1.0  -36.0  293  output/US-Ohio/csv/293.csv
10847   Ohio  Lucas          Residential   1.0   10.0  294  output/US-Ohio/csv/294.csv
```
And the time series for Ohio, Lucas, Parks:

```
38  Ohio  Lucas  Parks  01/04/2020   19.385
39  Ohio  Lucas  Parks  02/04/2020  161.953
40  Ohio  Lucas  Parks  03/04/2020  151.219
41  Ohio  Lucas  Parks  04/04/2020   89.579
42  Ohio  Lucas  Parks  05/04/2020  110.794
```
So, I think we're good. Keep in mind that combos with `conf==0` are still not trustworthy. Look at Ohio, Mahoning as an example: it has 116% in the PDF, but the last values are all NaN in the csv, and most are nowhere close to 116, so I'm not sure how they calculated that. Or they're still being extracted badly by the code.
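If you want to be conservative, something like this would drop those combos (a sketch, reusing `df_save` and `df_ts` from above; I'm assuming the two frames share the state/county/seg keys):

```python
# conf lives in the aggregate table; join it onto the time series,
# then keep only combos extracted with full confidence
flags = df_save[["state", "county", "seg", "conf"]]
df_trusted = df_ts.merge(flags, on=["state", "county", "seg"])
df_trusted = df_trusted[df_trusted["conf"] == 1.0]
```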
Also, we now have some missing states. I created issues at the extractor github for these.
Will upload updated .csvs shortly and let you know when they're live to pull from.
Data is uploaded to my repo. If @ActiveConclusion and @ladew222 could spot check to spread out the effort, that would be awesome.
Thanks, I will take a look, hopefully today. I decided to append the FIPS ID info in R before uploading. Once that's verified, I will re-run it, upload the files, and take away the note in the about section.
Do you have that somewhere? I was just bitten by the NY Times COVID data vs. Census data county naming. For instance, I found that one of them had "Virginia Beach" and the other had "Virginia Beach city". Do you have a robust way of handling that? I did find fips to be more reliable, but not all datasets I'm working with have it... That said, state+county -> fips is non-trivial if county naming is ambiguous.
I have some code in R and some in javascript that does the work. I use javascript because I get my covid data from Tom Quisel, who sources from CBS, which does not have fips. But for the google data, I had to split it up into multiple files in R, so I added the fips in there as well to save resources in JS. I am not sure I have the perfect way to do this fuzzy merge. What I am currently doing is using a file that has county name, state, and fips, with the county name stored without "county" in it. I then match on state and search for the short name string in the county. But I am interested in improving accuracy here, and the Times may give you that. It looks like the Times uses the fips for the county and then the state name; you would need a lookup table to convert. I can send you the one I have. You may also have to convert some numbers into 3-digit versions if your fips codes work that way. I use nhgis, which uses a geoid that is fips plus some extra that needs to be stripped out.
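In Python terms, the same fuzzy merge might look roughly like this (the lookup file name and its columns are hypothetical stand-ins for the county/state/fips file described above):

```python
import pandas as pd

data = pd.read_csv("mobility-data-ts.csv")
# hypothetical lookup file with columns: state, county_short, fips
lookup = pd.read_csv("county_fips_lookup.csv", dtype={"fips": str})

def normalize(name: str) -> str:
    # strip suffixes so "Lake County" and "Lake" compare equal
    return (name.lower()
                .replace(" county", "")
                .replace(" parish", "")
                .strip())

data["county_key"] = data["county"].map(normalize)
lookup["county_key"] = lookup["county_short"].map(normalize)

merged = data.merge(lookup[["state", "county_key", "fips"]],
                    on=["state", "county_key"], how="left")
```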
Update: it didn't initially dawn on me that the release of new data (through 2020-04-05) would bring about a split in coverage periods. I've re-run my extraction code and now have date-based snapshots in my repo.
@ladew222 this is a breaking change to the naming convention if your dashboard is pulling from a URL!
New structure:
- `mobility-data-agg.csv` and `mobility-data-ts.csv` are removed now
- `mobility-data-agg_{date}`: snapshot of the latest mobility delta as of {date}
- `mobility-data-ts_{date}`: time series mobility data for the ~6 weeks prior to {date}
- `mobility-data-ts_all`: time series data combining both periods. The first release covered 2020-02-16 to 2020-03-29; the next covered 2020-02-23 to 2020-04-05. Since these overlap, I took only 2020-02-16 to 2020-02-22 from the first set, in the event that the overlapping period had been updated/made more accurate in the more recent data. I tried both ways, and `df1.equals(df2)` was false... so there were changes, but I didn't dig into what was different. Just being transparent about how these were made; the raw ts data is there if you'd like to dig further. A sketch of the splice follows this list.
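A sketch of how that splice could be done, assuming the `{date}` file naming above and that both files share the same columns:

```python
import pandas as pd

old = pd.read_csv("mobility-data-ts_2020-03-29.csv", parse_dates=["date"])
new = pd.read_csv("mobility-data-ts_2020-04-05.csv", parse_dates=["date"])

# keep only the dates the newer release doesn't cover (2020-02-16..22),
# then take the newer release in full for everything else
head = old[old["date"] < new["date"].min()]
df_all = (pd.concat([head, new], ignore_index=True)
            .sort_values(["state", "county", "seg", "date"]))
df_all.to_csv("mobility-data-ts_all.csv", index=False)
```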
Thanks. I do not pull directly, as I split it into files based on date so the JS can handle it better. I will go through a process to split the new stuff. I was thinking this would happen. Thanks for documenting the change. I will start importing knowing this is the case.
@jwhendy I checked the file "mobility-data-ts_2020-03-29.csv"; below is a small chunk of errors:

| State | County | seg | Date | Value | Value from report |
|---|---|---|---|---|---|
| Minnesota | Sherburne | Residential | 29.03.2020 | -48.562 | 20 |
| Wisconsin | Marathon | Parks | 29.03.2020 | 10.639 | -50 |
| Wisconsin | Jefferson | Transit stations | 29.03.2020 | -12.283 | -62 |
| Wisconsin | Ozaukee | Parks | 29.03.2020 | -40.903 | 11 |
| California | Solano | Transit stations | 29.03.2020 | 9.116 | -37 |
@ActiveConclusion I'm not sure how to tackle these. Following up on two of them:
```
16454  Wisconsin  Marathon  Parks  1.0  -50.0  219  output/US-Wisconsin_2020-03-29/csv/219.csv

$ tail output/US-Wisconsin_2020-03-29/csv/219.csv
10.639,2020-03-29,219
```

```
1257  California  Solano  Transit stations  1.0  -37.0  274  output/US-California_2020-03-29/csv/274.csv

$ tail output/US-California_2020-03-29/csv/274.csv
9.116,2020-03-29,274
```
So, the data is correct, in a sense. It's a question of whether the csvs are accurate and/or whether the county/segment ordering I derive is somehow off.
I was going to post an issue at the mobility report extractor, but I'm not sure it's worth it. They have a `full` option now, which basically does this. It's quite slow vs. mine, so I avoided it, but it automatically lists any differences between the ts data and the aggregate number, and CA only features 1 (with Solano being correct at ~-37). I'm just going to abandon my homebrew data and re-run using their `full` script. Will post back when that's done.
Thanks for looking into this. Once you have the data, I will work on pulling it in. I have the second pull in the system now. You can now display scatterplots on top of the map, as there was a request for that feature, and it makes some sense when looking at the mobility data, since the real goal of isolation is to slow the curve's growth. As such, you can now look at the curve and a log of the curve at the state and national level.
Ok, fixed this up. tl;dr: everything is based on the output of `python mobius.py full` now. Files are updated in the same place.
While I was at it, I checked the deltas myself (checking the headline value against the last time series value for each state/county/segment). I had two findings:
- As a fluke, I originally checked the first headline vs. the last headline value, and some were mismatched. I found out that these were due to some cities in the data having the same name as a county (Baltimore and Baltimore County both have entries). I tracked this down to `['Baltimore', 'St. Louis', 'Fairfax', 'Franklin', 'Richmond', 'Roanoke']` and removed all of those. Top-level state data is also removed.
- I also found genuine mismatches, but they are reduced (~100 for each date), and I think upstream is aware, as they print out the following when running `mobius.py full`:
```
Extracting plot summaries: 300it [00:22, 13.53it/s]
Extracting data from SVG plots: 100%|█████████████████| 300/300 [00:00<00:00, 328.97it/s]
Plots with data: 164
Plots where last point doesn't match headline: 5
|                                                                         |   value |   headline |
|:------------------------------------------------------------------------|--------:|-----------:|
| ('West Virginia April 5, 2020', 'Boone County', 'Workplace')            | -22.503 |        -22 |
| ('West Virginia April 5, 2020', 'Jackson County', 'Workplace')          | -20.454 |        -29 |
| ('West Virginia April 5, 2020', 'Logan County', 'Grocery & pharmacy')   |   2.504 |          2 |
| ('West Virginia April 5, 2020', 'Ohio County', 'Transit stations')      |   -12.5 |        -13 |
| ('West Virginia April 5, 2020', 'Wayne County', 'Retail & recreation')  |  -2.497 |         -3 |
Plots where last point is more than 5 away: 1
|                                                                 |   value |   headline |
|:----------------------------------------------------------------|--------:|-----------:|
| ('West Virginia April 5, 2020', 'Jackson County', 'Workplace')  |     -20 |        -29 |
Saved full results to ./output/US-West_Virginia_2020-04-05.csv
```
My homebrew solution had 600+ of these cases, so this is an improvement. For reference, I included two "errata" files with the last time series value per combo (`value`), the report value (`headline`), and the absolute value of the difference (`abs_delta`). If you wanted, you could match these up with the corresponding time series file and filter out any state/county/seg combos that have more than a certain error threshold, as sketched below.
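For example (a sketch; the errata file name here is a hypothetical stand-in, and I'm assuming state/county/seg are the join keys):

```python
import pandas as pd

ts = pd.read_csv("mobility-data-ts_2020-04-05.csv")
# hypothetical errata file name; columns per the description above
errata = pd.read_csv("mobility-data-errata_2020-04-05.csv")

# drop any state/county/seg combo whose last ts point is >5 off the headline
bad = errata.loc[errata["abs_delta"] > 5, ["state", "county", "seg"]]
clean = (ts.merge(bad, on=["state", "county", "seg"],
                  how="left", indicator=True)
           .query("_merge == 'left_only'")
           .drop(columns="_merge"))
```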
Thanks. I will download tonight and build a process to merge the old and new timeframes. I will probably start with the newest, then visit older files and grab what the newer files do not have. The interface has also gone through changes. There is now an option to overlay a scatterplot for timeframes, and improved help documents in response to some user feedback.
The `mobility-data-ts_all.csv` file merges both sets (it spans 2020-02-16 through 2020-04-05). It opts for all of the 2020-04-05 dataset (2020-02-23 through 2020-04-05) and prepends the 2020-03-29 set for the earlier dates. Feel free to grab that directly if it's easier.
Great. Perfect. Thanks. I got the file. I will let you know when I have balefire.info updated. Should not take too long.
The files have been split and uploaded. I am going to change the date limits to match the new data. Currently, I can only go back to 3-14, as that is when Tom Quisel's data starts. They are discussing backfilling their repository with NY Times data; when that's done, I will be able to go back further.
@ladew222 which dashboard has this?

> There is now an option to overlay a scatterplot for timeframes and improved help documents in response to some user feedback.

Is yours the balefire one?
@jwhendy, thanks so much for doing this! One quick question: it seems that Washington DC is missing from the ts files. Is there any way to add it?
Absolutely. As an FYI, yesterday the ability to do multiple lines on one plot was also added. You can look at multiple states or counties, although that gets messy, as it does all counties in a state. Thanks for the catch. I will look through the merge, locate the problem, and fix it.
Well, since the issue is open in my repository, it's time for me to get it over with. There will be no time series parser in this repository, because I am absolutely dissatisfied with the accuracy of my parser and of other such parsers on GitHub. I think we will soon see this data from Google in a proper form, since this is currently only an early release of the reports. As for the Apple reports, they are already time series, so there is no issue there.
@grigopop I had a look at this. It's because of this comment:

> Time series data for each county. This does not have the US country data (country total, per-state levels).

The way the data is fetched/stored, these state-level summaries occur when `state==county` (or, in the column names of the raw data, when `country==region`). Because DC is treated like this, my filter to remove state summaries also removed DC. I didn't have an obvious way to deal with this.
I've opted for the second route. To get around this, I uploaded `-states` versions of each of my existing csv files; DC is included in those. If you want to bring DC in like a county, just open both and append `df_states_ts[df_states_ts['state'] == 'District of Columbia']` to the main ts data.
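In pandas terms, roughly (the `-states` file name below is my guess at the pattern; adjust to whichever pair of files you grabbed):

```python
import pandas as pd

df_ts = pd.read_csv("mobility-data-ts_2020-04-05.csv")
df_states_ts = pd.read_csv("mobility-data-ts_2020-04-05-states.csv")

# DC sits in the states file because state == county for it;
# pull it out and append it like any other county
dc = df_states_ts[df_states_ts["state"] == "District of Columbia"]
df_ts = pd.concat([df_ts, dc], ignore_index=True)
```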
Edit: @ladew222 and @grigopop, if you can think of a better way to merge all of this data into one, or to deal with the state summary vs. county value issue, I'm open to suggestions!
@ActiveConclusion I wouldn't feel bad about this. In checking after you found errors in my data, there were ~700 state/county/seg combinations that didn't agree. Using the upstream parser (`mobius.py full`), this is down to ~120 (published in the errata files I mentioned above). That amounts to 120 errors in almost 17k entries (~0.7%). It's a frustrating issue, but I think/hope that overall this data is still very usable.
I agree. I am running into some problems of my own with the data, which I believe are on my end. For some reason the choropleth is losing variation on the newer data (confirmed) when NY is included. Something with high relative numbers, I am guessing. I may put a note up for current users of the site. State by state is fine.
Finally, Google published a CSV file with time series!
Bummer, no asterisk column, so you can't remove locations with low confidence.
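For reference, pulling county-level rows out of Google's published file looks roughly like this (URL and column names as published at the time; both may change):

```python
import pandas as pd

url = "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv"
gm = pd.read_csv(url, low_memory=False)

# US county rows carry a value in sub_region_2; state-level rows leave it blank
us_counties = gm[(gm["country_region_code"] == "US")
                 & gm["sub_region_2"].notna()]
```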
Thanks so much for posting this! I'm hoping to track mobility by date... the graphs Google released clearly show trends by county by date. Do you know of any way to scrape or otherwise obtain that information? Thanks