jingjtang closed this issue 3 years ago
@korlaxxalrok for automation
@jingjtang are you working on the documentation/signal review stuff already?
Not yet. I am currently working on other stuff (mostly google-symptoms), and my computer cannot run the full dataset. I will either wait for @korlaxxalrok's automation or get a big chunk to run so that I have some outputs for the correlation analysis. Since Brian is working on other stuff now, we might not have the automation very soon. I plan to start this weekend. If you are interested and have time to do the correlation analysis before then, that would be great. Actually, I am not sure whether I should ping you here (maybe not; it seems you are added automatically).
@krivard A simple state-level comparison here. Something to consider:
bars_visit_prop and restaurants_visit_prop are both scaled by max(bars_visit_prop$value.max(), restaurants_visit_prop$value.max()). This makes the curves comparable between states, and also between bars_visit_prop and restaurants_visit_prop. The values are now in arbitrary units and cannot be compared between the safegraph-patterns signals and the USA-Facts incidence prop.
For most states, the number of visits to restaurants is much larger than the number of visits to bars (normalized by population), except for LA from late May to late July.
People did show some awareness of protecting themselves by reducing their visits to bars/restaurants in early March. But after May, or even earlier in some states such as FL, AL, and IA, the number of visits to bars/restaurants gradually increases even though the number of confirmed cases also increases.
safegraph_patterns_compare_with_usafacts.pdf
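The joint scaling described above can be sketched as follows. This is a minimal illustration, not the code used to produce the PDF: the function name `scale_jointly` and the toy values are assumptions, and it assumes a single shared maximum across both signals.

```python
def scale_jointly(bars, restaurants):
    """Divide both series by the larger of their two maxima.

    Sharing one denominator keeps bars and restaurants on the same
    (arbitrary) scale, so the two curves can be compared with each other.
    """
    joint_max = max(max(bars), max(restaurants))
    if joint_max == 0:
        raise ValueError("cannot scale: both series are all zeros")
    return [v / joint_max for v in bars], [v / joint_max for v in restaurants]


# Toy values (hypothetical), standing in for bars_visit_prop and
# restaurants_visit_prop on three days.
bars_scaled, restaurants_scaled = scale_jointly([2.0, 4.0, 1.0], [10.0, 20.0, 5.0])
```

After scaling, the largest value across both signals is 1, and every other value is a fraction of it.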
(Edited on 11-17)
Other things to consider:
bars_visit_num
bars_visit_prop
restaurants_visit_num
restaurants_visit_prop
For the other things to consider:
is there a policy to release smoothed signals? If not seems easier to release the raw one first.
No policy, go ahead with just the raw ones and we can add smoothed by request at a later date.
What's the earliest date that we want to report to the public?
February 1 2020 is the standard we've been using for all signals that go back that far or earlier.
Once a preview of this data is available, we should probably ping Larry, since he's interested in using it as soon as it is ready.
@capnrefsmmat A simple state level visual view here. Will have comprehensive visual view once it is automated with approved source name and signal names.
In addition to the time series comparison, the geo-wise correlation analysis is more interesting. Correlating usa-facts: confirmed_7dav_incidence_prop averaged from 2020-10-23 to 2020-10-25 against safegraph_patterns: restaurants_visit_prop from 2020-10-23 to 2020-10-25
MSA level
County Level
The restaurants_visit_prop signal is much more powerful than bars_visit_prop, especially in populated areas at the MSA/county level.
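For reference, a geo-wise correlation of this kind can be sketched as below. This is an illustration under stated assumptions: the thread does not say which correlation coefficient was used, so Pearson is an assumption here, as are the `(geo_id, date, value)` row layout, the function names, and the toy data. It also assumes both signals are averaged over the date window.

```python
from collections import defaultdict
from math import sqrt


def mean_by_geo(rows):
    """Average values per region; rows are (geo_id, date, value) tuples."""
    sums, counts = defaultdict(float), defaultdict(int)
    for geo_id, _date, value in rows:
        sums[geo_id] += value
        counts[geo_id] += 1
    return {geo: sums[geo] / counts[geo] for geo in sums}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def geo_correlation(cases_rows, visits_rows):
    """Correlate time-averaged signals across regions present in both."""
    cases, visits = mean_by_geo(cases_rows), mean_by_geo(visits_rows)
    shared = sorted(set(cases) & set(visits))
    return pearson([cases[g] for g in shared], [visits[g] for g in shared])


# Toy data (hypothetical): three regions, two days each.
cases_rows = [("a", "2020-10-23", 1.0), ("a", "2020-10-24", 1.0),
              ("b", "2020-10-23", 2.0), ("b", "2020-10-24", 2.0),
              ("c", "2020-10-23", 3.0), ("c", "2020-10-24", 3.0)]
visits_rows = [("a", "2020-10-23", 1.0), ("a", "2020-10-24", 3.0),
               ("b", "2020-10-23", 4.0), ("b", "2020-10-24", 4.0),
               ("c", "2020-10-23", 5.0), ("c", "2020-10-24", 7.0)]
r = geo_correlation(cases_rows, visits_rows)
```

Each region contributes one point (its average over the window), so the coefficient measures agreement across regions rather than over time.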
@RoniRos can you please review the signal names here.
@capnrefsmmat are the correlations above adequate for statistical review?
I think so. The main motivation here is for a project studying mobility and behavior specifically, so even if the correlations weren't great, that would be an important finding.
Fabulous. Marking statistical correlation as done.
@jingjtang can you please also prepare mailing list and signal popup drafts?
@benjaminysmith I think @Akvannortwick is in charge of the mailing list. I will do the signal popup drafts soon. Thanks for the reminder.
@benjaminysmith I will start the drafts and can link you to the google doc once I have something. @jingjtang I am taking a look right now.
@benjaminysmith I have a draft of the email, found here. If anyone has changes or suggestions, please use the suggest function. I will send it out once it has been reviewed and I receive word to do so.
Thanks @Akvannortwick , I am reviewing it now.
I see @krivard's question above about what source name to use, but do not see any further discussion. Is there a reason not to use the same source name as the other SafeGraph indicators, namely 'SafeGraph'? There is a reference in the documentation to 'SafeGraph Mobility' vs. 'SafeGraph Patterns', but the official source name for the former is still 'SafeGraph', I believe. @huisaddison : what is the significance of 'SafeGraph Mobility' vs. 'SafeGraph Patterns'? are these just two separate SafeGraph DBs?
One more thing, regarding how far back these indicators should go: According to the documentation, these are supposedly absolute numbers of visitors (normalized by population only). It is important to be able to interpret these numbers relative to pre-pandemic levels. We could do it in one of two ways:
@RoniRos Our Safegraph documentation: no, we do not discuss this. I don't believe Safegraph's own documentation discusses this either (I assume their philosophy is that the end-user can do their own normalization, which is more transparent). Google's pre-normalization has actually been annoying for the modeling team, since the normalization is described in vague terms on their website.
I am in favor of making the signal available for 2019 as well, and letting the end-user perform the normalization.
@RoniRos Safegraph Mobility vs Safegraph Patterns: the way we landed on this is very confusing.
Safegraph produces a number of datasets. We took their social distancing signals and started referring to them as "Safegraph Mobility" because they measure mobility and are from Safegraph (as opposed to the "Google Mobility" signals that we never built a pipeline for).
Then we took their patterns signals (named so presumably because they measure consumer patterns). I think we inherited the "patterns" name when @jingjtang wrote the new pipeline to process this distinct dataset.
We could rename our Safegraph [Mobility] to Safegraph Social Distancing for uniformity, if desired... https://docs.safegraph.com/docs
I made a separate pipeline because the format of the raw dataset is very different. In SafeGraph Mobility (the Social Distancing dataset), we read a series of files of ~500MB each, and each file contains data for a single day. Each row of a file represents a census block group (please correct me if I am wrong @huisaddison). The signals that we are interested in can be read directly from columns such as 'median_home_dwell_time'.
However, in SafeGraph Patterns (the Weekly Patterns dataset), we have to read a series of large files (>1GB each before gunzip), each of which contains the data for a specific week. In one file, each line represents a specific place, and there is a special column visit_by_day in the format [num1, num2, num3, ..., num7] (num X is the number of visits to that place on day X of that week). Besides, since we currently only consider bar- and restaurant-related places, there is also a filtering step based on the NAICS code provided in another dataset, "Core Places", which is also very large. (I chose to store only the files that are needed in the ./static folder of the safegraph_patterns pipeline to save memory, because the mapping from places to naics_code is not updated frequently.)
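To make the Weekly Patterns processing concrete, here is a minimal sketch of expanding the per-week array into daily counts and filtering by NAICS code. It is not the pipeline's actual code: the row keys `place_id` and `week_start`, the function name, and the toy rows are hypothetical, and the column name is taken as written in this thread. The NAICS codes shown (722410 for drinking places, 722511 for full-service restaurants) are an assumption about which codes are of interest.

```python
import json
from collections import defaultdict
from datetime import date, timedelta

BAR_NAICS = {"722410"}         # assumption: drinking places (alcoholic beverages)
RESTAURANT_NAICS = {"722511"}  # assumption: full-service restaurants


def daily_visits(pattern_rows, place_to_naics, wanted_naics):
    """Sum per-day visits over all places whose NAICS code is wanted.

    Each row holds one place for one week; its visit_by_day field is a
    JSON-style array of seven counts, one per day starting at week_start.
    """
    totals = defaultdict(int)
    for row in pattern_rows:
        if place_to_naics.get(row["place_id"]) not in wanted_naics:
            continue  # place is not a bar/restaurant, or has no known code
        week_start = date.fromisoformat(row["week_start"])
        for offset, count in enumerate(json.loads(row["visit_by_day"])):
            totals[week_start + timedelta(days=offset)] += count
    return dict(totals)


# Toy rows (hypothetical): two bars and one restaurant in the same week.
rows = [
    {"place_id": "p1", "week_start": "2020-06-08", "visit_by_day": "[1,2,3,4,5,6,7]"},
    {"place_id": "p2", "week_start": "2020-06-08", "visit_by_day": "[10,0,0,0,0,0,1]"},
    {"place_id": "p3", "week_start": "2020-06-08", "visit_by_day": "[100,100,100,100,100,100,100]"},
]
place_to_naics = {"p1": "722410", "p2": "722410", "p3": "722511"}
bar_visits = daily_visits(rows, place_to_naics, BAR_NAICS)
```

Keeping the place-to-NAICS mapping as a small precomputed lookup mirrors the idea of storing only the needed Core Places files, since that mapping changes rarely.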
I think users won't care too much about how we pre-process the raw dataset. So I prefer not to combine those two pipelines, but naming both sources safegraph is fine with me.
@jingjtang's argument makes sense to me, too. From a user's perspective, these are all from the same source, 'SafeGraph', in the sense in which we usually mean the term 'source' in our API.
If we accept this, then the API documentation needs to be updated, including removing reference to 'Safegraph Mobility' as a source name (but it can remain as reference to specific DBs within SafeGraph).
@RoniRos The normalization for the signals here from SafeGraph Patterns is a problem. The raw dataset does not provide any daily values other than the number of visits.
The only thing that could possibly serve as the denominator is raw_visitor_count, which is the number of unique visitors to a specific place in a specific week. The problem with this one is:
The raw_visitor_count is not consistent across time. So this does cause a problem for the current normalization, but I do not have a better solution.
@Akvannortwick @benjaminysmith The draft of the signal description pop-up text is here, under Name: People’s Visits to Bars and Name: People’s Visits to Restaurants.
@jingjtang Thanks for the link, I will take a look at it and make any suggestions if there are any.
@jingjtang -- to make sure I understand, is your argument that we should not attempt normalization due to those blockers (and Addison supports this regardless)?
Also @jingjtang, per Roni:
If we accept this, then the API documentation needs to be updated, including removing reference to 'Safegraph Mobility' as a source name (but it can remain as reference to specific DBs within SafeGraph).
It looks like you already have done this in https://github.com/cmu-delphi/delphi-epidata/pull/274/files. Can you please confirm?
Do you also have some quick charts for a visual review?
@jingjtang -- to make sure I understand, is your argument that we should not attempt normalization due to those blockers (and Addison supports this regardless)?
Currently we only normalize by population. This hasn't been discussed in the code review. But if you have any suggestions @huisaddison, that would be very helpful.
It looks like you already have done this in https://github.com/cmu-delphi/delphi-epidata/pull/274/files. Can you please confirm?
Yes. I accepted this and updated the PR for the API documentation. If @krivard has other concerns, we can change it back.
Do you also have some quick charts for a visual review?
Yes, a visual view is uploaded here.
I think that raw counts and pop normed is fine. Normalizing for a "non-COVID time range" can be done by the end user, assuming we provide historical data starting in January 2019.
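As an illustration of what this would look like for an end user, here is a minimal sketch of both steps: population normalization, then normalization against a pre-pandemic baseline. The function names, the per-100,000 convention, the choice of baseline window, and the toy numbers are all assumptions for the example, not something specified in this thread.

```python
from datetime import date


def visits_per_100k(visits, population):
    """Population-normalized visits (assumed convention: per 100,000 people)."""
    return visits * 100_000 / population


def relative_to_baseline(values_by_date, baseline_start, baseline_end):
    """Express each value as a multiple of the mean over a baseline window."""
    baseline = [v for d, v in values_by_date.items()
                if baseline_start <= d <= baseline_end]
    if not baseline:
        raise ValueError("no observations inside the baseline window")
    baseline_mean = sum(baseline) / len(baseline)
    return {d: v / baseline_mean for d, v in values_by_date.items()}


# Toy series (hypothetical): two 2019 days as baseline, one 2020 day.
prop = {
    date(2019, 3, 1): visits_per_100k(80, 200_000),
    date(2019, 3, 2): visits_per_100k(120, 200_000),
    date(2020, 4, 1): visits_per_100k(50, 200_000),
}
relative = relative_to_baseline(prop, date(2019, 1, 1), date(2019, 12, 31))
```

In this toy example the April 2020 value comes out at half of the 2019 baseline mean. Leaving this step to the user keeps the choice of baseline window transparent.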
I happened to look at the (now merged) PR that was linked in this issue, and I found a typo: https://github.com/cmu-delphi/covidcast-indicators/pull/225#discussion_r526257756 I commented in the PR, not realizing it was already merged. If the typo is still in the (merged) DETAILS.md, can it please be fixed? (It mentions deaths; I assume the DETAILS.md for a deaths indicator was used as a template.)
@huisaddison Thanks, I will fix that.
@jingjtang thanks sorry I missed that.
@capnrefsmmat can you please take a look at the visual review here and, if it looks good to you, approve it in the comments here?
The visual review looks plausible, though I wonder why Alabama has so many more restaurant visits than other states. Seems a little strange, but if that's really in the source data, I guess that's how it is.
Agreed -- Alabama on 2020-03-07 is 50% higher than the surrounding states for restaurants, and it looks like there is a similar effect for Louisiana for bars -- e.g., on 2020-07-18 the bar prop is 1364 vs., e.g., 36 for Texas.
@jingjtang can you spot check these?
@capnrefsmmat @benjaminysmith This notebook shows what the source data looks like.
The number of POIs (either bar or restaurant) varies a lot across regions but remains nearly constant over time, except for the change point in the week 2020-06-09 to 2020-06-15 (due to lockdown). In general, the number of bars is much smaller than the number of restaurants in the SafeGraph Core Places dataset. There are 20 unique brands considered for bars, while 962 brands are considered for restaurants.
The number of restaurants considered in AL is much larger than in other states, but the number of bars considered is quite small.
@benjaminysmith @jingjtang I am planning on putting safegraph_patterns onto prod today. The code is there, but I still need to schedule it and run the first manual ingest.
Is there any reason to hold off on doing this?
Update: we decided to hold off on pushing to prod until Monday but this is currently in flight.
safegraph_patterns has had its initial run, data ingested, and is now scheduled for Thursdays at 12:02 (one minute after safegraph).
This is a new indicator that, like the older one, is based on SafeGraph Mobility data, but focuses on weekly patterns of people's visits to places. This indicator tracks the daily number of visits to bars and restaurants as published by SafeGraph. The signal is per day, but the estimate is updated weekly.