Release SafeGraph_Patterns

jingjtang commented 3 years ago

Link to issue
Link to PR
Proposed release version: not sure yet

This is a new indicator that, like the older one, is based on SafeGraph Mobility data, but it focuses on weekly patterns of people's visits to places. The indicator tracks the daily number of visits to bars and restaurants as published by SafeGraph. The signal is per day, but the estimate is updated weekly.

[x] API documentation and/or changelog
[x] API mailing list notification

[x] Statistical review (usually correlations)
[x] Signal / source name review (usually Roni)

[x] Visual review
[x] Signal description pop-up text review
[x] Map release notes

jingjtang commented 3 years ago

@korlaxxalrok for automation

benjaminysmith commented 3 years ago

@jingjtang are you working on the documentation/signal review stuff already?

jingjtang commented 3 years ago

@jingjtang are you working on the documentation/signal review stuff already?

Not yet. I am currently working on other stuff (mostly google-symptoms), and my computer does not allow me to run on the full dataset. I will either wait for @korlaxxalrok's automation or take a big chunk and run it to get some outputs for the correlation analysis. Since Brian is working on other stuff now, we might not have the automation very soon. I plan to start this weekend. If you are interested or have time to do the correlation analysis before that, it would be great. Actually, I am not sure whether I should ping you here (maybe not; it seems you are automatically added).

jingjtang commented 3 years ago

@krivard A simple state-level comparison here. Something to consider:

Do we want to add smoothed signals?
Don't see a strong correlation here. The curve for USA-Facts is scaled by the maximum value across all the states, while the curves for bars_visit_prop and restaurants_visit_prop are scaled by max(bars_visit_prop$value.max(), restaurants_visit_prop$value.max()). This makes the curves comparable between states and between bars_visit_prop and restaurants_visit_prop. The values are now in arbitrary units and cannot be compared between the safegraph-patterns signals and the USA-Facts incidence prop.
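The shared-maximum scaling described above can be sketched as follows. This is a minimal illustration, not the code used to produce the plots; the function name and the sample values are hypothetical.

```python
def scale_by_shared_max(series_by_name):
    """Divide every value by the largest value found in any of the series."""
    shared_max = max(max(values) for values in series_by_name.values())
    return {
        name: [value / shared_max for value in values]
        for name, values in series_by_name.items()
    }


# Hypothetical values, for illustration only.
patterns = {
    "bars_visit_prop": [10.0, 20.0, 40.0],
    "restaurants_visit_prop": [100.0, 150.0, 200.0],
}
scaled = scale_by_shared_max(patterns)
```

Because both signals share one denominator, their relative sizes are preserved, but the result is unitless. To keep states comparable as well, the dictionary would need to hold the series for every state at once.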

For most of the states, the number of visits to restaurants is much larger than the number of visits to bars (normalized by population), except for LA from late May to late July. People do show some awareness of protecting themselves by reducing the number of visits to bars/restaurants in early March. But after May, or even earlier in some states like FL, AL, IA, etc., the number of visits to bars/restaurants gradually increases even though the number of confirmed cases also increases. safegraph_patterns_compare_with_usafacts.pdf

(Edited on 11-17)

Other things to consider:

What's the source name we want to use for this one?
Do we want to make a separate API documentation for SafeGraph Patterns or do we want to combine them?
The current signal names are:
- bars_visit_num
- bars_visit_prop
- restaurants_visit_num
- restaurants_visit_prop
What's the earliest date that we want to report to the public? We have data dating back to 2018-12-31.

benjaminysmith commented 3 years ago

@krivard Smoothed signals: is there a policy to release smoothed signals? If not seems easier to release the raw one first.
@jingjtang What's the correlation like with the existing safegraph indicator?

For the other things to consider:

@jingjtang What is your recommendation on the source name and combining them?
@krivard any overall policy on this? I see April often used as a start date in other analyses.

krivard commented 3 years ago

is there a policy to release smoothed signals? If not seems easier to release the raw one first.

No policy, go ahead with just the raw ones and we can add smoothed by request at a later date.

What's the earliest date that we want to report to the public?

February 1 2020 is the standard we've been using for all signals that go back that far or earlier.

capnrefsmmat commented 3 years ago

Once a preview of this data is available, we should probably ping Larry, since he's interested in using it as soon as it is ready.

jingjtang commented 3 years ago

@capnrefsmmat A simple state level visual view here. Will have comprehensive visual view once it is automated with approved source name and signal names.

jingjtang commented 3 years ago

In addition to the time series comparison, the geo-wise correlation analysis is more interesting. Correlating usa-facts: confirmed_7dav_incidence_prop averaged from 2020-10-23 to 2020-10-25 against safegraph_patterns: restaurants_visit_prop from 2020-10-23 to 2020-10-25

State Level

[Figures: state_bars_visit, state_restaurants_visit]

MSA level
County Level

The restaurants_visit_prop is much more powerful than the bars_visit_prop especially in populated areas at MSA/County level.
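A geo-wise correlation of this kind can be sketched as below: average each signal over the date window within each location, then correlate the averages across locations. The thread does not say which correlation measure was used, so Pearson correlation is an assumption here, and the function names and sample values are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def geo_correlation(signal_a, signal_b):
    """Each argument maps a geo id to its daily values over the same window."""
    geos = sorted(set(signal_a) & set(signal_b))  # locations present in both
    averages_a = [mean(signal_a[geo]) for geo in geos]
    averages_b = [mean(signal_b[geo]) for geo in geos]
    return pearson(averages_a, averages_b)


# Hypothetical three-day windows for three states.
cases = {"pa": [10.0, 12.0, 14.0], "tx": [20.0, 22.0, 24.0], "al": [30.0, 32.0, 34.0]}
visits = {"pa": [300.0, 310.0, 320.0], "tx": [500.0, 510.0, 520.0], "al": [700.0, 710.0, 720.0]}
r = geo_correlation(cases, visits)
```

One value of `r` is produced per geographic level (state, MSA, county), which is what allows the levels to be compared.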

jingjtang commented 3 years ago

API documentation draft here

benjaminysmith commented 3 years ago

@RoniRos can you please review the signal names here.

benjaminysmith commented 3 years ago

@capnrefsmmat are the correlations above adequate for statistical review?

capnrefsmmat commented 3 years ago

I think so. The main motivation here is for a project studying mobility and behavior specifically, so even if the correlations weren't great, that would be an important finding.

benjaminysmith commented 3 years ago

Fabulous. Marking statistical correlation as done.

benjaminysmith commented 3 years ago

@jingjtang can you please also prepare mailing list and signal popup drafts?

jingjtang commented 3 years ago

@benjaminysmith I think @Akvannortwick is in charge of the mailing list. I will do the signal popup drafts soon. Thanks for the reminder.

Akvannortwick commented 3 years ago

@benjaminysmith I will start the drafts and can link you to the Google Doc once I have something. @jingjtang I am taking a look right now.

Akvannortwick commented 3 years ago

@benjaminysmith I have a draft of the email, found here. If anyone wants to make changes or suggestions, please use the suggest function. I will send it out once it has been reviewed and I receive word to do so.

RoniRos commented 3 years ago

Thanks @Akvannortwick , I am reviewing it now.

I see @krivard's question above about what source name to use, but do not see any further discussion. Is there a reason not to use the same source name as the other SafeGraph indicators, namely 'SafeGraph'? There is a reference in the documentation to 'SafeGraph Mobility' vs. 'SafeGraph Patterns', but the official source name for the former is still 'SafeGraph', I believe. @huisaddison : what is the significance of 'SafeGraph Mobility' vs. 'SafeGraph Patterns'? are these just two separate SafeGraph DBs?

RoniRos commented 3 years ago

One more thing, regarding how far back these indicators should go: according to the documentation, these are supposedly absolute numbers of visitors (normalized by population only). It is important to be able to interpret these numbers relative to pre-pandemic levels. We could do it in one of two ways:

Make the signal available for 2019 as well, i.e. from 1-Jan-2019.
Change the signal to be normalized by the pre-pandemic level on the same date (a method used by Google mobility, I believe).

@huisaddison Is there a discussion of this issue in the SafeGraph documentation?
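If the first option is taken and 2019 data is published, an end user could approximate the second option on their own. The sketch below divides each value by the value on the same calendar date one year earlier. That baseline choice is an assumption made for illustration, and the function name and sample values are hypothetical.

```python
from datetime import date


def relative_to_prior_year(values_by_date):
    """Map each date to value / value on the same calendar date a year earlier."""
    ratios = {}
    for day, value in values_by_date.items():
        try:
            baseline_day = day.replace(year=day.year - 1)
        except ValueError:
            # Feb 29 has no counterpart in the previous year.
            continue
        baseline = values_by_date.get(baseline_day)
        if baseline:  # skip dates whose baseline is missing or zero
            ratios[day] = value / baseline
    return ratios


# Hypothetical values, for illustration only.
visits = {
    date(2019, 4, 1): 200.0,
    date(2020, 4, 1): 50.0,
    date(2020, 2, 29): 80.0,
}
ratios = relative_to_prior_year(visits)
```

Note that the same calendar date falls on a different weekday each year, so a real comparison would likely need to align weekdays or smooth over a week first.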

huisaddison commented 3 years ago

@RoniRos Our Safegraph documentation: no, we do not discuss this. I don't believe Safegraph's own documentation discusses this either (I assume their philosophy is that the end-user can do their own normalization, which is more transparent). Google's pre-normalization has actually been annoying for the modeling team, since the normalization is described in vague terms on their website.

I am in favor of making the signal available for 2019 as well, and letting the end-user perform the normalization.

huisaddison commented 3 years ago

@RoniRos Safegraph Mobility vs Safegraph Patterns: the way we landed on this is very confusing.

Safegraph produces a number of datasets. We took their social distancing signals and started referring to them as "Safegraph Mobility" because they measure mobility and are from Safegraph (as opposed to the "Google Mobility" signals that we never built a pipeline for).

Then we took their patterns signals (named so presumably because they measure consumer patterns). I think we inherited the "patterns" name when @jingjtang wrote the new pipeline to process this distinct dataset.

We could rename our Safegraph [Mobility] to Safegraph Social Distancing for uniformity, if desired... https://docs.safegraph.com/docs

jingjtang commented 3 years ago

I made a separate pipeline because the format of the raw dataset is very different. In SafeGraph Mobility (the Social Distancing dataset), we read a series of files of about 500MB each, and each file contains data for a single day. Each row of a file represents a census block group (please correct me if I am wrong, @huisaddison). The signals that we are interested in can be read directly from columns such as 'median_home_dwell_time'.

However, in SafeGraph Patterns (the Weekly Patterns dataset), we need to read a series of large files (>1GB each before gunzip), each of which contains the data for a specific week. In one file, each line represents a specific place. There is a special column, visit_by_day, in the format [num1, num2, num3, ..., num7], where numX is the number of visits to that place on day X of that week. In addition, since we currently consider only bar- and restaurant-related places, there is a filtering step based on the NAICS code provided in another dataset, "Core Places", which is also very large. (I chose to store only the files that are needed in the ./static folder of the safegraph_patterns pipeline to save memory. This works because the mapping from places to naics_code is not updated frequently.)
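The processing described above (filter places by NAICS code, then unpack the seven daily counts) can be sketched as follows. This is not the pipeline's code. The column name place_id, the sample rows, and the choice of 722410 (the NAICS code for drinking places) as the "bars" filter are assumptions made for illustration. The sketch also sums over all matching places rather than grouping by geography.

```python
import csv
import io
import json

# Assumed filter; the codes the real pipeline selects are not stated in this thread.
BAR_NAICS = {"722410"}


def daily_visits(weekly_rows, place_to_naics, wanted_naics):
    """Sum visit_by_day element-wise over places whose NAICS code is wanted."""
    totals = [0] * 7
    for row in weekly_rows:
        if place_to_naics.get(row["place_id"]) not in wanted_naics:
            continue
        # visit_by_day looks like "[num1, num2, ..., num7]", which parses as JSON.
        for day_index, count in enumerate(json.loads(row["visit_by_day"])):
            totals[day_index] += count
    return totals


# Hypothetical one-week file with three places.
sample_file = io.StringIO(
    'place_id,visit_by_day\n'
    'p1,"[1,2,3,4,5,6,7]"\n'
    'p2,"[10,0,0,0,0,0,1]"\n'
    'p3,"[5,5,5,5,5,5,5]"\n'
)
place_to_naics = {"p1": "722410", "p2": "722511", "p3": "722410"}
bar_totals = daily_visits(csv.DictReader(sample_file), place_to_naics, BAR_NAICS)
```

Keeping the place-to-NAICS mapping as a small lookup table mirrors the idea of storing only the needed Core Places files, so the large dataset does not have to be re-read for every weekly file.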

I think users won't care too much about how we pre-process the raw dataset. So, I prefer not to combine those two pipelines but naming both of the source to be safegraph is fine to me.

RoniRos commented 3 years ago

@jingjtang's argument makes sense to me, too. From a user's perspective, these are all from the same source, 'SafeGraph', in the sense in which we usually mean the term 'source' in our API.
If we accept this, then the API documentation needs to be updated, including removing reference to 'Safegraph Mobility' as a source name (but it can remain as reference to specific DBs within SafeGraph).

jingjtang commented 3 years ago

@RoniRos The normalization for the SafeGraph Patterns signals is a problem. The raw dataset does not provide any daily values other than the number of visits.

The only thing that could possibly serve as the denominator is raw_visitor_count, the number of unique visitors to a specific place in a specific week. The problems with it are:

It is weekly; no daily value is available.
It cannot be aggregated to any other geo-resolution, since it counts unique visitors to specific places (we cannot get the intersections).

The raw_visitor_count is not consistent across time. So it does cause a problem for the current normalization, but I do not have a better solution.

jingjtang commented 3 years ago

@Akvannortwick @benjaminysmith The draft of the signal description pop-up text is here, under Name: People’s Visits to Bars and Name: People’s Visits to Restaurants.

Akvannortwick commented 3 years ago

@jingjtang Thanks for the link, I will take a look at it and make any suggestions if there are any.

benjaminysmith commented 3 years ago

@jingjtang -- to make sure I understand, is your argument that we should not attempt normalization due to those blockers (and Addison supports this regardless)?

benjaminysmith commented 3 years ago

Also @jingjtang, per Roni:

If we accept this, then the API documentation needs to be updated, including removing reference to 'Safegraph Mobility' as a source name (but it can remain as reference to specific DBs within SafeGraph).

It looks like you already have done this in https://github.com/cmu-delphi/delphi-epidata/pull/274/files. Can you please confirm?

Do you also have some quick charts for a visual review?

jingjtang commented 3 years ago

@jingjtang -- to make sure I understand, is your argument that we should not attempt normalization due to those blockers (and Addison supports this regardless)?

Currently we only do the normalization based on population. This hasn't been discussed in the code review. But if you have any suggestions, @huisaddison, that would be very helpful.

It looks like you already have done this in https://github.com/cmu-delphi/delphi-epidata/pull/274/files. Can you please confirm?

Yes. I accepted this and updated the PR for the API documentation. If @krivard has other concerns, we can change it back.

Do you also have some quick charts for a visual review?

Yes, a visual view is uploaded here

huisaddison commented 3 years ago

I think that raw counts and pop normed is fine. Normalizing for a "non-COVID time range" can be done by the end user, assuming we provide historical data starting in January 2019.

I happened to look at the now-merged PR that was linked in this issue, and I found a typo: https://github.com/cmu-delphi/covidcast-indicators/pull/225#discussion_r526257756 I commented in the PR, not realizing it was already merged. If the typo is still in the (merged) DETAILS.md, can it please be fixed? (It mentions deaths; I assume the DETAILS.md for a deaths indicator was used as a template.)

jingjtang commented 3 years ago

@huisaddison Thanks, I will fix that.

benjaminysmith commented 3 years ago

@jingjtang thanks, sorry I missed that.

@capnrefsmmat can you please take a look at the visual review here and, if it looks good to you, approve it in the comments here?

capnrefsmmat commented 3 years ago

The visual review looks plausible, though I wonder why Alabama has so many more restaurant visits than other states. Seems a little strange, but if that's really in the source data, I guess that's how it is.

benjaminysmith commented 3 years ago

Agreed -- Alabama on 2020-03-07 is 50% higher than the surrounding states for restaurants, and it looks like there is a similar effect for Louisiana for bars -- e.g. on 2020-07-18 the bar prop is 1364 vs. 36 for Texas.

@jingjtang can you spot check these?
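A spot check like this could be partly automated by flagging states whose value is far above the typical value for that day. The sketch below is only one possible heuristic. The threshold ratio is arbitrary, the function name is hypothetical, and apart from the Louisiana and Texas figures quoted above the sample values are made up.

```python
from statistics import median


def flag_outliers(value_by_state, ratio=5.0):
    """Return states whose value exceeds `ratio` times the median across states."""
    typical = median(value_by_state.values())
    return sorted(
        state for state, value in value_by_state.items() if value > ratio * typical
    )


# bars_visit_prop on 2020-07-18: "la" and "tx" are quoted in the thread,
# "ms" and "ar" are hypothetical placeholders.
bars_prop = {"la": 1364.0, "tx": 36.0, "ms": 40.0, "ar": 30.0}
flagged = flag_outliers(bars_prop)
```

A flagged state is not necessarily an error. It only marks where the source data is worth inspecting by hand.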

jingjtang commented 3 years ago

@capnrefsmmat @benjaminysmith This notebook shows how the source data is.

The number of POIs (either bar or restaurant) varies a lot across regions but remains nearly constant over time, except for the change-point week 2020-06-09 to 2020-06-15 (due to lockdown). In general, the number of bars is much smaller than the number of restaurants in the SafeGraph Core Places dataset. There are 20 unique brands considered for bars, while 962 brands are considered for restaurants.

The number of restaurants considered in AL is much larger than in other states, but the number of bars considered is quite small.

korlaxxalrok commented 3 years ago

@benjaminysmith @jingjtang I am planning on putting safegraph_patterns onto prod today. The code is there, but I still need to schedule it and run the first manual ingest.

Is there any reason to hold off on doing this?

benjaminysmith commented 3 years ago

Update: we decided to hold off on pushing to prod until Monday but this is currently in flight.

korlaxxalrok commented 3 years ago

safegraph_patterns has had its initial run, data ingested, and is now scheduled for Thursdays at 12:02 (one minute after safegraph).

cmu-delphi / covidcast-indicators

Release SafeGraph_Patterns #402