USGS-VIZLAB / internal-analytics


regionalization metric #177

Open lindsayplatt opened 7 years ago

lindsayplatt commented 7 years ago

Preliminary implementation of the "regionality" scale for each app, at the yearly time scale. It doesn't need to be included in the app now; a single bar chart figure for it would be enough. A Spearman rank correlation will be a good way to get the index.

@jread-usgs in #internal-analytics Slack channel

I think we should calculate a "regionalization metric" (or some other name) that gets at the local vs. national scale of a particular web page/app. I was thinking of a scale from 0 to 10, where 0 is local and 10 is national. To do this, we could (this is just one idea; it would make sense to think about different ways to do it) calculate the percentage of user traffic expected for each state based on population, and compare the actual percentages for each time period. A perfect match (e.g., 12.1% of traffic from CA, 1.8% from WI, 8.6% from TX, etc.) would be a 10 ("national"), which is very unlikely, and a poorer match (such as 80% of traffic from WI) would be closer to 0, on the "local" end of the spectrum. So each app would have a score for each time period, and we could also have a cross-app score for comparison.

Where would/could this go? It could be a horizontal bar below the map w/ a fill between 0 and the score, plus a vertical slash taller than the bar thickness marking the mean/median of all apps for the same time period. If this turns out to be a useful way to distill the concept into a single number, figure one could have one column for the time period viewed, column two could be the same 0-10 bar, and a smaller column three could be the name of the state that is the greatest positive outlier from the expected distribution of visitors. Figure one would then also be reactive to the time period selected (instead of showing all three at once).

Today's meeting discussed the need to tell better stories about what we do and to come up with better ways to measure them. I think this is a way we can get a quick metric on the reach of many of the products that contain these "stories". We present the gross visit numbers clearly, but the "where" is just as important as the "how many" to have available as a quick view.

@aappling-usgs reaction:

cool idea! some of the more internally facing apps have high concentrations of users in california, colorado, and virginia - agreed that the most important version is the regionalization metric w.r.t. all users, but after that i'd be interested to see how enterprise apps are serving USGS users, which we could do by dividing by the fraction of USGS staff in each state. could the all-users map toggle between total per state and total per capita? (horizontal bar is fine too)
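A minimal sketch of that per-capita view (hypothetical names, not code from the repo; `state_traffic` is assumed to have one row per state with `users` and `pop` columns):

# Hypothetical sketch: users per 100k residents, per state.
library(dplyr)
per_capita <- state_traffic %>%
  mutate(users_per_100k = users / pop * 1e5) %>%
  arrange(desc(users_per_100k))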

also, sorta wondering if there's a biodiversity / spatial autocorrelation metric for this
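One off-the-shelf option from the biodiversity angle would be Shannon evenness (Pielou's J) over the state traffic shares; a minimal sketch, assuming `state_users` is a hypothetical named vector of user counts per state (note this measures evenness against a uniform distribution, not the population-based expectation discussed above):

# Sketch: Shannon evenness of traffic across states.
# 1 = traffic spread evenly across all states ("national"),
# near 0 = traffic concentrated in a few states ("local").
shannon_evenness <- function(state_users) {
  p <- state_users / sum(state_users)
  p <- p[p > 0]  # drop zero-traffic states (0 * log(0) treated as 0)
  -sum(p * log(p)) / log(length(state_users))
}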

lindsayplatt commented 7 years ago

@jread-usgs @aappling-usgs I think I set up the correct comparison data to get these ranks; I just don't really know what to do for the actual index. I'm using cached session data from the process step, lastest_year.rds.

# for now, using US population dataset from 1977

# get population and percentage of total US population per state
us_pop <- data.frame(state = row.names(state.x77), pop = state.x77[,'Population'], stringsAsFactors = FALSE)
us_total <- sum(us_pop$pop)
us_pop <- dplyr::mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
app_traffic <- readRDS('cache/process/lastest_year.rds')
app_traffic_grp <- dplyr::group_by(app_traffic, region)
app_traffic_sum <- dplyr::summarize(app_traffic_grp, traffic = sum(users))

# merge app traffic w/ population
app_traffic_sum <- dplyr::rename(app_traffic_sum, state = region)
traffic_by_state <- dplyr::left_join(us_pop, app_traffic_sum)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_by_state$traffic)
traffic_by_state <- dplyr::mutate(traffic_by_state, traffic_pct = traffic/traffic_total*100)

# now calculate the regionality index....
corr_result <- cor.test(x = traffic_by_state$pop_pct,
                        y = traffic_by_state$traffic_pct,
                        method = "spearman")

Am I trying to compare the right things?

jordansread commented 7 years ago

I think so... traffic_by_state$traffic_pct is the actual percent of traffic from each state over the time period (yearly here, but it would apply to month or week as well) and traffic_by_state$pop_pct is the percent of total US population in that state? Can you calc the metric for a couple of apps and share? Or do a barchart/lollipop with a point for each app, and then we can see if it passes the smell test.

aappling-usgs commented 7 years ago

agreed with jread - looks good to me so far.

lindsayplatt commented 7 years ago

traffic_by_state$traffic_pct = percent of total users in each state
traffic_by_state$pop_pct = percent of total population in each state

image

image

jordansread commented 7 years ago

Possible to get the app names on the bar plot @lindsaycarr ?

lindsayplatt commented 7 years ago

The dataset I have cached only has viewIDs, and no corresponding app names. I'll run make and see if it changes. If not, I might be able to work something out -- not sure how quickly

lindsayplatt commented 7 years ago

Working something out quickly with David and Walker, so a new plot w/ names is coming soon

lindsayplatt commented 7 years ago

image

Accompanying script:

# for now, using US population dataset from 1977

# get population and percentage of total US population per state
us_pop <- data.frame(state = row.names(state.x77), pop = state.x77[,'Population'], stringsAsFactors = FALSE)
us_total <- sum(us_pop$pop)
us_pop <- dplyr::mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
traffic <- readRDS('cache/process/lastest_year.rds')
traffic_grp <- dplyr::group_by(traffic, viewID, region)
traffic_sum <- dplyr::summarize(traffic_grp, traffic = sum(users))
traffic_sum <- dplyr::ungroup(traffic_sum)

# merge app traffic w/ population
traffic_sum <- dplyr::rename(traffic_sum, state = region)
traffic_pop <- dplyr::left_join(us_pop, traffic_sum)

# throw in app names
gaList <- yaml::yaml.load_file('data/gaTable.yaml')
gaTable <- data.frame(viewID = unlist(rvest::pluck(gaList, 'viewID')),
                      appName = unlist(rvest::pluck(gaList, 'shortName')),
                      stringsAsFactors = FALSE)
traffic_pop <- dplyr::left_join(traffic_pop, gaTable)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_pop$traffic)
traffic_pop <- dplyr::mutate(traffic_pop, traffic_pct = traffic/traffic_total*100)

# now calculate the regionality index for each app...

# get spearman's rho
all_corr <- sapply(unique(traffic_pop$appName), function(nm, traffic) {
  app_traffic <- dplyr::filter(traffic, appName == nm)
  corr_result <- cor(x = app_traffic$pop_pct,
                     y = app_traffic$traffic_pct,
                     method = "spearman")
  return(corr_result)
}, traffic = traffic_pop)

par(mar = c(2, 10, 0, 2))
barplot(all_corr, xlab = "App ID", horiz=TRUE, las=1)

par(mar = c(5, 4, 4, 2))
hist(x = all_corr, xlab = "Spearman's Rho", main = "Hist of Spearman's Rho for all apps")

jordansread commented 7 years ago

looks like we need Pearson correlation instead of Spearman

lindsayplatt commented 7 years ago

Need to grab population data that is not from 1977 before we draw too many conclusions from this

lindsayplatt commented 7 years ago

with 2012 population data:

image

library(choroplethr)
library(dplyr)
library(yaml)

# get population and percentage of total US population per state
# 2012 population data

data(df_pop_state)
us_pop <- df_pop_state
us_pop <- rename(us_pop, pop = value)
us_total <- sum(us_pop$pop)
us_pop <- mutate(us_pop, pop_pct = pop/us_total*100)

# what is being ranked? actual population percentage (x) vs percentage of traffic from each state (y)

# get traffic data based on users by state
# this is data from one app
traffic <- readRDS('cache/process/lastest_year.rds')
traffic_sum <- group_by(traffic, viewID, region) %>%
  summarize(traffic = sum(users)) %>%
  ungroup() %>%
  mutate(region = tolower(region))

# merge app traffic w/ population
traffic_pop <- left_join(us_pop, traffic_sum)

# calculate percentage of traffic for each state
traffic_total <- sum(traffic_pop$traffic)
traffic_pop <- mutate(traffic_pop, traffic_pct = traffic/traffic_total*100)

# throw in app names
gaList <- yaml.load_file('data/gaTable.yaml')
gaTable <- data.frame(viewID = unlist(rvest::pluck(gaList, 'viewID')),
                      appName = unlist(rvest::pluck(gaList, 'shortName')),
                      stringsAsFactors = FALSE)
traffic_pop <- left_join(traffic_pop, gaTable)

# now calculate the regionality index for each app...

# get spearman's rho
all_corr <- sapply(unique(traffic_pop$appName), function(nm, traffic) {
  app_traffic <- filter(traffic, appName == nm)
  corr_result <- cor(x = app_traffic$pop_pct,
                     y = app_traffic$traffic_pct,
                     method = "spearman")
  return(corr_result)
}, traffic = traffic_pop)

par(mar = c(2, 10, 0, 2))
barplot(all_corr, xlab = "App ID", horiz=TRUE, las=1)

par(mar = c(5, 4, 4, 2))
hist(x = all_corr, xlab = "Spearman's Rho", main = "Hist of Spearman's Rho for all apps")

jordansread commented 7 years ago

@lindsaycarr can you try it w/ Pearson's R?

lindsayplatt commented 7 years ago

Hmm we've got some negative guys in here. Added the top contributing state next to each bar.

Pearson's r:

image

lindsayplatt commented 7 years ago

Some seem reasonable, but a lot are skewed because all the developers sit here in WI.

lindsayplatt commented 7 years ago

spearman w/ top states:

image

jordansread commented 7 years ago

I am not sure of the best stat for this, but I think the linear relationship that Pearson's tests for is more appropriate than the rank agreement that Spearman's looks for (i.e., "is the order of states the same?"). I also think lumping everything < 0.1 into a score of 0.1 would be fine for the time being.

What this isn't doing yet is properly dealing w/ zeros. For example, the NPS mercury viewer has a pretty "national" score, but really only has a small number of visits from 4 states.
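One hedged sketch of dealing with those zeros (not from the repo's scripts): start from the full state list and fill missing traffic with 0 before correlating, so an app visited from only a few states gets pulled down rather than having those states dropped. Column names follow the 2012-population snippet above, for a single app's data:

# Sketch: include zero-traffic states for one app before correlating.
# `us_pop` has one row per state (region, pop, pop_pct);
# `app_traffic` has region (lowercased) and traffic for a single app.
library(dplyr)
complete_traffic <- us_pop %>%
  left_join(app_traffic, by = "region") %>%
  mutate(traffic = ifelse(is.na(traffic), 0, traffic),
         traffic_pct = traffic / sum(traffic) * 100)
cor(complete_traffic$pop_pct, complete_traffic$traffic_pct, method = "pearson")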

aappling-usgs commented 7 years ago

agreed w/ jread that spearman's may look prettier, but pearson's makes more sense - i'm more interested in differences between the proportions of population and proportions of traffic than i am in any differences in how states rank relative to one another by population vs traffic.

alternatively, what about a metric constructed from the differences between traffic & population proportions (or quotients, i.e., traffic per capita)? what about root mean squared difference between traffic and population proportions? then the 0s would be -1, which is a big number, and big RMSDs would indicate a lack of evenness (numbers close to 0 indicate greater evenness).

jordansread commented 7 years ago

Thinking about this again...what we really want is a metric that gives us some measure of how far off the "real" percentages are from falling on a 1:1 line with the "expected" (or population-based) numbers. So I think the slope here is important. Neither of the two metrics we have tried so far does this.

Ok, just saw alison weighed in too

aappling-usgs commented 7 years ago

yeah, agreed on the slope. i think RMSD fixes that part. unsure whether it introduces any other problems...would like to see how it looks.

aappling-usgs commented 7 years ago

revisiting my previous comment: i guess i'm thinking normalized differences, like this:

sqrt(mean(((traffic_pct - pop_pct) / pop_pct)^2))

with the caveat that if pop_pct is 0.18% (wyoming), then traffic_pct of 1% becomes a huge number (5), and if there were only 100 visitors in a time period, traffic_pct could pretty much randomly flip between 1% and 2%. so small states will contribute noise. the alternative is absolute differences, which will pretty much ignore small states altogether.

jordansread commented 7 years ago

that's an RMSE calculation, isn't it? ☝️ or do those two mean the same thing...

aappling-usgs commented 7 years ago

yes; i just avoided E because they're not really errors in my mind.

the clarification was RMS[E/D] of the relative differences rather than the absolute differences. absolute differences would look like this:

sqrt(mean((traffic_pct - pop_pct)^2))

and i realized i hadn't specifically suggested relative differences in my previous comment.

jordansread commented 7 years ago

I think we'll need to boil this down to a scale that is simple to pick up quickly - like a scale from 1-10 or 0-1, from local to national. It's tricky to get an RMSD onto that kind of scale, but maybe there is a simple way to do it.
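One hypothetical way to squash an unbounded RMSD onto a 0 (local) to 10 (national) scale would be an exponential transform; nothing like this was settled in the thread, and the rate constant k would need tuning against real apps:

# Hypothetical rescaling: RMSD of 0 (traffic matches population exactly)
# maps to 10 ("national"); large RMSDs decay toward 0 ("local").
rmsd_to_score <- function(rmsd, k = 1) {
  10 * exp(-k * rmsd)
}
rmsd_to_score(c(0, 0.5, 1, 3))  # ~10, ~6.1, ~3.7, ~0.5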

aappling-usgs commented 7 years ago

what about R^2 for traffic_pct ~ pop_pct? it penalizes relationships with slope != 1, usually ranges between 0 and 1. caveat is that it can be negative for this type of dataset (actual range is -Inf to 1). we could just truncate at a lower limit of 0 for plotting purposes...

[four example plots of traffic_pct vs pop_pct against the 1:1 line, each titled with its R^2]

example code for the above plots:

pop_pct <- seq(1,100,by=2)
traffic_pct <- pmax(0, rnorm(n=50, mean=seq(0,100,length.out=50), sd=3))
plot(traffic_pct ~ pop_pct, ylim=c(0,180), xlim=c(0,180)); abline(a=0,b=1)
met <- 1-sum((traffic_pct - pop_pct)^2)/sum((traffic_pct - mean(traffic_pct))^2)
title(paste0('R^2 = ', met))

jordansread commented 7 years ago

That allows an offset though, right? I thought this was the way to pin 1:1 and intercept of 0:

lm(y ~ 0 + offset(x))

but it doesn't have any freedom to it, so you can't calculate an R2

aappling-usgs commented 7 years ago

i thought we wanted to penalize slopes different from 1, which the above code does. i did the calculation from the formula rather than fitting an lm; i'm not sure how you'd fit an lm so that it penalizes non-1 slopes

aappling-usgs commented 7 years ago

summary(lm(traffic_pct ~ 0 + offset(pop_pct)))

seems to always give r.squared = 0:

List of 9
 $ call         : language lm(formula = traffic_pct ~ 0 + offset(pop_pct))
 $ terms        :Classes 'terms', 'formula'  language traffic_pct ~ 0 + offset(pop_pct)
  .. ..- attr(*, "variables")= language list(traffic_pct, offset(pop_pct))
  .. ..- attr(*, "offset")= int 2
  .. ..- attr(*, "factors")= int(0) 
  .. ..- attr(*, "term.labels")= chr(0) 
  .. ..- attr(*, "order")= int(0) 
  .. ..- attr(*, "intercept")= int 0
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(traffic_pct, offset(pop_pct))
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "traffic_pct" "offset(pop_pct)"
 $ aliased      : logi(0) 
 $ residuals    : Named num [1:50] 42.8 36.8 40.8 -7 15.3 ...
  ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
 $ df           : int [1:3] 0 50 0
 $ coefficients : logi[0 , 1:4] 
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ sigma        : num 38.4
 $ adj.r.squared: num 0
 $ r.squared    : num 0
 - attr(*, "class")= chr "summary.lm"

jordansread commented 7 years ago

Gotcha. I understand yours now. I was thinking that was a snippet that went into an lm()

jordansread commented 7 years ago

I didn't see your snippet at the bottom the first time I looked.

aappling-usgs commented 7 years ago

i might have added it in an edit - that's a bad habit of mine =)

jordansread commented 7 years ago

OK, I like this. @lindsaycarr and I are working on getting the data.frame set to do the math.

lindsayplatt commented 7 years ago

OK, so here is what I've got using R^2 for the most recent year of analytics data. The states listed are the ones that deviated the most from "expected", and the numbers in parentheses are the amounts by which they deviated.

image

All negative R^2s were treated as 0s

The script can be found in my branch: https://github.com/lindsaycarr/internal-analytics/blob/regionality/scripts/process/regionality_test.R
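For readers without access to that branch, a hedged sketch of the calculation described above (the 1:1-line R^2 from earlier in the thread, with negative values floored at 0); this is not the contents of regionality_test.R, just the formula wrapped in a function:

# Sketch: "regionality" score for one app, from the percent-of-traffic and
# percent-of-population vectors used in the snippets above.
regionality_r2 <- function(traffic_pct, pop_pct) {
  r2 <- 1 - sum((traffic_pct - pop_pct)^2) /
            sum((traffic_pct - mean(traffic_pct))^2)
  max(0, r2)  # negative R^2 treated as 0, as in the figure above
}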

jordansread commented 7 years ago

Looks good! Alison, what do you think?

aappling-usgs commented 7 years ago

Yeah, I like it! You could probably round the numbers more if you want - looks like you seldom need 1 digit after the decimal point, let alone 2. maybe just go for 2 sig figs?

lindsayplatt commented 7 years ago

Good idea. @jread-usgs any desire to make this 0:10 instead of 0:1?

aappling-usgs commented 7 years ago

PS - i couldn't remember how to do sig figs, so looked it up. in case it's not on the tip of your fingers already:

signif(c(.1234, 1.234, 12.34, 123.4), digits=2)
## [1]   0.12   1.20  12.00 120.00

jordansread commented 7 years ago

Yes, I think 0-10 as a metric would be great.

Would it be possible to mock up figure1 with this as the second column (only two columns total)?
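If the 0-10 version goes ahead, a minimal sketch of the conversion, also folding in the 2-significant-figure rounding suggested above (the function name is just illustrative):

# Sketch: rescale the truncated R^2 (0-1) to a 0-10 "local to national" score.
regionality_score <- function(r2) {
  signif(10 * pmax(0, r2), digits = 2)
}
regionality_score(c(-0.3, 0.07, 0.482, 0.91))
## [1] 0.0 0.7 4.8 9.1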

lindsayplatt commented 7 years ago

@jread-usgs I just read this again, but I'm actually not sure what you are talking about. Add this to Fig 1 of the internal-analytics home page? So the yearly analytics bars as column one and this as column two?

jordansread commented 7 years ago

Correct @lindsaycarr

lindsayplatt commented 7 years ago

struggling to get the facets for yearly session totals to scale freely

image

aappling-usgs commented 7 years ago

what have you tried? does this help? https://stackoverflow.com/a/21585521/3203184

lindsayplatt commented 7 years ago

So, I didn't touch the plotting code; just changed the data. The plotting code has:

facet_grid(bin ~ type, scales = "free",
               space = "free_y", drop = TRUE)

which I think should handle this appropriately. I've tried all the options for scales and space (free, fixed, free_y, free_x).

aappling-usgs commented 7 years ago

and if you delete the regionalization rows, everything is nice again?

ldecicco-USGS commented 7 years ago

facet_grid is designed to use the same x scale within a column and the same y scale within a row...so you have to scale the data yourself if you want to bypass that.

If you are using the processed data from trend_all (?), then you should be able to use "scaled_value", "scaled_newUser", and "text_placement".
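A hypothetical illustration of the "scale it yourself" approach (plot_data and value are placeholder names, not from the repo; type matches the facetting variable in the snippet above): rescale values within each facet before plotting so facet_grid can use one shared scale.

# Sketch: per-facet rescaling to 0-1 so a single shared scale works.
library(dplyr)
plot_data <- plot_data %>%
  group_by(type) %>%
  mutate(scaled_value = value / max(value, na.rm = TRUE)) %>%
  ungroup()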

lindsayplatt commented 7 years ago

Ha - just figured it out. Needed it to ignore scaled_newUser

lindsayplatt commented 7 years ago

Wait, that wasn't exactly it. I don't really understand why, but having both of these lines prevents free scaling; removing either one allows it.

geom_segment(aes(xend = longName), yend=0, size = 0.65, color = bar_line_col) +
geom_segment(aes(xend = longName, y = scaled_value),
                 yend=0, col=bar_line_col, size=1.15)

Result when removing the first line:

image

aappling-usgs commented 7 years ago

/shrug !

wdwatkins commented 5 years ago

@lindsayplatt what is the status of this work? Is the metric settled and ready for prime time?