Code Request 2 - Githubissues

Code review request for

Storefronts_cd_map.Rmd (Reese)
Storefronts_clusterinng_map.Rmd (Reese)
Storefronts_nta_map_21_22.Rmd (Brook)
median_income_vacancy_21_22.R (Brook)
sf_rent_explore_21_22.R (Brook)
You can ignore Storefronts_ct_map.Rmd (Reese), that one will not be added to the web page.

Some things:

I did some weird things to be able to merge on the geoids in Storefronts_ct_map.Rmd & Storefronts_clusterinng_map.Rmd , don't worry if it's confusing - Brook double-checked last week!
Because of ongoing councildown updates I had issues using colorBin/some leaflet features, which is why I'm using leaflet::colorBin instead of councildown::colorBin, etc.

I'm going to be pushing small edits to the cr branch - feel free to look at what I've changed there and take what you like + drop what you don't. I'll also post a lil comment here as I get to each file. I'm having a lot of leaflet/shapefile replication issues that I mostly solve below, but they're likely just versioning and my fixes might break it for you.

Storefronts_cd_map.Rmd

Some general formatting changes for readability
I changed the definition of vacant_cd and vacant_cd.shp - all I did was move the ungroup() %>% up before the left join and add as.data.frame(). Did this because otherwise I got an sf/tbl object which couldn't be plotted.
I changed line 81 to use quantile breaks, feel free to take as you feel! I like this more because it uses more of the full range of colors, so you can see more city-wide variation, and it ends up showing you the true min/max in the legend. First image is the quantile breaks, second is the SD breaks.
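
To make the quantile vs SD breaks point above concrete, here's a minimal base R sketch. The vacancy numbers are made up for illustration (not from the storefront data), and the commented-out leaflet::colorBin call is just an assumption about how the breaks would be passed along.

```r
# Hypothetical vacancy rates (percent), for illustration only
vacancy <- c(2, 3, 4, 5, 5, 6, 7, 8, 9, 12, 15, 30)

# Quantile breaks: each bin holds roughly the same number of areas,
# and the outer breaks are the observed min and max
quantile_breaks <- quantile(vacancy, probs = seq(0, 1, by = 0.2), names = FALSE)

# SD breaks: one standard deviation wide, centred on the mean;
# they need not coincide with the observed min and max
m <- mean(vacancy)
s <- sd(vacancy)
sd_breaks <- m + s * (-2:2)

# How many areas land in each quantile bin
bin_counts <- table(cut(vacancy, breaks = quantile_breaks, include.lowest = TRUE))

# Assumed usage with leaflet (not run here):
# pal <- leaflet::colorBin("YlOrRd", domain = vacancy, bins = quantile_breaks)
```

With one large outlier like the 30 here, the SD version spends most of its range on values that barely occur, which is roughly what I was seeing in the legend.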

Storefronts_clusterinng_map.Rmd

Reads in the vacancy data as ct_vacant_dataset but then refers to it as leased_not_leased_2022, so changed it to match the Storefronts_cd_map.Rmd naming.
Changed the chunk around row 46 to be tidyverse lingo just to help me understand. I still don't quite understand
In the chunk starting row 55 we're dropping the decimal in order to be able to merge with the storefront data, which only provides the round numbers for the locations. It's BANANAS to me that they provide the data that way: by removing the decimals they are deciding to use old census tract designations, but only for tracts where there has been enough growth to merit them being divided. Frankly it makes no sense that they did that, but given that we're trying to make it work, I've at least condensed our spatial file. Previously we were merging each storefront to all tracts that it could match (ie if the storefront is reported to be in CT 2, we would merge it with both CT 2.01 and CT 2.02); now I've combined the shapes so that there will only be one existing shape labelled 2. We could use the lat/lon to find the correct 2020 tract but there are some missing obs. The spatial difference is less bad than I had thought it might be - places where you see a red line are where I've combined the two (or more) tracts.
I didn't go through the classification code! but would recommend re-running that part with the group_by(boro_ct201) %>% summarize(geometry = st_union(geometry)) addition.
the nested ifelse on row 118 is a perfect fit for a case_when!
I think unsupervised clustering can be really confusing for people who don't have a background in cs/stats - I think understanding what our takeaway from this clustering is will be really important in communicating. It may be useful to create another graphic that shows the differences between these clusters more directly (eg, different vacancy trends by size would be a great scatterplot!)
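
For the row 55 tract issue above, a small base R sketch of why the one-to-many merge inflates rows and what condensing does on the ID side. The tract labels, column names and the floor() rounding are all hypothetical stand-ins; in the Rmd the geometry side is handled by the group_by(boro_ct201) %>% summarize(geometry = st_union(geometry)) step.

```r
# Hypothetical 2020 tract labels and storefront records (illustration only)
tracts <- data.frame(ct2020 = c("2.01", "2.02", "3"), stringsAsFactors = FALSE)
tracts$ct_round <- as.character(floor(as.numeric(tracts$ct2020)))

storefronts <- data.frame(id = 1:3, ct_round = c("2", "2", "3"),
                          stringsAsFactors = FALSE)

# Old approach: every storefront in "2" matches both 2.01 and 2.02
dup <- merge(storefronts, tracts, by = "ct_round")
nrow(dup)  # 5 rows for 3 storefronts

# Condensed approach: one row per rounded tract, so the merge is one-to-one
tracts_condensed <- unique(tracts["ct_round"])
one <- merge(storefronts, tracts_condensed, by = "ct_round")
nrow(one)  # 3 rows
```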
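
On the nested ifelse mentioned above, roughly what I mean, assuming dplyr is loaded. The cluster column and labels are placeholders, not the actual conditions on row 118.

```r
library(dplyr)

# Hypothetical cluster ids; the real conditions live in the Rmd
df <- data.frame(cluster = c(1, 2, 3, NA))

# Nested version: each extra case adds another level of nesting
nested <- ifelse(df$cluster == 1, "low",
                 ifelse(df$cluster == 2, "medium", "high"))

# case_when version: one condition per line, checked top to bottom
flat <- case_when(
  df$cluster == 1 ~ "low",
  df$cluster == 2 ~ "medium",
  df$cluster == 3 ~ "high",
  TRUE ~ NA_character_
)
```

Both give the same labels here; the case_when version is just easier to read and extend.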

will pop through the next few files after a break!

Storefronts_nta_map_21_22.Rmd

Using the same palette as the other map, and with similar data - I think it would make sense to either change the color choices or match the breaks to the CD map.
did some mini reformatting just through the process of understanding what's happening
I'm getting a lot of weird errors again about plotting - maybe related to leaflet/councildown interactions? Had to mess around with the ordering of layers etc to get it to plot at all. Changed some of the inputs to be more councildown styled.
in the row 122 chunk I reinterpreted some of the data.table code in dplyr verbiage to help me understand what it was doing and I got different numbers for glendale + springfield - not sure why.

median_income_vacancy_21_22.R

@fryenycc
census api key is in there!!!! danger, will robinson
censusapi is a new library to me, I use tidycensus but this seems like it has some cool pros to it so good to learn about it!
frankly I don't super trust myself to review this because it's super DT heavy.
In row 41 about 25% of the vacancy data doesn't seem to have a match in the ACS data, presumably because of the weird tract data that we have from the vacancy data. It seems like the data is getting implicitly dropped in merge(sr, med_inc, by = "geoid") - not sure if it would be better to make that explicit in the text somehow or to make the same choice as above of basically merging the tracts on the census side as well. It is going to be dropping things non-randomly - looks like there are a lot more tracts in SI + Queens that will be affected.
Throwing an idea out there - seems like the color is just highlighting the x axis, but could be highlighting the relevance of tracts, since some have very few storefronts. I am usually anti using size since ppl interpret it really poorly but just as an example, could also be color based or just alpha. The story I take away from each of those is pretty different I think! Ignore the changes [e: in the code] for the plot though since it was just me messing around.
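
To illustrate the implicit drop in merge(sr, med_inc, by = "geoid") mentioned above, a base R sketch with toy tables. The geoids and values are invented; only the merge() behaviour is the point.

```r
# Hypothetical stand-ins for the vacancy (sr) and ACS income (med_inc) tables
sr <- data.frame(geoid = c("A", "B", "C", "D"),
                 vacancy = c(0.10, 0.20, 0.05, 0.15))
med_inc <- data.frame(geoid = c("A", "B", "C"),
                      income = c(40000, 65000, 90000))

# Default merge() is an inner join: "D" disappears without any warning
inner <- merge(sr, med_inc, by = "geoid")

# Keeping all vacancy rows makes the unmatched tracts visible as NA
left <- merge(sr, med_inc, by = "geoid", all.x = TRUE)
unmatched <- left[is.na(left$income), ]
share_dropped <- nrow(unmatched) / nrow(sr)
```

Something like unmatched could then be tabulated by borough to check how non-random the drop is.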

[e: Excusing awful formatting bc it's just a test. Each bar here represents all tracts within a certain income bracket (ie the first bar is the average vacancy for all tracts w a median income between 10k-20k). This shows a really different trend than I think was implied by the first chart. I haven't put too much thought into it yet - maybe I'm grouping things in a way that isn't reasonable, but it shows an almost inverse takeaway from the scatterplot format. This feels consistent with the cd map, where the highest vacancy zones are fairly high income: lower manhattan and nw brooklyn.]
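
The bracket grouping described above could look something like this in base R. The data is invented and this is not the code I actually ran for the chart, just a sketch of the idea (unweighted mean of tract vacancy within 10k-wide income brackets).

```r
# Hypothetical tract-level data, for illustration only
tracts <- data.frame(
  income  = c(15000, 18000, 25000, 42000, 47000, 88000),
  vacancy = c(0.08, 0.12, 0.10, 0.06, 0.10, 0.20)
)

# 10k-wide brackets: [10k, 20k), [20k, 30k), ...
tracts$bracket <- cut(tracts$income,
                      breaks = seq(10000, 90000, by = 10000),
                      right = FALSE, dig.lab = 6)

# Unweighted mean vacancy per bracket (empty brackets are not returned)
by_bracket <- aggregate(vacancy ~ bracket, data = tracts, FUN = mean)
```

Because each tract counts equally here regardless of how many storefronts it has, a weighted version could tell a different story.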

sf_rent_explore_21_22.R

references a file well outside the repo? if it's general utils should just go in councildown. ~/utils/unzip_files.R [e: I think this was just for unzip_sf instead of loading councildown!]
I changed the missing_storefront_2.csv reference to a relative path that matches other files
census api token!
loaded in councildown, fixed some stuff in merges but not sure what else this file needs to do
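
For the census api token in both scripts, one common option is reading it from an environment variable instead of hard-coding it. A minimal sketch: CENSUS_KEY is the variable name I believe censusapi looks for, but treat that as an assumption and check the package docs.

```r
# Assumed variable name; set it in ~/.Renviron rather than in the script
census_key <- Sys.getenv("CENSUS_KEY")

if (!nzchar(census_key)) {
  message("CENSUS_KEY is not set; census requests will need a key")
}
```

The key would then be passed to the census call rather than written into the file, and .Renviron stays out of the repo.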

feel free to ping me about anything above - and would love any general thoughts

Thanks, @amd112 for flagging this! I agree with combining the geographies & will send an email to the Open Data folk asking them to fix the census tracts / prevent them from rounding. Combining the shapes is a great idea, I think also worth fixing the original map...


NewYorkCityCouncil / vacant_storefronts

Code Request 2 #2
