emmaebowe commented 2 years ago

Redistricting requirements

In North Carolina, districts must:

be contiguous
have equal populations
be geographically compact
preserve county and municipality boundaries as much as possible

Interpretation of requirements

We enforce a maximum population deviation of 0.5%. We add a county constraint. We add a VRA constraint targeting two majority-minority districts to ensure that the simulated plans are similar to the ratified ones on this metric.

Data Sources

Data for North Carolina comes from https://redistricting.lls.edu/state/north-carolina/

Pre-processing Notes

No manual pre-processing decisions were necessary.

Validation

validation_20221201_1923

SMC: 93,980 sampled plans of 13 districts on 2,692 units
`adapt_k_thresh`=0.985 • `seq_alpha`=0.5
`est_label_mult`=1 • `pop_temper`=0.05

Plan diversity 80% range: 0.63 to 0.86

R-hat values for summary statistics:
   pop_overlap   vap_minority      total_vap       plan_dev      comp_edge    comp_polsby 
      1.000192       1.002821       1.002719       1.002275       1.001458       1.008534 
     pop_white      pop_black       pop_hisp       pop_aian      pop_asian       pop_nhpi 
      1.004743       1.008887       1.011394       1.004934       1.018803       1.004573 
     pop_other        pop_two      vap_white      vap_black       vap_hisp       vap_aian 
      1.013519       1.010712       1.009766       1.006980       1.013302       1.004933 
     vap_asian       vap_nhpi      vap_other        vap_two pre_16_rep_tru pre_16_dem_cli 
      1.021337       1.003438       1.016668       1.026083       1.026572       1.015349 
pre_20_rep_tru pre_20_dem_bid uss_16_rep_bur uss_16_dem_ros uss_20_rep_til uss_20_dem_cun 
      1.024176       1.021850       1.026825       1.014634       1.022281       1.020062 
gov_16_rep_mcc gov_16_dem_coo gov_20_rep_for gov_20_dem_coo atg_16_rep_new atg_16_dem_ste 
      1.023122       1.014265       1.025408       1.017432       1.027132       1.014347 
atg_20_rep_one atg_20_dem_ste sos_16_rep_lap sos_16_dem_mar sos_20_rep_syk sos_20_dem_mar 
      1.022599       1.017930       1.018803       1.014611       1.022034       1.017538 
        adv_16         adv_20         arv_16         arv_20  county_splits          e_dvs 
      1.013712       1.017221       1.025355       1.023205       1.018128       1.023546 
        pr_dem          e_dem          pbias           egap 
      1.002445       1.000607       1.003314       1.001216 

Sampling diagnostics for SMC run 1 of 2 (50,000 samples)
         Eff. samples (%) Acc. rate Log wgt. sd   Max. unique Est. k 
Split 1    34,400 (68.8%)     16.1%        0.45 31,732 (100%)      8 
Split 2    35,406 (70.8%)     19.7%        0.66 28,102 ( 89%)      6 
Split 3    35,322 (70.6%)     26.2%        0.71 28,529 ( 90%)      4 
Split 4    35,321 (70.6%)     24.1%        0.75 28,435 ( 90%)      4 
Split 5    35,109 (70.2%)     28.5%        0.78 28,407 ( 90%)      3 
Split 6    35,501 (71.0%)     35.0%        0.78 28,268 ( 89%)      2 
Split 7    34,887 (69.8%)     32.2%        0.78 28,363 ( 90%)      2 
Split 8    34,698 (69.4%)     29.1%        0.80 28,419 ( 90%)      2 
Split 9    33,244 (66.5%)     25.8%        0.82 28,196 ( 89%)      2 
Split 10   34,283 (68.6%)     11.9%        0.83 27,375 ( 87%)      4 
Split 11   35,413 (70.8%)      9.2%        0.82 26,440 ( 84%)      4 
Split 12   30,565 (61.1%)      4.2%        0.80 24,411 ( 77%)      3 
Resample    6,736 (13.5%)       NA%        1.40 18,776 ( 59%)     NA 

Sampling diagnostics for SMC run 2 of 2 (50,000 samples)
         Eff. samples (%) Acc. rate Log wgt. sd   Max. unique Est. k 
Split 1    34,343 (68.7%)     14.4%        0.45 31,737 (100%)      9 
Split 2    35,292 (70.6%)     19.5%        0.67 28,275 ( 89%)      6 
Split 3    35,476 (71.0%)     26.0%        0.71 28,349 ( 90%)      4 
Split 4    35,464 (70.9%)     30.9%        0.75 28,409 ( 90%)      3 
Split 5    35,359 (70.7%)     22.1%        0.77 28,388 ( 90%)      4 
Split 6    35,059 (70.1%)     20.2%        0.78 28,401 ( 90%)      4 
Split 7    34,651 (69.3%)     23.7%        0.80 28,186 ( 89%)      3 
Split 8    34,567 (69.1%)     29.4%        0.81 28,183 ( 89%)      2 
Split 9    34,180 (68.4%)     18.8%        0.83 27,949 ( 88%)      3 
Split 10   33,820 (67.6%)     15.9%        0.85 27,250 ( 86%)      3 
Split 11   34,525 (69.1%)     12.3%        0.84 26,269 ( 83%)      3 
Split 12   31,402 (62.8%)      3.2%        0.83 24,364 ( 77%)      4 
Resample    7,653 (15.3%)       NA%        1.39 18,954 ( 60%)     NA 

•  Watch out for low effective samples, very low acceptance rates (less than 1%), large std.
devs. of the log weights (more than 3 or so), and low numbers of unique plans. R-hat values for
summary statistics should be between 1 and 1.05.

Checklist

[x] I have followed the instructions
[x] I have updated the tracker
[x] All TODO lines from the template code have been removed
[x] I have merged in the master branch and then recalculated summary statistics
[x] I have run enforce_style() to format my code
[x] The documentation copied above is up-to-date
[x] There are no data files in this pull request
[x] None of the file output paths (for the redist_map and redist_plans objects, and summary statistics) have been edited

@CoryMcCartan @christopherkenny

mzwu commented 2 years ago

Hi @emmaebowe, amazing work! Just a few notes:

We decided in a previous meeting that we would avoid the oversampling-subsetting method to obtain samples. I just made a pull request for the 2020 NC re-run, in which I am generating enough plans for convergence and then thinning down to 5000 plans.
With the hinge strength that you used, only 1,213 plans actually obtained two majority-minority districts (MMDs). I would suggest increasing the hinge strength, or even trying the sharkfin constraint that Cory introduced.
In addition to validation plots, it is also helpful to look at performance plots for BVAP or minority VAP. The code for this is at the bottom of my 2020 NC re-run sim file.
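
For reference, the number of plans reaching two MMDs can be tallied along these lines. This is only a minimal base-R sketch on invented numbers: the data frame `toy`, its columns `draw`, `district`, `vap_minority`, and `total_vap`, the 50% threshold, and the "at least two" rule are all assumptions for illustration, not the code in this PR.

```r
# Toy stand-in for per-district summaries: 3 plans x 3 districts.
# All values are invented for illustration (assumption, not NC data).
toy <- data.frame(
    draw = rep(1:3, each = 3),
    district = rep(1:3, times = 3),
    vap_minority = c(55, 52, 30,  60, 40, 35,  51, 58, 20),
    total_vap = 100
)

# A district counts as majority-minority when minority VAP exceeds
# half of total VAP (assumed threshold).
toy$is_mmd <- toy$vap_minority / toy$total_vap > 0.5

# Number of majority-minority districts in each plan.
n_mmd <- tapply(toy$is_mmd, toy$draw, sum)

# Number of plans with at least two such districts.
n_plans_two_mmd <- sum(n_mmd >= 2)
```

Comparing `n_plans_two_mmd` to the total number of sampled plans gives a rough sense of whether the constraint strength is doing its job.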

Happy to talk about these points in more detail or answer any other questions you may have!

kuriwaki commented 2 years ago

A population deviation of over 13% in the enacted map (the red line in the population deviation histogram) seems too high. It seems like we are correctly loading 2000 Census data, so maybe cd_2010 is not actually the lines drawn for the 2012-2020 cycle? I added some comments on where the mix-up may be happening.
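
As a quick sanity check on that number, a plan's population deviation can be worked out by hand. The sketch below assumes the usual definition (the largest absolute gap from the ideal district population, as a share of the ideal) and uses invented district populations, not NC data; the variable names are hypothetical.

```r
# Invented district populations for a 4-district plan (not NC data).
pop <- c(740000, 760000, 700000, 800000)

# Ideal (target) population: total population split evenly across districts.
target <- sum(pop) / length(pop)

# Deviation of each district from the target, as a share of the target.
dev <- abs(pop - target) / target

# The plan's deviation is the worst district's deviation.
plan_dev <- max(dev)
```

Under a 0.5% tolerance, sampled plans should all have `plan_dev` at or below 0.005, which is why an enacted plan above 0.13 stands out.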

christopherkenny commented 2 years ago

Hey Emma, I'll do a full review later today, but regarding today's discussion: one way to "check" performance is to look at this plot after you run the simulation:

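    library(redist)
    library(ggplot2)

    # Plot each district's Black share of VAP across the simulated plans.
    # Points are blue where Democratic votes exceed Republican votes
    # (ndv > nrv) and red otherwise; the cd_2010 plan is drawn in black.
    redist.plot.distr_qtys(plans, vap_black/total_vap,
        color_thresh = NULL,
        color = ifelse(subset_sampled(plans)$ndv > subset_sampled(plans)$nrv, "#3D77BB", "#B25D4C"),
        size = 0.5, alpha = 0.5) +
        scale_y_continuous("Percent Black by VAP") +
        labs(title = "Approximate Performance") +
        scale_color_manual(values = c(cd_2010 = "black")) +
        theme_bw()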

(Change out vap_black/total_vap for whatever group you are checking)

emmaebowe commented 1 year ago

@christopherkenny

christopherkenny commented 1 year ago

@emmaebowe, it looks like this one still has the extra files.

emmaebowe commented 1 year ago

2010 North Carolina Congressional Districts

Redistricting requirements

In North Carolina, districts must:

be contiguous
have equal populations
be geographically compact
preserve county and municipality boundaries as much as possible

Interpretation of requirements

We enforce a maximum population deviation of 0.5%. We add a county constraint. We add a VRA constraint targeting two majority-minority districts to ensure that the simulated plans are similar to the ratified ones on this metric.

Data Sources

Data for North Carolina comes from the ALARM Project's 2020 Redistricting Data Files.

Pre-processing Notes

No manual pre-processing decisions were necessary.

Simulation Notes

We sample 24,000 districting plans for North Carolina, and thin to 5,000 final plans. No special techniques were needed to produce the sample.
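
One way the thinning step could look is sketched below. It is base R on a toy table with one row per draw; the names `draws`, `chain`, `draw`, and `n_keep`, and the choice to keep an equal number of draws from each run, are assumptions for illustration rather than the code used in this PR.

```r
# Toy index of sampled plans: 2 runs ("chains") of 12,000 draws each.
draws <- data.frame(
    chain = rep(1:2, each = 12000),
    draw = seq_len(24000)
)

# Keep the same number of draws from each run so the final sample is 5,000.
n_keep <- 5000 / length(unique(draws$chain))

# Position of each draw within its own run (1, 2, 3, ...).
pos <- ave(draws$draw, draws$chain, FUN = seq_along)

# Retain the first n_keep draws of every run.
thinned <- draws[pos <= n_keep, ]
```

Here `thinned` holds 2,500 draws from each of the two runs, matching the 5,000 final plans reported in the summary.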

>  summary(plans)
✔ Saving <redist_plans> object ... done
SMC: 5,000 sampled plans of 13 districts on 2,692 units
`adapt_k_thresh`=0.985 • `seq_alpha`=0.5
`est_label_mult`=1 • `pop_temper`=0.05
Plan diversity 80% range: 0.68 to 0.88
R-hat values for summary statistics:
   pop_overlap      total_vap       plan_dev      comp_edge    comp_polsby      pop_white 
      1.020806       1.016922       1.004642       1.016755       1.002336       1.014174 
     pop_black       pop_hisp       pop_aian      pop_asian       pop_nhpi      pop_other 
      1.004754       1.001528       1.002835       1.009669       1.027798       1.025220 
       pop_two      vap_white      vap_black       vap_hisp       vap_aian      vap_asian 
      1.010126       1.011099       1.003383       1.004165       1.006258       1.008749 
      vap_nhpi      vap_other        vap_two pre_16_rep_tru pre_16_dem_cli pre_20_rep_tru 
      1.045441       1.001032       1.000346       1.012643       1.022844       1.006889 
pre_20_dem_bid uss_16_rep_bur uss_16_dem_ros uss_20_rep_til uss_20_dem_cun gov_16_rep_mcc 
      1.018991       1.007392       1.019683       1.004446       1.015104       1.009775 
gov_16_dem_coo gov_20_rep_for gov_20_dem_coo atg_16_rep_new atg_16_dem_ste atg_20_rep_one 
      1.018468       1.010013       1.018432       1.007721       1.019768       1.007299 
atg_20_dem_ste sos_16_rep_lap sos_16_dem_mar sos_20_rep_syk sos_20_dem_mar         adv_16 
      1.018934       1.007604       1.017715       1.004065       1.018679       1.018821 
        adv_20         arv_16         arv_20  county_splits    muni_splits            ndv 
      1.018735       1.008668       1.007069       1.007580       1.017071       1.018352 
           nrv        ndshare          e_dvs         pr_dem          e_dem          pbias 
      1.008054       1.024337       1.024166       1.040864       1.004206       1.008595 
          egap 
      1.004343 

Sampling diagnostics for SMC run 1 of 2 (12,000 samples)
         Eff. samples (%) Acc. rate Log wgt. sd  Max. unique Est. k 
Split 1     8,255 (68.8%)     16.4%        0.45 7,586 (100%)      8 
Split 2     8,426 (70.2%)     19.5%        0.67 6,817 ( 90%)      6 
Split 3     8,505 (70.9%)     26.2%        0.71 6,851 ( 90%)      4 
Split 4     8,464 (70.5%)     30.7%        0.75 6,821 ( 90%)      3 
Split 5     8,432 (70.3%)     17.9%        0.77 6,750 ( 89%)      5 
Split 6     8,243 (68.7%)     16.4%        0.79 6,776 ( 89%)      5 
Split 7     8,453 (70.4%)     18.3%        0.79 6,798 ( 90%)      4 
Split 8     8,441 (70.3%)     21.2%        0.78 6,845 ( 90%)      3 
Split 9     8,489 (70.7%)     18.7%        0.79 6,792 ( 90%)      3 
Split 10    8,277 (69.0%)     21.9%        0.81 6,580 ( 87%)      2 
Split 11    8,780 (73.2%)     16.5%        0.79 6,357 ( 84%)      2 
Split 12    7,540 (62.8%)      4.3%        0.81 5,759 ( 76%)      3 
Resample    1,635 (13.6%)       NA%        1.39 4,491 ( 59%)     NA 

Sampling diagnostics for SMC run 2 of 2 (12,000 samples)
         Eff. samples (%) Acc. rate Log wgt. sd  Max. unique Est. k 
Split 1     8,234 (68.6%)     16.4%        0.45 7,593 (100%)      8 
Split 2     8,535 (71.1%)     19.9%        0.67 6,785 ( 89%)      6 
Split 3     8,475 (70.6%)     15.5%        0.70 6,764 ( 89%)      7 
Split 4     8,476 (70.6%)     19.5%        0.76 6,805 ( 90%)      5 
Split 5     8,422 (70.2%)     18.1%        0.77 6,828 ( 90%)      5 
Split 6     8,313 (69.3%)     20.5%        0.77 6,829 ( 90%)      4 
Split 7     8,402 (70.0%)     18.5%        0.78 6,797 ( 90%)      4 
Split 8     8,363 (69.7%)     16.2%        0.79 6,724 ( 89%)      4 
Split 9     8,209 (68.4%)     18.5%        0.79 6,759 ( 89%)      3 
Split 10    8,220 (68.5%)     15.2%        0.79 6,575 ( 87%)      3 
Split 11    8,499 (70.8%)      7.2%        0.80 6,309 ( 83%)      5 
Split 12    7,525 (62.7%)      4.2%        0.79 5,723 ( 75%)      3 
Resample    2,393 (19.9%)       NA%        1.40 4,749 ( 63%)     NA 

•  Watch out for low effective samples, very low acceptance rates (less than 1%), large std.
devs. of the log weights (more than 3 or so), and low numbers of unique plans. R-hat values
for summary statistics should be between 1 and 1.05.
>

validation_20230314_2155

alarm-redist / fifty-states