HEGSRR / RPl-Spielman-2020

Replication of Spielman et al.'s 2020 study, "Evaluating social vulnerability indicators"
BSD 3-Clause "New" or "Revised" License

Final Reproduction Study Revisions #3

Closed josephholler closed 11 months ago

josephholler commented 12 months ago

Abstract

The Social Vulnerability Index

The Spielman et al. (2020) paper is in turn a reproduction and reanalysis of:

Cutter, S. L., Boruff, B. J., & Shirley, W. L. (2003). Social vulnerability to environmental hazards. Social Science Quarterly, 84(2), 242–261. https://doi.org/10.1111/1540-6237.8402002

Spielman et al. (2020) developed methods to evaluate the internal validity and construct validity of the Cutter, Boruff and Shirley (2003) Social Vulnerability Index (SoVI). First, they reproduced a national SoVI model and validated their SoVI calculation function against an SPSS procedure provided by the original research group (the Hazards Vulnerability Research Institute at the University of South Carolina). The original SoVI uses 42 independent z-score normalized variables from the U.S. Census, reduces the data to factors using Principal Components Analysis (PCA), selects the first XX factors, inverts factors with inverse relationships to social vulnerability, sums the factors together, and calculates a z-score of the result. The reproduced SoVI model differed slightly from the original model due to changes in U.S. Census data, using only 28 variables.
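For intuition, here is a minimal sketch of that pipeline in Python. This is not the original implementation (the SoVI recipe uses a rotated PCA; scikit-learn's unrotated PCA stands in here), the function and variable names are hypothetical, and the number of retained factors and the set of inverted factors are left as parameters rather than derived from a selection rule.

import pandas as pd
from sklearn.decomposition import PCA

def sovi_sketch(demog, n_factors, invert):
    # demog: counties x demographic variables DataFrame
    # n_factors: number of PCA factors to retain
    # invert: list of factor indices inversely related to vulnerability
    z = (demog - demog.mean()) / demog.std(ddof=1)        # z-score normalize each variable
    scores = PCA(n_components=n_factors).fit_transform(z)
    scores[:, invert] *= -1                               # flip the inverse factors
    sovi = scores.sum(axis=1)                             # sum the retained factor scores
    return pd.Series((sovi - sovi.mean()) / sovi.std(ddof=1), index=demog.index)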

Spielman et al. modify the geographic extent of the SoVI calculation by recalculating SoVI for each of ten Federal Emergency Management Agency (FEMA) regions, and again for a single state or cluster of states within each of the ten regions, resulting in 21 total indices. Internal validity is assessed by calculating the Spearman rank correlation coefficient of the SoVI scores for counties in the state model compared to the FEMA region model and the national model. Construct validity is assessed by summing the loadings for each input variable across the PCA factors in each model and calculating each variable's sign (positive/negative) and the rank of the variable's total loading compared to the other variables. These signs and ranks are summarized across all 21 versions of the SoVI model with regard to the number of times the sign differs from the national model and the distributions of ranks.

In this reproduction study, we will attempt to reproduce identical SoVI model outputs by comparing to the outputs in the original GitHub repository, identical correlations between SoVI models as shown in Table 2, and identical reversals and mean and range of ranks as shown in Figure 2.

josephholler commented 12 months ago

Some other revision suggestions:

USA Counties shapefile

USA Counties Cartographic Boundaries

Other things

josephholler commented 11 months ago

@Liam-W-Smith Here are some suggested revisions to the final report:

Title

Abstract

In this reproduction study, we attempt to reproduce identical SoVI model outputs for each of the 21 models in the original study. We will compare these outputs to data files in Spielman et al.'s GitHub repository. We will also attempt to reproduce identical results of the internal consistency analysis (Figure 1 and Table 2) and the construct validity analysis (Figure 2) from Spielman et al.'s paper. We succeed in reproducing identical SoVI model outputs, but find slight discrepancies in our figures and tables.

The code in this Jupyter notebook report is adapted from Spielman et al.'s GitHub repository. The original study states the intended open source permissions in the acknowledgements: "To facilitate advances to current practice and to allow replication of our results, all of the code and data used in this analysis is open source and available at (https://github.com/geoss/sovi-validity). Funding was provided by the US National Science Foundation (Award No. 1333271) and the U.S. Geological Survey Land Change Science Program."

Study design

RPr-H1: Reproduced SoVI model scores for each county are not identical to the original study SoVI model scores for each county for each of the 21 SoVI models.

RPr-H2: Reproduced map visualizations of SoVI model results for California are not identical to the map visualizations shown in Figure 1 of the original study.

RPr-H3: Reproduced direction reversals and min, average, and max SoVI rank values of 28 demographic variables are not identical to the direction reversals and min, average, and max SoVI rank values shown in Figure 2 of the original study.

We answer these questions by working through Spielman et al.'s code line by line in an updated Python coding environment. To improve reproducibility, we reorganize Spielman et al.'s repository into the Template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences (doi:10.17605/OSF.IO/W29MQ) and use one Jupyter notebook for the reproduction report and code. We catalogue barriers to reproducibility and make improvements wherever possible.

Data and variables

2010 Decennial Census metadata:

Data transformations

I love the equivalency function!

A final step of data transformation will be performed at the beginning of the SoVI model analysis. Each demographic variable will be normalized by calculating its z-score.
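As a concrete illustration (demog is a hypothetical counties-by-variables DataFrame):

from scipy import stats

# Standardize each demographic variable to mean 0 and standard deviation 1.
demog_z = demog.apply(stats.zscore, ddof=1)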

Analysis

PCA

Internal consistency analysis

This analysis checks for consistent SoVI rankings of counties in a region of interest (a state or group of small states) across three versions of a SoVI model, each using a different geographic extent for input data. Those extents are: 1) all counties in the country, 2) all counties in a FEMA region, and 3) all counties in a single state or group of small states. The SoVI scores for the counties in the region of interest are selected and ranked. The agreement between the three sets of rankings is calculated using Spearman's rho rank correlation coefficient. If the model is internally consistent, one could expect a nearly perfect positive rank correlation close to 1, implying that counties have similar levels of social vulnerability vis-à-vis one another in the region of interest, regardless of how much extraneous information from other counties in the FEMA region or from the whole United States has been included in the SoVI model.
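To make the comparison concrete, here is a sketch using scipy; the Series names state_sovi, fema_sovi, and national_sovi are hypothetical, each holding SoVI scores indexed by county FIPS code.

from scipy.stats import spearmanr

# Restrict the larger-extent models to the counties in the region of interest,
# then compute Spearman's rho for each pair of geographic extents.
roi = state_sovi.index
rho_state_fema, _ = spearmanr(state_sovi, fema_sovi.loc[roi])
rho_state_us, _ = spearmanr(state_sovi, national_sovi.loc[roi])
rho_fema_us, _ = spearmanr(fema_sovi.loc[roi], national_sovi.loc[roi])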

Theoretical consistency analysis

Results

Rephrase any questions as statements, e.g. about Hawaii and Alaska missing data, after check_it: Our check_it function found potential differences in 34 counties. The following table shows the SoVI scores and ranks for those counties.

These 34 counties are missing data in both the original study and our reproduction study. The counties and county equivalents are all located in Hawaii (FIPS code 15) and Alaska (FIPS code 02). According to Spielman et al.'s code, when they define...

RPr-H1

First, we tested RPr-H1, that reproduced SoVI model scores for each county are not identical to the original study SoVI model scores for each county for each of the 21 SoVI models.

We define a function, check_it, to check the equivalency of the original output files against our reproduced output files.
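Purely to illustrate the idea (this is not the report's actual check_it implementation), such an equivalency check could look like:

import numpy as np
import pandas as pd

def check_equivalence(original_path, reproduced_path, tol=1e-8):
    # Load both output files with county identifiers as the index.
    orig = pd.read_csv(original_path, index_col=0).sort_index()
    repro = pd.read_csv(reproduced_path, index_col=0).sort_index()
    close = np.isclose(orig, repro, atol=tol, equal_nan=True)  # elementwise comparison; NaN matches NaN
    mismatched = ~close.all(axis=1)                            # rows with any difference beyond tol
    return pd.concat({'original': orig[mismatched], 'reproduced': repro[mismatched]}, axis=1)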

RPr-H2 (add this prior to "State FEMA US Rank Correlations")

Next, we tested RPr-H2, that reproduced map visualizations of SoVI model results for California are not identical to the map visualizations shown in Figure 1 of the original study.

RPr-H3 (add this prior to "Table 2")

Finally, we tested RPr-H3, that reproduced direction reversals and min, average, and max SoVI rank values of 28 demographic variables are not identical to the direction reversals and min, average, and max SoVI rank values shown in Figure 2 of the original study.
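A sketch of how that summary could be computed for one model (loadings is a hypothetical variables-by-factors DataFrame of PCA loadings, with signs already adjusted for inverted factors):

import numpy as np

# Sum each variable's loadings across the retained factors,
# then record its sign and its rank among all variables.
total_loading = loadings.sum(axis=1)
signs = np.sign(total_loading)                # +1 or -1 for each variable
ranks = total_loading.rank(ascending=False)   # 1 = largest total loading
# Across all 21 models, count sign flips relative to the national model
# and summarize the distribution of each variable's ranks.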

Discussion

We have rejected RPr-H1, finding that our reproductions of each of the 21 SoVI models were identical to the original results, with the possible exception of a few minor changes in county rank caused by very slightly different calculations of land area and population density. The implication of this finding is that the codified procedures used in this reproduction study can reliably reproduce and replicate the SoVI model. Given our rejection of RPr-H1, we were surprised to have difficulty exactly reproducing the results for RPr-H2 and RPr-H3. Although our results were very similar to Figure 1 and Figure 2, we did find a few discrepancies in each figure, which we can only assume are related to the data visualization process in the original study, which was not automated in code.

josephholler commented 11 months ago

@Liam-W-Smith : This is looking good. The to-do list is getting very small:

doabell commented 11 months ago

My laptop is giving me "file not found" and "missing $" errors, so I don't have a PDF ready.

There are several pandas options (https://pandas.pydata.org/pandas-docs/version/1.2.3/user_guide/options.html) that we can experiment with. Try putting this code in the imports cell:

import pandas as pd

pd.set_option('display.expand_frame_repr', True)   # let wide DataFrame reprs span multiple lines
pd.set_option("display.latex.repr", True)          # emit LaTeX reprs for DataFrames (deprecated, hence the FutureWarning)
pd.set_option("display.latex.longtable", True)     # use the longtable environment so tables can break across pages

These options generate a FutureWarning, so suppress warnings:

import warnings
warnings.filterwarnings('ignore')  # silence the FutureWarning (and all other warnings)

The ACS table is too wide, so it overflows; the solution is to modify the LaTeX code (https://stackoverflow.com/q/60729821), i.e. use nbconvert to export to LaTeX and then run LaTeX manually, or to define a custom template (https://stackoverflow.com/a/52502092) for nbconvert to use.
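For reference, the manual route would be something like the following (assuming the notebook is named notebook.ipynb; the LaTeX engine may vary by template):

jupyter nbconvert --to latex notebook.ipynb   # export to notebook.tex
xelatex notebook.tex                          # edit the .tex as needed, then compile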

Alternatives

Use the IPyPublish package (https://ipypublish.readthedocs.io/en/latest/) and redefine pd:

from ipypublish import nb_setup
pd = nb_setup.setup_pandas(escape_latex=False)  # returns pandas configured for LaTeX-friendly table output

This also generates a FutureWarning and possibly overflowing tables.

Or, install Quarto (https://quarto.org/docs/get-started/) and run:

quarto render notebook.ipynb --to pdf

The PDF output is a bit different, and both tables and code blocks overflow.

josephholler commented 11 months ago

Thanks for identifying these options!



josephholler commented 11 months ago

The reproduction report has been registered! Table formatting is still an open question, but the best practice is likely to use Quarto, as recommended by Yifei. Otherwise, an option is to render to LaTeX and edit further from there. I can see why some workflows recommend simply rendering as HTML first and then generating a PDF from that: it is a simple way to get a reasonably nice report.