HEGSRR / RPl-Spielman-2020

Replication of Spielman et al.'s 2020 study, "Evaluating social vulnerability indicators"
BSD 3-Clause "New" or "Revised" License

Final Reproduction Study Revisions #3

Closed josephholler closed 11 months ago

josephholler commented 12 months ago

Abstract

The Social Vulnerability Index

The Spielman et al. (2020) paper is in turn a reproduction and reanalysis of:

Cutter, S. L., Boruff, B. J., & Shirley, W. L. (2003). Social vulnerability to environmental hazards. Social Science Quarterly, 84(2), 242–261. https://doi.org/10.1111/1540-6237.8402002

Spielman et al. (2020) developed methods to evaluate the internal validity and construct validity of the Cutter, Boruff and Shirley (2003) Social Vulnerability Index (SoVI). First, they reproduced a national SoVI model and validated their SoVI calculation function against an SPSS procedure provided by the original research group (the Hazards Vulnerability Research Institute at the University of South Carolina). The original SoVI uses 42 independent z-score normalized variables from the U.S. Census, reduces the data to factors using Principal Components Analysis (PCA), selects the first XX factors, inverts factors with inverse relationships to social vulnerability, sums the factors together, and calculates a z-score of the result. The reproduced SoVI model differed slightly from the original model due to changes in U.S. Census data, using only 28 variables.
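For intuition, here is a minimal sketch of that pipeline in Python. This is not the original implementation (the SoVI recipe uses a rotated PCA; scikit-learn's unrotated PCA stands in here), the function and variable names are hypothetical, and the number of retained factors and the set of inverted factors are left as parameters rather than derived from a selection rule.

import pandas as pd
from sklearn.decomposition import PCA

def sovi_sketch(demog, n_factors, invert):
    # demog: counties x demographic variables DataFrame
    # n_factors: number of PCA factors to retain
    # invert: list of factor indices inversely related to vulnerability
    z = (demog - demog.mean()) / demog.std(ddof=1)        # z-score normalize each variable
    scores = PCA(n_components=n_factors).fit_transform(z)
    scores[:, invert] *= -1                               # flip the inverse factors
    sovi = scores.sum(axis=1)                             # sum the retained factor scores
    return pd.Series((sovi - sovi.mean()) / sovi.std(ddof=1), index=demog.index)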

Spielman et al. modify the geographic extent of the SoVI calculation by recalculating SoVI for each of ten Federal Emergency Management Agency (FEMA) regions, and again for a single state or cluster of states within each of the ten regions, resulting in 21 total indices. Internal validity is assessed by calculating the Spearman rank correlation coefficient of the SoVI scores for counties in the state model compared to the FEMA region model and the national model. Construct validity is assessed by summing the loadings for each input variable across the PCA factors in each model and calculating each variable's sign (positive/negative) and the rank of the variable's total loading compared to the other variables. These signs and ranks are summarized across all 21 versions of the SoVI model with regard to the number of times the sign differs from the national model and the distributions of ranks.

In this reproduction study, we will attempt to reproduce identical SoVI model outputs by comparing to the outputs in the original GitHub repository, identical correlations between SoVI models as shown in Table 2, and identical reversals and mean and range of ranks as shown in Figure 2.

josephholler commented 12 months ago

Some other revision suggestions:

USA Counties shapefile

USA Counties Cartographic Boundaries

Other things

josephholler commented 11 months ago

@Liam-W-Smith Here are some suggested revisions to the final report:

Title

Abstract

In this reproduction study, we attempt to reproduce identical SoVI model outputs for each of the 21 models in the original study. We will compare these outputs to data files in Spielman et al.'s GitHub repository. We will also attempt to reproduce identical results of the internal consistency analysis (Figure 1 and Table 2) and the construct validity analysis (Figure 2) from Spielman et al.'s paper. We succeed in reproducing identical SoVI model outputs, but find slight discrepancies in our figures and tables.

The code in this Jupyter notebook report is adapted from Spielman et al.'s GitHub repository. The original study states the intended open source permissions in the acknowledgements: "To facilitate advances to current practice and to allow replication of our results, all of the code and data used in this analysis is open source and available at (https://github.com/geoss/sovi-validity). Funding was provided by the US National Science Foundation (Award No. 1333271) and the U.S. Geological Survey Land Change Science Program."

Study design

RPr-H1: Reproduced SoVI model scores for each county are not identical to the original study SoVI model scores for each county for each of the 21 SoVI models.

RPr-H2: Reproduced map visualizations of SoVI model results for California are not identical to the map visualizations shown in Figure 1 of the original study.

RPr-H3: Reproduced direction reversals and min, average, and max SoVI rank values of 28 demographic variables are not identical to the direction reversals and min, average, and max SoVI rank values shown in Figure 2 of the original study.

We answer these questions by working through Spielman et al.'s code line by line in an updated Python coding environment. To improve reproducibility, we reorganize Spielman et al.'s repository into the Template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences (doi:10.17605/OSF.IO/W29MQ) and use one Jupyter notebook for the reproduction report and code. We catalogue barriers to reproducibility and make improvements wherever possible.

Data and variables

2010 Decennial Census metadata:

Data transformations

I love the equivalency function!

A final step of data transformation will be performed at the beginning of the SoVI model analysis. Each demographic variable will be normalized by calculating its z-score.
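As a concrete illustration (demog is a hypothetical counties-by-variables DataFrame):

from scipy import stats

# Standardize each demographic variable to mean 0 and standard deviation 1.
demog_z = demog.apply(stats.zscore, ddof=1)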

Analysis

PCA

Internal consistency analysis

This analysis checks for consistent SoVI rankings of counties in a region of interest (a state or group of small states) across three versions of a SoVI model, each using a different geographic extent for input data. Those extents are: 1) all counties in the country, 2) all counties in a FEMA region, and 3) all counties in a single state or group of small states. The SoVI scores for the counties in the region of interest are selected and ranked. The agreement between the three sets of rankings is calculated using Spearman's rho rank correlation coefficient. If the model is internally consistent, one could expect a nearly perfect positive rank correlation close to 1, implying that counties have similar levels of social vulnerability vis-à-vis one another in the region of interest, regardless of how much extraneous information from other counties in the FEMA region or from the whole United States has been included in the SoVI model.
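To make the comparison concrete, here is a sketch using scipy; the Series names state_sovi, fema_sovi, and national_sovi are hypothetical, each holding SoVI scores indexed by county FIPS code.

from scipy.stats import spearmanr

# Restrict the larger-extent models to the counties in the region of interest,
# then compute Spearman's rho for each pair of geographic extents.
roi = state_sovi.index
rho_state_fema, _ = spearmanr(state_sovi, fema_sovi.loc[roi])
rho_state_us, _ = spearmanr(state_sovi, national_sovi.loc[roi])
rho_fema_us, _ = spearmanr(fema_sovi.loc[roi], national_sovi.loc[roi])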

Theoretical consistency analysis

Results

Rephrase any questions as statements, e.g. about Hawaii and Alaska missing data, after check_it: Our check_it function found potential differences in 34 counties. The following table shows the SoVI scores and ranks for those counties.

These 34 counties are missing data in both the original study and our reproduction study. The counties and county equivalents are all located in Hawaii (FIPS code 15) and Alaska (FIPS code 02). According to Spielman et al.'s code, when they define...

RPr-H1

First, we tested RPr-H1, that reproduced SoVI model scores for each county are not identical to the original study SoVI model scores for each county for each of the 21 SoVI models.

We define a function, check_it, to check the equivalency of the original output files against our reproduced output files.
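Purely to illustrate the idea (this is not the report's actual check_it implementation), such an equivalency check could look like:

import numpy as np
import pandas as pd

def check_equivalence(original_path, reproduced_path, tol=1e-8):
    # Load both output files with county identifiers as the index.
    orig = pd.read_csv(original_path, index_col=0).sort_index()
    repro = pd.read_csv(reproduced_path, index_col=0).sort_index()
    close = np.isclose(orig, repro, atol=tol, equal_nan=True)  # elementwise comparison; NaN matches NaN
    mismatched = ~close.all(axis=1)                            # rows with any difference beyond tol
    return pd.concat({'original': orig[mismatched], 'reproduced': repro[mismatched]}, axis=1)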

RPr-H2 (add this prior to "State FEMA US Rank Correlations")

Next, we tested RPr-H2, that reproduced map visualizations of SoVI model results for California are not identical to the map visualizations shown in Figure 1 of the original study.

RPr-H3 (add this prior to "Table 2")

Finally, we tested RPr-H3, that reproduced direction reversals and min, average, and max SoVI rank values of 28 demographic variables are not identical to the direction reversals and min, average, and max SoVI rank values shown in Figure 2 of the original study.
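A sketch of how that summary could be computed for one model (loadings is a hypothetical variables-by-factors DataFrame of PCA loadings, with signs already adjusted for inverted factors):

import numpy as np

# Sum each variable's loadings across the retained factors,
# then record its sign and its rank among all variables.
total_loading = loadings.sum(axis=1)
signs = np.sign(total_loading)                # +1 or -1 for each variable
ranks = total_loading.rank(ascending=False)   # 1 = largest total loading
# Across all 21 models, count sign flips relative to the national model
# and summarize the distribution of each variable's ranks.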

Discussion

We have rejected RPr-H1, finding that our reproductions of each of the 21 SoVI models were identical to the original results, with the possible exception of a few minor changes in county rank caused by very slightly different calculations of land area and population density. The implication of this finding is that the codified procedures used in this reproduction study can reliably reproduce and replicate the SoVI model. Given our rejection of RPr-H1, we were surprised to have difficulty exactly reproducing the results for RPr-H2 and RPr-H3. Although our results were very similar to Figure 1 and Figure 2, we did find a few discrepancies in each figure, which we can only assume are related to the data visualization process in the original study, which was not automated in code.

josephholler commented 11 months ago

@Liam-W-Smith : This is looking good. The to-do list is getting very small:

doabell commented 11 months ago

My laptop is giving me "file not found" and "missing $" errors, so I don't have a PDF ready.

There are several pandas options (https://pandas.pydata.org/pandas-docs/version/1.2.3/user_guide/options.html) that we can experiment with. Try putting this code in the imports cell:

import pandas as pd

pd.set_option('display.expand_frame_repr', True)   # let wide DataFrame reprs span multiple lines
pd.set_option("display.latex.repr", True)          # emit LaTeX reprs for DataFrames (deprecated, hence the FutureWarning)
pd.set_option("display.latex.longtable", True)     # use the longtable environment so tables can break across pages

These options generate a FutureWarning, so suppress warnings:

import warnings
warnings.filterwarnings('ignore')  # silence the FutureWarning (and all other warnings)

The ACS table is too wide, so it overflows; the solution is to modify the LaTeX code (https://stackoverflow.com/q/60729821), i.e. use nbconvert to export to LaTeX and then run LaTeX manually, or to define a custom template (https://stackoverflow.com/a/52502092) for nbconvert to use.
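For reference, the manual route would be something like the following (assuming the notebook is named notebook.ipynb; the LaTeX engine may vary by template):

jupyter nbconvert --to latex notebook.ipynb   # export to notebook.tex
xelatex notebook.tex                          # edit the .tex as needed, then compile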

Alternatives

Use the IPyPublish package (https://ipypublish.readthedocs.io/en/latest/) and redefine pd:

from ipypublish import nb_setup
pd = nb_setup.setup_pandas(escape_latex=False)  # returns pandas configured for LaTeX-friendly table output

This also generates a FutureWarning and possibly overflowing tables.

Or, install Quarto (https://quarto.org/docs/get-started/) and run:

quarto render notebook.ipynb --to pdf

The PDF output is a bit different, and both tables and code blocks overflow.

josephholler commented 11 months ago

Thanks for identifying these options!



josephholler commented 11 months ago

The reproduction report has been registered! Table formatting is still an open question, but the best practice is likely to use Quarto, as recommended by Yifei. Otherwise, an option is to render to LaTeX and edit further from there. I can see why some workflows recommend simply rendering as HTML first and then generating a PDF from that: it is a simple way to get a reasonably nice report.