josephholler closed this issue 11 months ago.
Some other revision suggestions:
- USA Counties shapefile → USA Counties Cartographic Boundaries

Other things:

Move the notebook into the `procedures` folder and name it something like RPr-Spielman-2020. This will simplify the paths to the `environment` folder and allow them to be consistent with the template. Then you can make a copy of that notebook and rename it RPl-Spielman-2020 for the replication!

@Liam-W-Smith Here are some suggested revisions to the final report:
[x] change to match title in other places:
Reproduction of Spielman et al.'s 2020 Evaluation of the Social Vulnerability Index
In this reproduction study, we attempt to reproduce identical SoVI model outputs for each of the 21 models in the original study. We will compare these outputs to data files in Spielman et al.'s GitHub repository. We will also attempt to reproduce identical results of internal consistency analysis (figure 1 and table 2) and construct validity analysis (figure 2) from Spielman et al.'s paper. We succeed in reproducing identical SoVI model outputs, but find slight discrepancies in our figures and tables.
The code in this Jupyter notebook report is adapted from Spielman et al.'s GitHub repository. The original study states the intended open source permissions in the acknowledgements: "To facilitate advances to current practice and to allow replication of our results, all of the code and data used in this analysis is open source and available at (https://github.com/geoss/sovi-validity). Funding was provided by the US National Science Foundation (Award No. 1333271) and the U.S. Geological Survey Land Change Science Program."
[x] in first sentence, mention using a reproducible research compendium template
[x] possible working hypotheses for reproduction, in place of RPr-Q1:
RPr-H1: Reproduced SoVI model scores for each county are not identical to the original study SoVI model scores for each county for each of the 21 SoVI models.
RPr-H2: Reproduced map visualizations of SoVI model results for California are not identical to the map visualizations shown in figure 1 of the original study.
RPr-H3: Reproduced direction reversals and min, average, and max SoVI rank values of 28 demographic variables are not identical to the direction reversals and min, average, and max SoVI rank values shown in figure 2 of the original study.
We answer these questions by working through Spielman et al.'s code line by line in an updated Python coding environment. To improve reproducibility, we reorganize Spielman's repository into the Template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences (doi:10.17605/OSF.IO/W29MQ) and use one Jupyter notebook for the reproduction report and code. We catalogue barriers to reproducibility and make improvements wherever possible.
2010 Decennial Census metadata:
[x] replace bold "standard metadata" with another bullet above "abstract" for title:
Title
: 2010 Decennial Census
[x] and do same for USA Counties Shapefile
Title
: USA Counties Geographic Shapefile
[x] Actually, you might as well make these their own .md files in data/metadata and write them into the notebook the same way you did for ACS 2012 geographic metadata. Include the variables information in the .md since it's only a couple of variables for each of these additional data sources.
[x] Then add those files to the data_metadata.csv list.
I love the equivalency function!
[x] Can you add a bit of code to step P4 to print the variable names being inverted? This could be embedded in the `if sign == 'neg':` block of code.
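As a sketch of what that logging could look like (the `sign` values and variable names below are hypothetical placeholders, not Spielman et al.'s actual attribute list):

```python
# Hypothetical sketch of the suggested logging. The real notebook iterates over
# its own attribute/sign structure; the dictionary below is illustrative only.
attribute_signs = {'MEDAGE_ACS': 'neg', 'QFEMALE_ACS': 'pos', 'PERCAP_ACS': 'neg'}

inverted = []
for varname, sign in attribute_signs.items():
    if sign == 'neg':
        # Suggested addition: report which variables are being inverted
        print(f"Inverting variable: {varname}")
        inverted.append(varname)
```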
[x] Then, add a two-sentence paragraph:
A final step of data transformation will be performed at the beginning of the SoVI model analysis. Each demographic variable will be normalized by calculating its z-score.
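A minimal sketch of that z-score step, assuming a pandas DataFrame of demographic variables (column names are illustrative):

```python
import pandas as pd

# Illustrative demographic variables; the real notebook uses Census-derived columns
df = pd.DataFrame({'QPOVTY': [10.0, 20.0, 30.0], 'MEDAGE': [35.0, 40.0, 45.0]})

# Normalize each variable to mean 0 and standard deviation 1 (population z-score)
zscores = (df - df.mean()) / df.std(ddof=0)
```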
This analysis checks for consistent SoVI rankings of counties in a region of interest (a state or group of small states) through three versions of a SoVI model, each using a different geographic extent for input data. Those extents are: 1) all counties in the country, 2) all counties in a FEMA region, and 3) all counties in a single state or group of small states. The SoVI scores for the counties in the region of interest are selected and ranked. The agreement between the three sets of rankings is calculated using Spearman's rho rank correlation coefficient. If the model is internally consistent, one would expect a nearly perfect positive rank correlation close to 1, implying that counties have similar levels of social vulnerability vis-à-vis one another in the region of interest, regardless of how much extraneous information from other counties in the FEMA region or from the whole United States has been included in the SoVI model.
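The internal-consistency check can be sketched roughly as follows; the county scores below are fabricated for illustration, and only two of the three extents are compared:

```python
import pandas as pd

# Hypothetical SoVI scores for the same five counties, computed under two
# different geographic extents (national vs. state); values are illustrative.
scores = pd.DataFrame({
    'national': [2.1, -0.5, 0.3, 1.7, -1.2],
    'state':    [1.9, -0.4, 0.5, 1.6, -1.0],
})

# Spearman's rho: rank correlation between the two sets of county scores.
# A value near 1 indicates consistent rankings across extents.
rho = scores['national'].corr(scores['state'], method='spearman')
```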
[x] Similar to the comment above, add one paragraph to explain in plain language what we are doing here.
[x] note since you've saved all the other results down in the results section, you may as well calculate the figure 2 summary but not show it until results, where it shows up again.
After `check_it('US_Sovi_Score.csv')` you can simply state: "We have identically reproduced national SoVI scores for all 3143 counties compared to the original study, but two county ranks are different."

Rephrase any questions as statements, e.g. about the Hawaii and Alaska missing data, after `check_it`:

Our `check_it` function found potential differences in 34 counties.
The following table shows the SoVI scores and ranks for those counties.
These 34 counties are missing data in both the original study and our reproduction study. The counties and county equivalents are all located in Hawaii (FIPS code 15) and Alaska (FIPS code 02). According to Spielman et al.'s code, when they define...
First, we tested RPr-H1, that reproduced SoVI model scores for each county are not identical to the original study SoVI model scores for each county for each of the 21 SoVI models.
We define a function, `check_it`, to check the equivalency of the original output files against our reproduced output files.
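A hedged sketch of what such an equivalency check could look like; the actual `check_it` in the notebook may differ, and the column names below are illustrative:

```python
import pandas as pd

def check_it_sketch(original: pd.DataFrame, reproduced: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where original and reproduced outputs differ.

    An empty result means the outputs are identical. This is a simplified
    stand-in for the notebook's check_it, which reads the CSV files itself.
    """
    return original.compare(reproduced)

# Illustrative data: one SoVI score differs between original and reproduction
orig = pd.DataFrame({'sovi': [1.0, 2.0], 'rank': [1, 2]})
repro = pd.DataFrame({'sovi': [1.0, 2.5], 'rank': [1, 2]})
diffs = check_it_sketch(orig, repro)
```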
Next, we tested RPr-H2, that reproduced map visualizations of SoVI model results for California are not identical to the map visualizations shown in figure 1 of the original study.
Finally, we tested RPr-H3, that reproduced direction reversals and min, average, and max SoVI rank value of 28 demographic variables are not identical to the direction reversals and min, average, and max SoVI rank values shown in figure 2 of the original study.
We have rejected RPr-H1, finding that our reproductions of each of 21 SoVI models were identical to the original results, with the possible exception of a few minor changes in county rank caused by very slightly different calculations of land area and population density. The implication of this finding is that the codified procedures used in this reproduction study can reliably reproduce and replicate the SoVI model. Given our rejection of RPr-H1, we were surprised to have difficulty exactly reproducing RPr-H2 and RPr-H3. Although our results were very similar to figure 1 and figure 2, we did find a few discrepancies in each figure, which we can only assume are related to the data visualization process in the original study, which was not automated in code.
[x] In addition to checking the original study results, a major aim of this reproduction study was to improve its computational reproducibility. With all the necessary data and code... (go on with the rest of the discussion section)
[x] Can I suggest changing "Missing Code" to "Incomplete Code" or "Partial Code" or something similar?
@Liam-W-Smith : This is looking good. The to-do list is getting very small:

- [x] also asking @doabell to look into tidying table output from Jupyter notebooks. Pandas' `style` class may help: https://pandas.pydata.org/docs/user_guide/style.html# In R, I've been using `kable` to address this issue.
My laptop is giving me "file not found" and "missing $" errors, so I don't have a PDF ready.
There are several pandas options that we can experiment with. Try putting this code in the imports cell:
```python
pd.set_option('display.expand_frame_repr', True)
pd.set_option("display.latex.repr", True)
pd.set_option("display.latex.longtable", True)
```

This generates a `FutureWarning`, so disable these:

```python
import warnings
warnings.filterwarnings('ignore')
```
The ACS table is too wide, so it overflows; the solution is to modify the LaTeX code (so `nbconvert` to LaTeX, then run LaTeX manually), or define a custom template for `nbconvert` to use.

Alternatives

Use the IPyPublish package and redefine `pd`:

```python
from ipypublish import nb_setup
pd = nb_setup.setup_pandas(escape_latex = False)
```

This also generates a `FutureWarning` and possibly overflowing tables.

Or, install Quarto and run:

```shell
quarto render notebook.ipynb --to pdf
```

The PDF output is a bit different, and both tables and code blocks overflow.
Thanks for identifying these options!
The reproduction report has been registered! Table formatting is still an open question, but best practice is likely to use Quarto as recommended by Yifei. Otherwise, an option is to render to LaTeX and edit further from there. I can see why some workflows recommend simply rendering as HTML first and then generating a PDF from there; it is a simple way to get a reasonably nice report.
… `environment` folder? Can the `template.ipynb` file also store environment information in the `environment` folder?

Abstract
The Social Vulnerability Index
The Spielman et al (2020) paper is in turn a reproduction and reanalysis of:
Spielman et al. (2020) developed methods to evaluate the internal validity and construct validity of the Cutter, Boruff and Shirley (2003) Social Vulnerability Index (SoVI). First, they reproduce a national SoVI model and validate it against an SPSS procedure provided by the original research group (the Hazards Vulnerability Research Institute at the University of South Carolina). The original SoVI uses 42 independent z-score normalized variables from the U.S. Census, reduces the data to factors using Principal Components Analysis, selects the first XX factors, inverts factors with inverse relationships to social vulnerability, sums the factors together, and calculates a z-score. The reproduced SoVI model was slightly different from the original model due to changes in U.S. Census data, using only 28 variables.
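The pipeline described above can be sketched schematically; the random data, number of retained components, and factor sign choices below are placeholders for illustration, not Spielman et al.'s actual values:

```python
import numpy as np

# Schematic SoVI pipeline on fabricated data: 100 counties x 6 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# 1. z-score normalize each demographic variable
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Principal Components Analysis via SVD of the standardized data
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
n_components = 3                      # placeholder for the retained factors
scores = Z @ Vt[:n_components].T      # county scores on retained components

# 3. invert factors with inverse relationships to vulnerability (signs assumed)
signs = np.array([1, -1, 1])

# 4. sum the (sign-adjusted) factors and z-score the result
sovi = (scores * signs).sum(axis=1)
sovi = (sovi - sovi.mean()) / sovi.std()
```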
Spielman et al. modify the geographic extent of the SoVI calculation by recalculating SoVI for each of ten Federal Emergency Management Agency (FEMA) regions, and again for a single state or cluster of states within each of the ten regions, resulting in 21 total indices. Internal validity is assessed by calculating the Spearman rank correlation coefficient of the SoVI scores for counties in the state model compared to the FEMA region model and national model. Construct validity is assessed by summing the loadings for each input variable across the PCA factors in each model and calculating the variable's sign (positive/negative) and the rank of the variable's total loading compared to the other variables. These signs and ranks are summarized across all 21 versions of the SoVI model with regard to the number of times the sign differs from the national model and the distributions of ranks.
In this reproduction study, we will attempt to reproduce identical SoVI model outputs by comparing to the outputs in the original GitHub repository, identical correlations between SoVI models as shown in Table 2, and identical sign reversals and mean and range of ranks as shown in Figure 2.