`ESEXtractor::`

This is a package for the automated extraction of test statistics, effect sizes and reporting practices from scientific articles. It has specialized functions for downloading and processing articles from the PubMed Central open access subset, along with meta-data and bibliometric information.

Installation

You can install ESExtractor:: from github with:

# install.packages("devtools")
devtools::install_github("fsingletonthorn/EffectSizeScraping")

Usage

`extractTestStats()`

This package’s main function is extractTestStats(). extractTestStats() is designed to extract test statistics, degrees of freedom and p values for reported F tests, Chi Square Tests, t tests, correlational tests, as well as standardised effect sizes reported as Cohen’s d values, eta squared values, odds ratios, hazard ratios, and correlations.

extractTestStats("A cohen's d of .1, F(1, 123) = 12.2, p < .05, t(122) = 2.99, p = .049, χ2(1, N = 320) = 22.31, p < 0.001")

statistic	reported	df1	df2	p	value	n
t	t(122) = 2.99, p = .049	NA	122	p = .049	2.99	NA
chi	X2(1, N = 320) = 22.31, p \< 0.001	NA	1	p \< 0.001	22.31	320
F	F(1, 123) = 12.2, p \< .05	1	123	p \< .05	12.20	NA
d	d of .1, p = .43	NA	NA	p = .43	0.10	NA

If you want to additionally extract the context around each reported result, you can specify “context = TRUE” and the number of characters of context you wish to extract on either side (e.g., “contextSize = 100”, the default is 300 characters).

Example

extractTestStats("Across all 200 sessions, the hit rate was in the predicted direction but not significantly different from chance, 49.1%, t(199) = 1.31, p = .096, d = 0.09. (I now wish I had simply continued to use subliminal exposures.) Nevertheless, stimulus seeking was again positively correlated with psi performance (lower hit rates).", context = TRUE, contextSize = 100)

statistic	reported	df1	df2	p	value	n	context
t	t(199) = 1.31, p = .096	NA	199	p = .096	1.31	NA	ns, the hit rate was in the predicted direction but not significantly different from chance, 49.1%, t(199) = 1.31, p = .096. (I now wish I had simply continued to use subliminal exposures.) Nevertheless, stimulus seeking wa

Specific extractor functions

If you want to extract only one type of statistical test or effect sizes, you can use the specific extractor functions (extractChiSquare() for Chi Square tests, extractTTests() for t tests, extractFTests() for F tests, extractCorrelations() for correlations and correlational tests, and extractES() for odds ratios, hazard ratios, Eta squared values and Cohen’s d values). The functionality to extract context is not available when using the specific extractor functions.

Example

extractTTests("Across all 200 sessions, the hit rate was in the predicted direction but not significantly different from chance, 49.1%, t = 1.31, p = .096, d = 0.09, F(1,232)=12.3, p < .001 (I now wish I had simply continued to use subliminal exposures.) Nevertheless, stimulus seeking was again positively correlated with psi performance (lower hit rates).")

statistic	reported	df1	df2	p	value
t	t = 1.31, p = .096	NA	NA	p = .096	1.31

pullPMC

This package also includes a function to extract text from articles on the PubMed Central Open Access Subset. To use this function, pass a PubMed Central Open Access Subset OAI-PMH service full XML text URL to the pullPMC function (see https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ for information on this service and https://www.ncbi.nlm.nih.gov/pmc/tools/oai/ for information on how to request the full text in XML format). This function returns a list containing “metadata”, any keywords associated with the file, a list of authors, and the full text of the article separated into the sections as labelled in the XML file (or, if any text was not labelled, marked as “unlabeled”).

All files are presented in tidydata format, keyed using the PMCID, the unique identifier used by the PubMed Central database. The metadata returned includes the PMCID, DOI, the journal name, an abbreviated journal name, the title of the record, the record’s issue number, the volume of the journal the article was included in, the print publication date, the electronic publication date, and the call used to request the XML file.

Example


PMC_results <- pullPMC("https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:3659440&metadataPrefix=pmc")

PMC_results$metadata

PMCID	doi	journalID	journalIDAbrev	title	issue	volume	pPub	ePub	call
PMC3659440	10.1155/2013/649875	Rehabilitation Research and Practice	Rehabil Res Pract	Healing Pathways: A Program for Women with Physical Disabilities and Depression	NA	2013	2013-01-01	2013-05-02	https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:3659440&metadataPrefix=pmc

Experimental features

There are several undocumented experimental features. If you want to use these functions please be aware that they have not been extensively tested.

Sample size extractor

Using the package words_to_numbers findN converts all numbers written as words to numerics and looks for phrases that report the sample sizes (e.g., “We included nine hundred and twelve participants” or “12 student participants”, “N = 12”, “participants were one hundred and 12 student volunteers”).

Example

findN("We included nine hundred and twelve participants")

string	N
912 participants	912

findN("N = 12")

string	N
N = 12	12

findN("participants were one hundred and 12 student vulunteers")

string	N
participants were 112	112

CI extractor

checkCIs() checks input text for reported confidence intervals. checkCIs() accepts either a fully written out “confidence interval” as a hit, or “[digits]% CI”, or “CI” followed by just two numbers separated by commas in parentheses “(digit, digit)” or square brackets “[digit, digit]” (ignores white-space). Context is extracted if the argument “context” is set to TRUE, and the number of characters extracted is set using the argument “contextSize”.

checkCIs("the world is round, p < .05, d = 1.1, 95% CI [.9, 1.2], although some people may not accept this fact", context = TRUE, contextSize = 25)

CIs	context	CIBinary
95% CI	round, p \< .05, d = 1.1, 95% CI [.9, 1.2], although some	TRUE

checkCIs("The effect was 95% effective")

CIs	context	CIBinary
NA	NA	FALSE

fsingletonthorn / EffectSizeScraping

readme

`ESEXtractor::`

Installation

Usage

`extractTestStats()`

Example

Specific extractor functions

Example

pullPMC

Example

Experimental features

Sample size extractor

Example

CI extractor