fsingletonthorn / EffectSizeScraping

MIT License
HTML tag removal deletes text contained within < > even when it is not an HTML tag #23

Closed fsingletonthorn closed 5 years ago

fsingletonthorn commented 5 years ago

E.g., for "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:5504157&metadataPrefix=pmc", the section reading "Additionally, to analyze stability at the level of the individual, we intercorrelated all variables of t1 with their counterparts at t2. Correlation coefficients were high and, without exception, statistically significant (p < 0.001). Intercorrelations of character strengths at t1 and t2 ranged from r = 0.56 (authenticity) to r = 0.86 (spirituality), and those of wellbeing aspects from r = 0.32 (autonomy) to r = 0.75 (PWB)." doesn't appear to be pulled.

fsingletonthorn commented 5 years ago

    # Remove HTML tags:
    strings <- lapply(strings, gsub, pattern = "<(.|\n)*?>", replacement = "")

This probably rarely causes issues, but it can when p values are reported with both < and >, e.g. "t(130) = 12.4, p < .05 and t(130) = 0.13, p > .05", which would get stripped to "t(130) = 12.4, p .05", because everything from the first < to the next > is treated as a tag.
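
A minimal sketch of the problem and one possible fix, not the fix that was actually applied in the repository. It compares the pattern from the comment above with a stricter one that only matches when `<` is immediately followed by a letter (optionally preceded by `/`). The names `loose_pattern` and `tag_pattern` and the sample string are hypothetical and used only for illustration.

```r
# Pattern from the comment above: lazily matches anything between "<" and ">".
loose_pattern <- "<(.|\n)*?>"

# Hypothetical stricter pattern: "<", an optional "/", a letter, then any run
# of characters other than "<" or ">", then ">".
tag_pattern <- "</?[A-Za-z][^<>]*>"

# Illustrative input: reported statistics wrapped in a paragraph tag.
strings <- list(
  "<p>t(130) = 12.4, p < .05 and t(130) = 0.13, p > .05</p>"
)

loose  <- lapply(strings, gsub, pattern = loose_pattern, replacement = "")
strict <- lapply(strings, gsub, pattern = tag_pattern, replacement = "")

print(loose[[1]])   # the text between "<" and ">" is lost along with the tags
print(strict[[1]])  # only <p> and </p> are removed; the statistics survive
```

The stricter pattern is still a heuristic. It would remove text such as "a <b and c> d", and it leaves comments (`<!-- ... -->`) and processing instructions (`<?xml ... ?>`) in place, so parsing the record with an XML parser and extracting the text nodes may be the more robust route.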