ScienceCommons / curate_science

Transparency & credibility curation products for all research stakeholders.
https://CurateScience.org
MIT License

web scrape article & T metadata from Collabra, JofCognition, and PSCI #93

eplebel closed 4 years ago

eplebel commented 4 years ago

LINK to v1 of preliminary python web scrapers of Transparency metadata: https://github.com/dominklenda/Web-Scraping-Curate-Science-Platform-

the idea is to scrape as much transparency metadata as possible for all articles published in 3 of the most transparent journals in the world, and dump the article metadata into a Google doc ("CS-scraped-transparent-articles"; one sheet for each journal; metadata fields are in the order i'll be inputting them in the Curate Science article editor form).

from there i'll hand-pick the most transparent articles (relative to impact metrics), and then add them to CS manually (noting which articles have been added, in case we want to add the other ones later). in the future, we'll be able to do this more programmatically once we have the batch import functionality implemented (see prototype).
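
a minimal sketch of the one-sheet-per-journal dump described above, assuming each scraped article is a Python dict with a `journal` key. the column names in `FIELDS` are hypothetical placeholders (the real order would follow the Curate Science article editor form), and CSV text is used here only as a stand-in for a Google doc sheet:

```python
import csv
import io

# Hypothetical column order; the real one would mirror the article editor form.
FIELDS = ["doi", "article_type", "abstract", "keywords", "html_url", "pdf_url"]


def rows_to_csv(rows):
    """Serialize article dicts to CSV text using the fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, extrasaction="ignore", restval=""
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # missing fields become empty cells
    return buf.getvalue()


def dump_by_journal(articles):
    """Group articles by journal and return {journal_name: csv_text}."""
    grouped = {}
    for article in articles:
        grouped.setdefault(article["journal"], []).append(article)
    return {journal: rows_to_csv(rows) for journal, rows in grouped.items()}
```

each value in the returned dict could then be pasted or imported into its own sheet; fields a scraper could not find simply stay blank.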

  1. Collabra: Psychology: 5 volumes (vol. 1 URL), data accessibility statements start at volume 4

    Metadata to extract (from XML and HTML): • DOI, article type (""), abstract text (""), keywords (""), # of views, # of downloads, "Conflict of Interest Declaration" (called "Competing interests" after volume 2), open peer review URL ("Peer review comments URL"), article HTML URL, article PDF URL, "Acknowledgments" statement (which sometimes contains URLs to data or funding statements for articles before volume 4), "Data accessibility statements" (extract all URLs), "Funding Informations" (starting at volume 4), and "Authors Contributions" (starting at volume 4).

  2. Journal of Cognition: 2 volumes; data accessibility statements start sometime in volume 1

    Metadata to extract (from XML and HTML): • DOI, article type (""), abstract text (""), keywords (""), # of views, # of downloads, "Competing interests", "Materials" (extract all URLs to study materials; only for some Volume 1 articles), article HTML URL, article PDF URL, "Acknowledgments" statement (which sometimes contains URLs to data or funding statements), "Data accessibility statements" (extract all URLs), "Funding Information" (notice no "s" in 'Information'), "Authors Contributions"(?), and "Additional File" (extract all URLs to supplementary materials).

  3. Psychological Science: all articles since badges started in 2014 that have an "Open Practices" statement, i.e., starting at Volume 25 Issue 5, May 2014 (though only 1 article in that issue; 0 in the next issue; 6 articles in Issue 7)

    Metadata to extract (from PDF and HTML): • DOI, article type (don't think this is available), abstract text, "Keywords", # of downloads (from an article's "metrics" subpage; example), "Declaration of Conflicting Interests", article HTML URL (only if open access), article PDF URL (sci-hub URL), "Open Practices" statement (extract all URLs and save in open.content.URLs field), "Funding", and "Authors Contributions".
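
the "extract all URLs" steps in the lists above could be sketched roughly as follows, using only the standard library. this assumes each statement appears in the article HTML as a heading (`h1` to `h6`) followed by its text, which is a guess about the journal pages and not something verified here; `SectionGrabber` and `statement_urls` are hypothetical names:

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Simple pattern; it stops at whitespace, quotes, and closing brackets,
# so URLs that legitimately contain ")" would be truncated.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


class SectionGrabber(HTMLParser):
    """Collect text and link targets under the heading containing `title`."""

    def __init__(self, title):
        super().__init__()
        self.title = title.lower()
        self.in_heading = False
        self.heading_text = []
        self.capturing = False
        self.text = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # Any new heading ends the section being captured.
            self.in_heading = True
            self.heading_text = []
            self.capturing = False
        elif tag == "a" and self.capturing:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self.in_heading = False
            heading = "".join(self.heading_text).lower()
            self.capturing = self.title in heading

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text.append(data)
        elif self.capturing:
            self.text.append(data)


def statement_urls(html, heading):
    """Return (statement_text, unique_urls) for the section named `heading`."""
    grabber = SectionGrabber(heading)
    grabber.feed(html)
    grabber.close()
    text = "".join(grabber.text)
    found = grabber.hrefs + [u.rstrip(".,;:") for u in URL_RE.findall(text)]
    urls = []
    for url in found:
        if url and url not in urls:
            urls.append(url)
    return text.strip(), urls
```

calling `statement_urls(page_html, "Data accessibility")` would return the statement text plus both linked and plain-text URLs, with duplicates dropped; the same call with "Open Practices" or "Acknowledgments" would cover the other fields, provided the page layout matches the assumption above.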

for all journals, you only need to extract article title and year into the Google doc, because our DOI LOOKUP crossref functionality already automatically retrieves article title, authors, year, journal name, and citations (Web of Science).
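
for reference, a rough sketch of what a DOI lookup against the public Crossref REST API can return. this is not the Curate Science implementation, just an illustration under the assumption that the lookup queries `https://api.crossref.org/works/<DOI>`; `crossref_url`, `parse_work`, and `lookup` are hypothetical helper names:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def crossref_url(doi):
    """Build the Crossref works URL for a DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def parse_work(message):
    """Pull title, authors, year, and journal out of a Crossref 'message'."""
    title = (message.get("title") or [""])[0]
    journal = (message.get("container-title") or [""])[0]
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    year = date_parts[0][0] if date_parts[0] else None
    authors = [
        " ".join(p for p in (a.get("given"), a.get("family")) if p)
        for a in message.get("author", [])
    ]
    return {"title": title, "authors": authors, "year": year, "journal": journal}


def lookup(doi):
    """Fetch and parse one record (needs network access)."""
    with urlopen(crossref_url(doi), timeout=30) as resp:
        return parse_work(json.load(resp)["message"])
```

note that Crossref reports its own citation counts, so Web of Science citations would have to come from a separate source.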

eplebel commented 4 years ago

great job on this impressive task @dominklenda!