bgcarlisle closed this 1 year ago
Ideally, if you could pull out doi and pmid, that would be superb. See https://aact.ctti-clinicaltrials.org/data_dictionary under Table = Study_References. This also has reference_type, which is useful (i.e., background, result), but I'm not sure where that's encoded on the web interface.
I'm looking through the HTML on the historical version and the reference_type doesn't seem to be anywhere :(
Do you have an example handy of an NCT number with more than one reference?
Here's one! https://clinicaltrials.gov/ct2/show/NCT04315480
This has both "Publications:" and "Publications automatically indexed to this study by ClinicalTrials.gov Identifier (NCT Number):" and it would be good to have both, but I'm not seeing the latter on the history page which is odd.
Odd about reference_type, but true, it's not on the main page either.
Okay here's a working version https://github.com/bgcarlisle/cthist/commit/6ca4f16d50b73a16b54a5f46fadee5002ad9fb1e
It downloads a JSON-encoded table from the references section with the doi and pmid extracted into separate columns
Thanks! This is for ctgov right now and the publication links from drks are in #10.
Should I test now or after both registries?
Feel free to test now, still working on the DRKS.
If you do the following:
library(tidyverse)
library(jsonlite)
library(cthist)
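clinicaltrials_gov_download("NCT02110043") %>%
  select(references) %>%
  slice_tail() %>%  # keep the last row of the download
  pull() %>%
  fromJSON() %>%  # references are stored as a JSON-encoded table
  tibble() %>%
  filter(label == "Citations:") %>%
  slice_tail() %>%  # keep the last citation row
  select(doi) %>%
  pull()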
You should get:
[1] "10.1016/j.jneumeth.2014.08.020"
(This update is in the dev-trackvalue branch only right now, not in main yet)
Awesome! I'm not sure how to work with branches of packages and still have some other things to finish so will hold off :)
Yeah sure!
If you get curious, I'd do:
remove.packages("cthist")
Then clone the repo, then check out the dev-trackvalue branch
Then devtools::install() from there
FYI for checking out the branch, I managed with: remotes::install_github("bgcarlisle/cthist@dev-trackvalue")
I'm running the scrape with the dev branch for all of trackvalue (n=168) and it's running particularly slowly, much more slowly than previous times. Possibly because of Charité internet, possibly because of this error, which appears each time:
Error downloading version: NCT02169115 version 3
Here's the original error message:
Error in curl::curl_fetch_memory(url, handle = handle): Error in the HTTP2 framing layer
Trying again ...
Recovered from error successfully
Not sure if this is an issue, but flagging in case
Yeah, that's polite's fault
It seems to use an outdated curl package, which makes it terrible to use
Perhaps there will be an update someday
One initial thought. I see there is a disparity with "empty" references for drks vs. ctgov.
DRKS DRKS00012795 "empty" are an empty list (?)
CTGOV NCT02509962 "empty" is "blank"ish json
Any reason for this discrepancy? Otherwise, as much parity as possible would be super to make combined post-processing easier. E.g., could those "empty" references simply be NA?
Also, I'm thinking that reference_type might sometimes be encoded in square brackets.
For example, for NCT02509962, the result is preceded by "[Study Results]" in history and displays under "Publications of Results:" on the main page:
Whereas, for NCT02110043, the result is not preceded by any bracketed category in history and displays under "Publications" on the main page:
The AACT data (after I download and tidy a bit) for NCT02509962 (AACT data downloaded because the new Study Results uploaded) and NCT02110043 looks like this:
NCT00003636 has all 3 types of results:
And the history looks like this:
And the AACT data (after I download and tidy a bit) looks like this:
In AACT, the 3 possible reference_type values are: result, background, derived.
For parity with your DRKS format, as well as the AACT format, it would be great to have the reference_type pulled out and "matching".
Can you tell me the NCT number for the "empty" references in clinicaltrials.gov from two posts ago?
Made the edit!
(I'm going for lunch now, but will address these after)
I changed the DRKS download such that it returns NA for references where there's no data:
https://github.com/bgcarlisle/cthist/commit/4148be6ee9b6e5eec0edb63418458892689f3f63
It looks like the References section on ct dot gov includes the headings "Links" and "Available IPD/Information" even when they're empty, and so these are captured by clinicaltrials_gov_download()
https://clinicaltrials.gov/ct2/history/NCT02509962?V_1=View#StudyPageTop
So, I changed the function to only add rows to the references where there's content, and to return NA in the case that there are no references
https://github.com/bgcarlisle/cthist/commit/2fdde8718694645318bf6ccebad5e69f5e04f4d6
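With NA returned for missing references in both registries, combined post-processing can branch on a single check. A minimal sketch, assuming the references column holds either NA or a JSON-encoded table; parse_references() is a hypothetical helper, not part of cthist, and the JSON string below is invented for illustration:

```r
library(jsonlite)

# Hypothetical helper (not part of cthist): parse one value of the
# references column, treating NA (no references) as an empty table.
parse_references <- function(x) {
  if (is.na(x)) {
    return(data.frame())
  }
  as.data.frame(fromJSON(x), stringsAsFactors = FALSE)
}

# Invented example input; the column names label and doi follow the usage
# shown earlier in this thread, pmid is assumed.
example_json <- '[{"label":"Citations:","doi":"10.1000/example","pmid":"12345678"}]'

refs <- parse_references(example_json)
empty <- parse_references(NA)
```

Here nrow(empty) is 0, so downstream code can skip trials without references regardless of the registry they came from.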
Okay, now it downloads the reference type for ct dot gov citations, if it's present
https://github.com/bgcarlisle/cthist/commit/ab985275156a3ec9705d733f7cccbd533b9443a5
Hi @bgcarlisle !
This is great, thank you! I'm doing more testing and have 2 thoughts, both on the data cleaning/munging end, so they're not necessary; they're more design choices that could be nice to have.
You can see them with NCT02509962:
First, "links" look to have a "url" and a "description," which could get their own json labels. That would make it easy to separate pub links and other links, such as in NCT01049100
Second, and even more cosmetic: [Study Results] could be removed from content since it's now in type, but that may not align with your design if you're trying to keep content raw.
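For anyone who prefers to do this on the post-processing side, here is a rough sketch of splitting a leading bracketed label off the citation text. It assumes the label, when present, is the first thing in content; split_bracket_label() is a hypothetical helper, not a cthist function, and the citation strings are placeholders:

```r
# Hypothetical helper (not part of cthist): separate an optional leading
# "[...]" label from the rest of a reference string.
split_bracket_label <- function(content) {
  # A "[", one or more characters that are not "]", a "]", then any spaces
  pattern <- "^\\[([^]]+)\\][[:space:]]*"
  has_label <- grepl(pattern, content)
  data.frame(
    # Text inside the brackets, or NA when there is no leading label
    type = ifelse(has_label,
                  sub(paste0(pattern, ".*$"), "\\1", content),
                  NA_character_),
    # Reference text with the leading label removed
    content = sub(pattern, "", content),
    stringsAsFactors = FALSE
  )
}

# Placeholder strings, not real citations
out <- split_bracket_label(c("[Study Results] Placeholder citation A.",
                             "Placeholder citation B."))
```

The raw content stays untouched in the download itself, so this keeps both options open.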
Since it's looking good, I will download for the whole trackvalue dataset and let you know if anything comes up.
Hello! I ran cthist for all of trackvalue and did get some errors.
For DRKS DRKS00003568:
The DRKS download stopped there and the other trials didn't download.
I then removed that id and reran which went through, aside from this strange message (why ctgov for drks?):
For CTgov, I got some warnings, but data seems to have downloaded:
This should fix the problem! https://github.com/bgcarlisle/cthist/commit/b39c0c5c4c9bcf5d6487f10bec3c8250059900b1
As for separating out the URL and the description from the links, do you happen to have one handy with multiple links in it? It's hard to tell how they would delimit that
Add a new data point to be extracted: the "References" section
E.g. https://clinicaltrials.gov/ct2/history/NCT02110043?V_8=View#StudyPageTop
Extract: Külzow N, Kerti L, Witte VA, Kopp U, Breitenstein C, Flöel A. An object location memory paradigm for older adults with and without mild cognitive impairment. J Neurosci Methods. 2014 Nov 30;237:16-25. doi: 10.1016/j.jneumeth.2014.08.020. Epub 2014 Aug 28. PubMed 25176026
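To illustrate the kind of extraction requested here, a minimal sketch in base R that pulls a DOI and a PubMed ID out of a citation string like the one above. This is not how cthist does it; extract_ids() is a hypothetical helper, and the patterns assume the DOI starts with "10." and the PMID follows the word "PubMed":

```r
# Hypothetical helper (not part of cthist): extract a DOI and a PubMed ID
# from a free-text citation using base R regular expressions.
extract_ids <- function(citation) {
  # DOI: "10.", a 4-9 digit registrant code, "/", then non-space characters
  doi_match <- regmatches(
    citation, regexpr("10\\.[0-9]{4,9}/[^[:space:]]+", citation)
  )
  # The DOI is usually followed by a sentence-ending period, so trim it
  doi <- if (length(doi_match) == 1) sub("\\.$", "", doi_match) else NA_character_

  # PMID: the digits that follow the word "PubMed"
  pmid_match <- regmatches(
    citation, regexpr("PubMed[[:space:]]+[0-9]+", citation)
  )
  pmid <- if (length(pmid_match) == 1) {
    sub("^PubMed[[:space:]]+", "", pmid_match)
  } else {
    NA_character_
  }

  list(doi = doi, pmid = pmid)
}

# Citation text from this issue, with the author list omitted for brevity
citation <- paste(
  "An object location memory paradigm for older adults with and without",
  "mild cognitive impairment. J Neurosci Methods. 2014 Nov 30;237:16-25.",
  "doi: 10.1016/j.jneumeth.2014.08.020. Epub 2014 Aug 28. PubMed 25176026"
)
ids <- extract_ids(citation)
# ids$doi  is "10.1016/j.jneumeth.2014.08.020"
# ids$pmid is "25176026"
```

Citations without a DOI or PubMed ID come back as NA, which matches the NA convention discussed for empty references.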