bgcarlisle commented 2 years ago

Add a new data point to be extracted: the "References" section

E.g. https://clinicaltrials.gov/ct2/history/NCT02110043?V_8=View#StudyPageTop

Extract: Külzow N, Kerti L, Witte VA, Kopp U, Breitenstein C, Flöel A. An object location memory paradigm for older adults with and without mild cognitive impairment. J Neurosci Methods. 2014 Nov 30;237:16-25. doi: 10.1016/j.jneumeth.2014.08.020. Epub 2014 Aug 28. PubMed 25176026

maia-sh commented 2 years ago

Ideally, if you could pull out doi and pmid, that would be superb. See https://aact.ctti-clinicaltrials.org/data_dictionary under Table = Study_References. This also has reference_type, which is useful (e.g., background, result), but I'm not sure where that's encoded on the web interface.

bgcarlisle commented 2 years ago

I'm looking through the HTML on the historical version and the reference_type doesn't seem to be anywhere :(

Do you have an example handy of an NCT number with more than one reference?

maia-sh commented 2 years ago

Here's one! https://clinicaltrials.gov/ct2/show/NCT04315480

This has both "Publications:" and "Publications automatically indexed to this study by ClinicalTrials.gov Identifier (NCT Number):" and it would be good to have both, but I'm not seeing the latter on the history page which is odd.

Odd about reference_type, but true, it's not on the main page either.

bgcarlisle commented 2 years ago

Okay here's a working version https://github.com/bgcarlisle/cthist/commit/6ca4f16d50b73a16b54a5f46fadee5002ad9fb1e

It downloads a JSON-encoded table from the references section with the doi and pmid extracted into separate columns
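
As a rough illustration of that kind of extraction (a sketch only, not the code in the commit above), the doi and pmid could be pulled out of a citation string with base R regular expressions. The helper names `extract_first` and `extract_ids` are hypothetical, and the patterns simply assume the "doi: ..." and "PubMed ..." layout seen in the example citation earlier in this thread.

```r
# Hypothetical helper, not part of cthist: return the first match of
# `pattern` in each element of `x`, or NA where there is no match.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

# Hypothetical helper: split the DOI and PubMed ID into separate columns.
# Assumes the citation writes them as "doi: <DOI>" and "PubMed <digits>".
extract_ids <- function(citation) {
  doi <- extract_first(citation, "(?<=doi: )10\\.[0-9]+/\\S+")
  doi <- sub("[.,;]+$", "", doi)  # drop sentence punctuation after the DOI
  pmid <- extract_first(citation, "(?<=PubMed )[0-9]+")
  data.frame(doi = doi, pmid = pmid, stringsAsFactors = FALSE)
}

citation <- paste(
  "J Neurosci Methods. 2014 Nov 30;237:16-25.",
  "doi: 10.1016/j.jneumeth.2014.08.020. Epub 2014 Aug 28. PubMed 25176026"
)
print(extract_ids(citation))
```

Citations without either identifier would come back as NA in the corresponding column.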

maia-sh commented 2 years ago

Thanks! This is for ctgov right now and the publication links from drks are in #10 .

Should I test now or after both registries?

bgcarlisle commented 2 years ago

Feel free to test now, still working on the DRKS.

If you do the following:

library(tidyverse)
library(jsonlite)
library(cthist)

clinicaltrials_gov_download("NCT02110043") %>%
    select(references) %>%
    slice_tail() %>%
    pull() %>%
    fromJSON() %>%
    tibble() %>%
    filter(label == "Citations:") %>%
    slice_tail() %>%
    select(doi) %>%
    pull()

You should get:

[1] "10.1016/j.jneumeth.2014.08.020"

bgcarlisle commented 2 years ago

(This update is in the dev-trackvalue branch only right now, not in main yet)

maia-sh commented 2 years ago

Awesome! I'm not sure how to work with branches of packages and still have some other things to finish so will hold off :)

bgcarlisle commented 2 years ago

Yeah sure!

If you get curious, I'd do:

remove.packages("cthist")

Then clone the repo, then check out the dev-trackvalue branch

Then devtools::install() from there

maia-sh commented 2 years ago

FYI for checking out the branch, I managed with: remotes::install_github("bgcarlisle/cthist@dev-trackvalue")

maia-sh commented 2 years ago

I'm running the scrape with the dev branch for all of trackvalue (n=168) and it's running particularly slowly, much more slowly than previous runs. Possibly because of the Charité internet, possibly because of this recurring error:

Error downloading version: NCT02169115 version 3
Here's the original error message:
Error in curl::curl_fetch_memory(url, handle = handle): Error in the HTTP2 framing layer

Trying again ...
Recovered from error successfully

Not sure if this is an issue, but flagging in case
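
The "Trying again ... Recovered from error successfully" lines suggest a retry-on-error pattern. A minimal sketch of that general pattern in base R follows; it is not cthist's actual code, and `with_retry`, `fetch`, `max_tries` and `wait_seconds` are hypothetical names chosen for illustration.

```r
# Hypothetical retry wrapper. `fetch` is any zero-argument function that
# performs one download attempt and throws an error on failure.
with_retry <- function(fetch, max_tries = 3, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    # Capture the error as a value instead of letting it abort the loop
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds)  # brief pause before trying again
    }
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Usage with a stand-in function that fails once and then succeeds
state <- new.env()
state$calls <- 0
flaky <- function() {
  state$calls <- state$calls + 1
  if (state$calls < 2) stop("Error in the HTTP2 framing layer")
  "downloaded"
}
print(with_retry(flaky, wait_seconds = 0))
```

Each failed attempt costs at least one extra request plus the pause, so frequent transient errors would slow a scrape noticeably even when every download eventually succeeds.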

bgcarlisle commented 2 years ago

Yeah that's polite's fault

It seems to use an outdated curl package, which makes it terrible to use

Perhaps there will be an update someday

maia-sh commented 2 years ago

One initial thought. I see there is a disparity between "empty" references for drks vs. ctgov.

DRKS (DRKS00012795): "empty" references are an empty list (?) [screenshot]

CTGOV (NCT02509962): "empty" references are "blank"-ish JSON [screenshot]

Request

Any reason for this discrepancy? Otherwise, as much parity as possible would be super to make combined post-processing easier. E.g., could those "empty" references simply be NA?

maia-sh commented 2 years ago

Also, I'm thinking that reference_type might be encoded, sometimes, in hard brackets.

For example, for NCT02509962, the result is preceded by "[Study Results]" in history and displays under "Publications of Results:" on the main page: [screenshot]

Whereas, for NCT02110043, the result is not preceded by any bracketed category in history and displays under "Publications" on the main page: [screenshot]

The AACT data (after I download and tidy a bit) for NCT02509962 (AACT data downloaded because the new Study Results were uploaded) and NCT02110043 looks like this: [screenshot]

NCT00003636 has all 3 types of results: [screenshot]

And the history looks like this: [screenshot]

And the AACT data (after I download and tidy a bit) looks like this: [screenshot]

In AACT, the 3 possible reference_type are: result, background, derived.

result: these seem to be preceded by [Study Results] in history
background: these seem to have no preceding [] in history
derived: these don't appear in history, only on the main page. As such, they are not in cthist at the moment, since you're scraping the history pages only.
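
The mapping above could be sketched roughly as follows. This is only an illustration of the observed pattern, not cthist code: `classify_reference` is a hypothetical name, the example citations are made up, and the rule assumes that a leading "[Study Results]" is the only marker present in the history pages.

```r
# Hypothetical classifier for citation text taken from a history page.
# "derived" never appears here because those references are not shown in
# the history pages at all.
classify_reference <- function(content) {
  is_result <- grepl("^[[:space:]]*\\[Study Results\\]", content)
  type <- ifelse(is_result, "result", "background")
  type[is.na(content)] <- NA_character_  # keep missing content missing
  type
}

# Made-up citation strings, for illustration only
examples <- c(
  "[Study Results] Author A, Author B. Example results paper.",
  "Author C. Example background paper."
)
print(classify_reference(examples))
```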

Request

For parity with your DRKS format, as well as the AACT format, it would be great to have the reference_type pulled out and "matching"

bgcarlisle commented 2 years ago

Can you tell me the NCT number for the "empty" references in clinicaltrials.gov from two posts ago?

maia-sh commented 2 years ago

Made the edit!

bgcarlisle commented 2 years ago

(I'm going for lunch now, but will address these after)

bgcarlisle commented 2 years ago

I changed the DRKS download such that it returns NA for references where there's no data:

https://github.com/bgcarlisle/cthist/commit/4148be6ee9b6e5eec0edb63418458892689f3f63

It looks like the References section on ct dot gov includes the headings "Links" and "Available IPD/Information" even when they're empty, and so these are captured by clinicaltrials_gov_download()

https://clinicaltrials.gov/ct2/history/NCT02509962?V_1=View#StudyPageTop

So, I changed the function to only add rows to the references where there's content, and to return NA in the case that there are no references

https://github.com/bgcarlisle/cthist/commit/2fdde8718694645318bf6ccebad5e69f5e04f4d6
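
For readers following along, the idea of that change can be sketched in a few lines of base R. This is an illustration under assumptions, not the code in the commit: `clean_references` is a hypothetical name, and the sketch assumes the references are held in a data frame with `label` and `content` columns.

```r
# Hypothetical clean-up step: keep only reference rows that have content,
# and return NA when nothing is left.
clean_references <- function(refs) {
  has_content <- !is.na(refs$content) & nzchar(trimws(refs$content))
  kept <- refs[has_content, , drop = FALSE]
  if (nrow(kept) == 0) {
    return(NA)
  }
  kept
}

# Headings that are present on the page but empty
only_headings <- data.frame(
  label = c("Links:", "Available IPD/Information:"),
  content = c("", ""),
  stringsAsFactors = FALSE
)
print(clean_references(only_headings))
```

Returning a plain NA in the empty case keeps the column directly comparable with the DRKS output in combined post-processing.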

bgcarlisle commented 2 years ago

Okay, now it downloads the reference type for ct dot gov citations, if it's present

https://github.com/bgcarlisle/cthist/commit/ab985275156a3ec9705d733f7cccbd533b9443a5

maia-sh commented 2 years ago

Hi @bgcarlisle !

This is great, thank you! I'm doing more testing and have 2 thoughts, both on the data cleaning/munging end, so not necessary; they are more design choices and could be nice to have.

You can see them with NCT02509962:

First, "links" look to have a "url" and a "description," which could get their own json labels. That would make it easy to separate pub links and other links, such as in NCT01049100

Second, and even more cosmetic: [Study Results] could be removed from content since it's now in type, but that may not align with your design if you're trying to keep content raw.
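
If the raw content is kept as-is in cthist, that second point is easy to handle downstream. A minimal sketch, with `strip_type_prefix` as a hypothetical helper name and a made-up citation string:

```r
# Hypothetical post-processing helper: drop a leading "[Study Results]"
# tag (and surrounding whitespace) from citation text, leaving other
# strings unchanged.
strip_type_prefix <- function(content) {
  sub("^[[:space:]]*\\[Study Results\\][[:space:]]*", "", content)
}

print(strip_type_prefix("[Study Results] Author A. Example results paper."))
```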

Since it's looking good, I will download for the whole trackvalue dataset and let you know if anything comes up.

maia-sh commented 2 years ago

Hello! I ran cthist for all of trackvalue and did get some errors.

DRKS

For DRKS DRKS00003568:

The DRKS download stopped there and the other trials didn't download.

I then removed that id and reran which went through, aside from this strange message (why ctgov for drks?):

CTgov

For CTgov, I got some warnings, but data seems to have downloaded:

bgcarlisle commented 2 years ago

This should fix the problem! https://github.com/bgcarlisle/cthist/commit/b39c0c5c4c9bcf5d6487f10bec3c8250059900b1

bgcarlisle commented 2 years ago

As for separating out the URL and the description from the links, do you happen to have one handy with multiple links in it? It's hard to tell how they would delimit that

bgcarlisle / cthist

Add publication link history #3
