ctti-clinicaltrials / aact

Improving Public Access to Aggregate Content of ClinicalTrials.gov
http://aact.ctti-clinicaltrials.org
MIT License
130 stars 34 forks source link

How to find keys for groups with variant names? #580

Closed rafaleo closed 6 years ago

rafaleo commented 6 years ago

In this table: result_groups <- dbGetQuery(con, "SELECT * FROM result_groups WHERE nct_id = 'NCT01087788') there are similar names of titles between result types. Is it possible to get some "core" name to join between corresponding tables or find their ids? Additionally, they are not corresponding (by intuition) with "ctgov_group_code" numbers between result types. I know it's taken directly from tables, but is there any way to get id of the drug name valid for the whole study?

For example:

id  drug_core_name
1   CZP 400 mg Q4W
2   CZP 200 mg Q2W
3   Placebo

The table 'result_groups' for 'NCT01087788':


       id      nct_id ctgov_group_code      result_type                                              title
1  3896098 NCT01087788               B4         Baseline                                        Total Title
2  3896099 NCT01087788               B3         Baseline                                     CZP 400 mg Q4W
3  3896100 NCT01087788               B2         Baseline                                     CZP 200 mg Q2W
4  3896101 NCT01087788               B1         Baseline                                            Placebo
5  3896102 NCT01087788               P3 Participant Flow                                     CZP 400 mg Q4W
6  3896103 NCT01087788               P2 Participant Flow                                     CZP 200 mg Q2W
7  3896104 NCT01087788               P1 Participant Flow                                            Placebo
8  3896105 NCT01087788               O4          Outcome CZP 200 mg Q2W and CZP 400 mg Q4W (Randomized Set)
9  3896106 NCT01087788               O3          Outcome                    CZP 400 mg Q4W (Randomized Set)
10 3896107 NCT01087788               O2          Outcome                    CZP 200 mg Q2W (Randomized Set)
11 3896108 NCT01087788               O1          Outcome                           Placebo (Randomized Set)
12 3896109 NCT01087788               O4          Outcome CZP 200 mg Q2W and CZP 400 mg Q4W (Randomized Set)
13 3896110 NCT01087788               O3          Outcome                    CZP 400 mg Q4W (Randomized Set)
14 3896111 NCT01087788               O2          Outcome                    CZP 200 mg Q2W (Randomized Set)
15 3896112 NCT01087788               O1          Outcome                           Placebo (Randomized Set)
16 3896113 NCT01087788               O4          Outcome CZP 200 mg Q2W and CZP 400 mg Q4W (Randomized Set)
17 3896114 NCT01087788               O3          Outcome                    CZP 400 mg Q4W (Randomized Set)
18 3896115 NCT01087788               O2          Outcome                    CZP 200 mg Q2W (Randomized Set)
19 3896116 NCT01087788               O1          Outcome                           Placebo (Randomized Set)
20 3896117 NCT01087788               O3          Outcome                    CZP 400 mg Q4W (Randomized Set)
21 3896118 NCT01087788               O2          Outcome                    CZP 200 mg Q2W (Randomized Set)
22 3896119 NCT01087788               O1          Outcome                           Placebo (Randomized Set)
23 3896120 NCT01087788               O4          Outcome CZP 200 mg Q2W and CZP 400 mg Q4W (Randomized Set)
24 3896121 NCT01087788               O3          Outcome                    CZP 400 mg Q4W (Randomized Set)
25 3896122 NCT01087788               O2          Outcome                    CZP 200 mg Q2W (Randomized Set)
26 3896123 NCT01087788               O1          Outcome                           Placebo (Randomized Set)
27 3896124 NCT01087788               O3          Outcome                    CZP 400 mg Q4W (Randomized Set)
28 3896125 NCT01087788               O2          Outcome                    CZP 200 mg Q2W (Randomized Set)
29 3896126 NCT01087788               O1          Outcome                           Placebo (Randomized Set)
30 3896127 NCT01087788               E3   Reported Event                            All CZP 200 mg + 400 mg
31 3896128 NCT01087788               E2   Reported Event                                 All CZP 400 mg Q4W
32 3896129 NCT01087788               E1   Reported Event                                 All CZP 200 mg Q2W

For now I need to extract names with regex.

tibbs001 commented 6 years ago

Hi rafaleo,

Thanks for your message - appreciate your interest in improving this info.

First wanted to verify that I understand your request. This is my interpretation: You're presenting 2 nice-to-have features:

  1. You'd like us to parse the title field of the result_groups table to provide the info in a more granular format. For example, if the value in the title field is "CZP 200 mg Q2W and CZP 400 mg Q4W (Randomized Set)", you'd like to see it as 2 separate rows:

    CZP 200 mg Q2W CZP 400 mg Q4W

  2. You'd like the ctgov_group_code to consistently identify the same group in the study - essentially linking the different data (baselines, outcomes, events, etc) to the same group of participants in that study. For example the rows for one study with ctgov_group_codes:

    B4 (Baselines) O4 (Outcomes) E4 (Reported Events)

    should all be associated with the same group in that study cuz they're all '4's.

About parsing result_groups.title:

It's a good idea. I agree this would make the info more useful and of course it would be great to centralize this functionality in AACT to make it available to everyone, eliminating the need for individuals (like yourself) to have to do it repeatedly with something like regex.

Want to start by mentioning that one of the AACT db fundamental design policies is to present the data exactly as they appear in ClinicalTrials.gov. There are a few places where we're tempted to try to clean things up and standardize the content, but we've made a point to focus on just presenting info 'as is'. That said, we do exactly what you're suggesting with masking data. For example, study NCT03394729 defines masking as: Triple (Care Provider, Investigator, Outcomes Assessor) and we parse that and provide columns for each masking role separately boolean columns:

 subject_masked
 caregiver_masked
 investigator_masked 
 outcomes_assessor_masked

The thing is, masking data is provided by ClinicalTrials.gov in a predictable, standard format making it easy to parse. The result_groups.title data are free text and could be anything. So while it's tempting to try to parse the content, we can't anticipate all the ways the info is provided and we're concerned we'll misinterpret (and therefore misrepresent) the data.

About having the ctgov_group_code consistently identify the same group of participants across the different data types/tables...

When designing the AACT db, for a long time I insisted it would be designed as you suggest, but reality got in the way. As it turns out, ClinicalTrials.gov just doesn't organize the data that way. There's a note about it tucked away on the AACT website:

_Notice that the integer in the code provided by ClinicalTrials.gov (ctgov_group_code) is often the same for one group across the different result types, but this is not always the case. In the example above, B1, E1 & P1 all represent the 'experimental group', so you're tempted to think that '1' equates to to the 'experimental group' for this study, however for Outcomes, O1 represents the control group. In short, the number in the ctgov_group_code often links the same group across all result types in a study, but for about 25% of studies, this is not the case, so it can't be counted on to indicate this relationship. (We had hoped to use a single row in Result_Groups to uniquely represent a participant group in the study and link all related results data (from the various tables) to that one row, however this was not possible. Therefore, one group will typically be represented multiple times in the ResultGroups table: once for each type of result data.

All to say, I completely understand frustration about that.

Hope this helps. If you have more thoughts about this or additional suggestions, please respond.

Thanks.

rafaleo commented 6 years ago

Thank you for response. I'm not researcher but am writing a script(s) for someone to produce a report in some special format and as far as these names are different between tables it's hard to find a connection between them. I'm not surprised about second answer - it's what I know and was aware of. But according to the first point splitting is maybe a good idea but there are many groups which may be considered as equal but they are typed differently, and often those differences are beyond any general rule of catching them with some intelligent parser.

For example: this https://drive.google.com/open?id=1DCa1JymEiZWHXAdX27RogWyWW_SQqW-t sheet is made of 20 tables joined in some way, but one of the most important key here is in column "randomized.drug". But naming between tables is, respectively in outcomes and events:

"Apremilast (APR)   30 mg" = "Week 24:   Apremilast 30 mg"
"Placebo (PBO)" = "Week 24: Placebo"

It's easy example, sometimes there are plenty of names and there's no obvious way for me to join (pair) them by such names. I'm trying to extract the core meaning of such names and in most cases it works but it'd be much easier to have numeric ids. I suppose you're thinking about it, but I know (from my work on it) I need to consult with an "content expert" in many cases to make it stick.

I'm appreciate this database is up, because a year ago we had only html tables which quite often had bad format or just big variance of content "anchors". It's easier to use it now and less mistakes are produced in very complicated reports.

I suppose I may expect too much. The authors are responsible for the content. Thank you.

rafaleo commented 6 years ago

To be more precise, I ask: is it possible to have binding by id between those tables for drugs (as suggested in 'my_imaginary_id' column)?

> interventions
       id      nct_id intervention_type    name description            my_imaginary_id
1 2634136 NCT00995345              Drug KRP-104      Tablet     1
2 2634137 NCT00995345              Drug Placebo      Tablet      0
> design_groups
       id      nct_id         group_type                    title                                      my_imaginary_id
1 2627969 NCT00995345       Experimental    Dose 1: KRP-104 40 mg       1
2 2627970 NCT00995345       Experimental    Dose 2: KRP-104 80 mg       1
3 2627971 NCT00995345       Experimental   Dose 3: KRP-104 100 mg      1
4 2627972 NCT00995345       Experimental Dose 4: KRP-104 20/120mg    1
5 2627973 NCT00995345 Placebo Comparator                  Placebo            0
> result_groups
        id      nct_id ctgov_group_code      result_type           title                       my_imaginary_id
1  3932612 NCT00995345               O2          Outcome Dose 1: KRP-104       1
2  3932613 NCT00995345               O1          Outcome         Placebo             0
3  3932614 NCT00995345               E5   Reported Event Dose 4: KRP-104     1
4  3932615 NCT00995345               E4   Reported Event Dose 3: KRP-104     1
5  3932616 NCT00995345               E3   Reported Event Dose 2: KRP-104     1
6  3932617 NCT00995345               E2   Reported Event Dose 1: KRP-104     1
7  3932618 NCT00995345               E1   Reported Event         Placebo           0

I understand if it's not possible without authoritative consultations.

tibbs001 commented 6 years ago

I see - and agree completely. It would be great to standardize the drug info and use a proper key to uniquely identify that data element. It would require input from people with expertise in drug identification.

When I looked at the google sheet you sent, it made me think: wouldn't it be nice if this well-structured dataset that you created could be available & query-able to the public so all the work you put into it could benefit others. So I wanted to let you know about an organization (Clinwiki) that is developing a system to do something like that. They have a website that allows the crowd to improve ClinicalTrial.gov data - currently by reviewing & annotating content, but I think introducing better ways to structure drug info is another potential feature.

Know you're busy with regular work, but if you're interested in contributing to such an effort, I encourage you to reach out to them. Here's more info: https://aact.ctti-clinicaltrials.org/use_cases/2

rafaleo commented 6 years ago

Thank you. I'll take a look in my free time.

On Thu, Feb 1, 2018 at 5:09 PM, Sheri Tibbs notifications@github.com wrote:

I see - and agree completely. It would be great to standardize the drug info and use a proper key to uniquely identify that data element. It would require input from people with expertise in drug identification.

When I looked at the google sheet you sent, it made me think: wouldn't it be nice if this well-structured dataset that you created could be available & query-able to the public so all the work you put into it could benefit others. So I wanted to let you know about an organization (Clinwiki) that is developing a system to do something like that. They have a website that allows the crowd to improve ClinicalTrial.gov data - currently by reviewing & annotating content, but I think introducing better ways to structure drug info is another potential feature.

Know you're busy with regular work, but if you're interested in contributing to such an effort, I encourage you to reach out to them. Here's more info: https://aact.ctti-clinicaltrials.org/use_cases/2

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ctti-clinicaltrials/aact/issues/580#issuecomment-362314652, or mute the thread https://github.com/notifications/unsubscribe-auth/AYq64ll60AeTCYtXR6aynFAaqfscFuKOks5tQeGwgaJpZM4RxOqj .