Closed rafaleo closed 6 years ago
Hi rafaleo,
Thanks for your message - appreciate your interest in improving this info.
First wanted to verify that I understand your request. This is my interpretation: You're presenting 2 nice-to-have features:
You'd like us to parse the title field of the result_groups table to provide the info in a more granular format. For example, if the value in the title field is "CZP 200 mg Q2W and CZP 400 mg Q4W (Randomized Set)", you'd like to see it as 2 separate rows:
CZP 200 mg Q2W CZP 400 mg Q4W
You'd like the ctgov_group_code to consistently identify the same group in the study - essentially linking the different data (baselines, outcomes, events, etc) to the same group of participants in that study. For example the rows for one study with ctgov_group_codes:
B4 (Baselines) O4 (Outcomes) E4 (Reported Events)
should all be associated with the same group in that study cuz they're all '4's.
About parsing result_groups.title:
It's a good idea. I agree this would make the info more useful and of course it would be great to centralize this functionality in AACT to make it available to everyone, eliminating the need for individuals (like yourself) to have to do it repeatedly with something like regex.
Want to start by mentioning that one of the AACT db fundamental design policies is to present the data exactly as they appear in ClinicalTrials.gov. There are a few places where we're tempted to try to clean things up and standardize the content, but we've made a point to focus on just presenting info 'as is'. That said, we do exactly what you're suggesting with masking data. For example, study NCT03394729 defines masking as: Triple (Care Provider, Investigator, Outcomes Assessor) and we parse that and provide columns for each masking role separately boolean columns:
subject_masked
caregiver_masked
investigator_masked
outcomes_assessor_masked
The thing is, masking data is provided by ClinicalTrials.gov in a predictable, standard format making it easy to parse. The result_groups.title data are free text and could be anything. So while it's tempting to try to parse the content, we can't anticipate all the ways the info is provided and we're concerned we'll misinterpret (and therefore misrepresent) the data.
About having the ctgov_group_code consistently identify the same group of participants across the different data types/tables...
When designing the AACT db, for a long time I insisted it would be designed as you suggest, but reality got in the way. As it turns out, ClinicalTrials.gov just doesn't organize the data that way. There's a note about it tucked away on the AACT website:
_Notice that the integer in the code provided by ClinicalTrials.gov (ctgov_group_code) is often the same for one group across the different result types, but this is not always the case. In the example above, B1, E1 & P1 all represent the 'experimental group', so you're tempted to think that '1' equates to to the 'experimental group' for this study, however for Outcomes, O1 represents the control group. In short, the number in the ctgov_group_code often links the same group across all result types in a study, but for about 25% of studies, this is not the case, so it can't be counted on to indicate this relationship. (We had hoped to use a single row in Result_Groups to uniquely represent a participant group in the study and link all related results data (from the various tables) to that one row, however this was not possible. Therefore, one group will typically be represented multiple times in the ResultGroups table: once for each type of result data.
All to say, I completely understand frustration about that.
Hope this helps. If you have more thoughts about this or additional suggestions, please respond.
Thanks.
Thank you for response. I'm not researcher but am writing a script(s) for someone to produce a report in some special format and as far as these names are different between tables it's hard to find a connection between them. I'm not surprised about second answer - it's what I know and was aware of. But according to the first point splitting is maybe a good idea but there are many groups which may be considered as equal but they are typed differently, and often those differences are beyond any general rule of catching them with some intelligent parser.
For example: this https://drive.google.com/open?id=1DCa1JymEiZWHXAdX27RogWyWW_SQqW-t
sheet is made of 20 tables joined in some way, but one of the most important key here is in column "randomized.drug". But naming between tables is, respectively in outcomes and events:
"Apremilast (APR) 30 mg" = "Week 24: Apremilast 30 mg"
"Placebo (PBO)" = "Week 24: Placebo"
It's easy example, sometimes there are plenty of names and there's no obvious way for me to join (pair) them by such names. I'm trying to extract the core meaning of such names and in most cases it works but it'd be much easier to have numeric ids. I suppose you're thinking about it, but I know (from my work on it) I need to consult with an "content expert" in many cases to make it stick.
I'm appreciate this database is up, because a year ago we had only html tables which quite often had bad format or just big variance of content "anchors". It's easier to use it now and less mistakes are produced in very complicated reports.
I suppose I may expect too much. The authors are responsible for the content. Thank you.
To be more precise, I ask: is it possible to have binding by id between those tables for drugs (as suggested in 'my_imaginary_id' column)?
> interventions
id nct_id intervention_type name description my_imaginary_id
1 2634136 NCT00995345 Drug KRP-104 Tablet 1
2 2634137 NCT00995345 Drug Placebo Tablet 0
> design_groups
id nct_id group_type title my_imaginary_id
1 2627969 NCT00995345 Experimental Dose 1: KRP-104 40 mg 1
2 2627970 NCT00995345 Experimental Dose 2: KRP-104 80 mg 1
3 2627971 NCT00995345 Experimental Dose 3: KRP-104 100 mg 1
4 2627972 NCT00995345 Experimental Dose 4: KRP-104 20/120mg 1
5 2627973 NCT00995345 Placebo Comparator Placebo 0
> result_groups
id nct_id ctgov_group_code result_type title my_imaginary_id
1 3932612 NCT00995345 O2 Outcome Dose 1: KRP-104 1
2 3932613 NCT00995345 O1 Outcome Placebo 0
3 3932614 NCT00995345 E5 Reported Event Dose 4: KRP-104 1
4 3932615 NCT00995345 E4 Reported Event Dose 3: KRP-104 1
5 3932616 NCT00995345 E3 Reported Event Dose 2: KRP-104 1
6 3932617 NCT00995345 E2 Reported Event Dose 1: KRP-104 1
7 3932618 NCT00995345 E1 Reported Event Placebo 0
I understand if it's not possible without authoritative consultations.
I see - and agree completely. It would be great to standardize the drug info and use a proper key to uniquely identify that data element. It would require input from people with expertise in drug identification.
When I looked at the google sheet you sent, it made me think: wouldn't it be nice if this well-structured dataset that you created could be available & query-able to the public so all the work you put into it could benefit others. So I wanted to let you know about an organization (Clinwiki) that is developing a system to do something like that. They have a website that allows the crowd to improve ClinicalTrial.gov data - currently by reviewing & annotating content, but I think introducing better ways to structure drug info is another potential feature.
Know you're busy with regular work, but if you're interested in contributing to such an effort, I encourage you to reach out to them. Here's more info: https://aact.ctti-clinicaltrials.org/use_cases/2
Thank you. I'll take a look in my free time.
On Thu, Feb 1, 2018 at 5:09 PM, Sheri Tibbs notifications@github.com wrote:
I see - and agree completely. It would be great to standardize the drug info and use a proper key to uniquely identify that data element. It would require input from people with expertise in drug identification.
When I looked at the google sheet you sent, it made me think: wouldn't it be nice if this well-structured dataset that you created could be available & query-able to the public so all the work you put into it could benefit others. So I wanted to let you know about an organization (Clinwiki) that is developing a system to do something like that. They have a website that allows the crowd to improve ClinicalTrial.gov data - currently by reviewing & annotating content, but I think introducing better ways to structure drug info is another potential feature.
Know you're busy with regular work, but if you're interested in contributing to such an effort, I encourage you to reach out to them. Here's more info: https://aact.ctti-clinicaltrials.org/use_cases/2
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ctti-clinicaltrials/aact/issues/580#issuecomment-362314652, or mute the thread https://github.com/notifications/unsubscribe-auth/AYq64ll60AeTCYtXR6aynFAaqfscFuKOks5tQeGwgaJpZM4RxOqj .
In this table:
result_groups <- dbGetQuery(con, "SELECT * FROM result_groups WHERE nct_id = 'NCT01087788')
there are similar names of titles between result types. Is it possible to get some "core" name to join between corresponding tables or find their ids? Additionally, they are not corresponding (by intuition) with "ctgov_group_code" numbers between result types. I know it's taken directly from tables, but is there any way to get id of the drug name valid for the whole study?For example:
The table 'result_groups' for 'NCT01087788':
For now I need to extract names with regex.