IHCC-cohorts / data-harmonization

International HundredK+ Cohorts Consortium (IHCC) Data Harmonization
Apache License 2.0

Decide on mapping template #62

Closed jamesaoverton closed 4 years ago

jamesaoverton commented 4 years ago

We are moving to a "self-serve" model, where domain experts for the cohorts will be submitting their own data dictionaries into the system. We're developing a workflow where each submission gets a Google Sheet that the submitter can use to list their terms and the mappings into GECKO. We want to use the same mapping template for all the submissions, going forward. This issue is for nailing down the format of that template.

| Column | Notes |
| --- | --- |
| Term ID | automatically assigned by a script? |
| Label | manual, required |
| Parent Term | manual, optional; should be another term from Label |
| GECKO Category | manual, required; a label from the list of CINECA terms; this is the mapping we care about |
| Suggested Categories | automatic; an ordered list of suggestions built by a script (see #61) |
| Comment | manual, optional; submitter comments on the mapping |
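To make the required/optional notes concrete, here is a minimal sketch of a check that could run over the rows of a submitted sheet. The column names come from the table above; the function name `validate_rows` and the idea of representing rows as dictionaries are assumptions for illustration, not part of the existing tooling.

```python
# Columns marked "manual, required" in the table above.
REQUIRED = ["Label", "GECKO Category"]


def validate_rows(rows):
    """Return a list of (row_number, message) problems found in the template rows.

    `rows` is assumed to be a list of dicts keyed by column name,
    for example as produced by csv.DictReader on an exported sheet.
    """
    problems = []
    # Parent Term must refer to some Label in the same sheet.
    labels = {(row.get("Label") or "").strip() for row in rows}
    labels.discard("")
    for number, row in enumerate(rows, start=2):  # row 1 is the header
        for column in REQUIRED:
            if not (row.get(column) or "").strip():
                problems.append((number, f"missing required value for '{column}'"))
        parent = (row.get("Parent Term") or "").strip()
        if parent and parent not in labels:
            problems.append((number, f"Parent Term '{parent}' is not a Label in this sheet"))
    return problems
```

A check like this would only be meaningful after the submitter has picked a GECKO Category; before that step, only the Label and Parent Term checks apply.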

It might be nice for the automated system to provide notes/details/instructions, but I'm not sure where that would fit if we want to provide an ordered list of suggestions. I suppose we could have a separate suggestion sheet.

beckyjackson commented 4 years ago

These columns look good to me. I can't think of anything that's missing.

I like the idea of automatically assigning the ID via script, but the user will have to specify their prefix. That shouldn't be an issue, right?

I guess the workflow would be that the user enters in:

- Label

and optionally:

- Parent Term
- Comment

Then we run the script, which assigns IDs to each row and also fills out the "Suggested Categories" column. We use COGS to update the sheet. Finally, the user reviews the suggested categories and picks one of them (or a different term) for the "GECKO Category" column.
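The ID-assignment step could look roughly like the sketch below. The `PREFIX:0000001` ID format, the zero-padding width, and the function name `assign_ids` are assumptions for illustration; the thread only establishes that IDs are assigned by a script using a submitter-supplied prefix. Filling "Suggested Categories" (see #61) and syncing the sheet with COGS are left out.

```python
def assign_ids(rows, prefix, width=7):
    """Fill in a missing "Term ID" for each row, keeping any existing IDs.

    `rows` is assumed to be a list of dicts keyed by column name.
    The ID shape `<prefix>:<zero-padded number>` is a hypothetical convention.
    """
    # Collect numbers already in use so reruns never reassign or collide.
    used = set()
    for row in rows:
        term_id = (row.get("Term ID") or "").strip()
        if term_id.startswith(prefix + ":"):
            local = term_id.split(":", 1)[1]
            if local.isdigit():
                used.add(int(local))

    next_number = 1
    for row in rows:
        if (row.get("Term ID") or "").strip():
            continue  # keep IDs assigned on a previous run
        while next_number in used:
            next_number += 1
        row["Term ID"] = f"{prefix}:{next_number:0{width}d}"
        used.add(next_number)
    return rows
```

Keeping existing IDs stable matters here because the script is expected to run more than once on the same sheet as submitters add terms.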

Does that sound correct?

jamesaoverton commented 4 years ago

Yes, sounds good. We’ll have to ask for a cohort name and prefix.

beckyjackson commented 4 years ago

Do we want to include a "Definition" column? Many cohorts have explanations of their terms, but they may not be "definitions", per se.

jamesaoverton commented 4 years ago

@mcourtot What do you think about including a Definition column in the template?

My opinion is that it would be nice if it was filled out, but it does mean more work for submitters, and maybe that reduces the likelihood that they'll start and/or finish.

jamesaoverton commented 4 years ago

We're going ahead with a Definition column. This is mostly implemented now.

mcourtot commented 4 years ago

Can we have required and optional columns? I agree with your assessment re definition above. That said, definitions could be used for text mining, since labels are sometimes just collections of acronyms.

jamesaoverton commented 4 years ago

Yes, some columns will be optional, and we'll have instructions to make this clear.