We would like to be able to generate a vocabulary based on a Scan Report, so that it can be uploaded to Athena. This is an initial proposal, and we welcome comments on it.
Version 0.1:
Local code that generates vocabulary "tables", in the same form as those you download from Athena. Initially this focuses mostly on the Concepts table.
Can we reliably detect a question? For example, "Has smoked in the last year? Yes | No" is a sample data point in the column header that would need to be parsed to "Smoked in the last year", with options of "yes" | "no".
This data is captured in an EDC, so it could take potentially any format or content.
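As a starting point for the question-detection discussion, here is a minimal sketch of how such a header could be split into a name and its options. The function name `parse_question`, the `ParsedQuestion` type, and the list of leading auxiliary verbs are all hypothetical and chosen only for illustration. It assumes the question ends with `?` and the options are separated by `|`, which will not hold for every EDC export.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedQuestion(NamedTuple):
    name: str           # question text with the leading auxiliary verb removed
    options: List[str]  # lower-cased answer options


# Hypothetical heuristic: strip a leading auxiliary verb such as "Has".
_LEADING_AUX = re.compile(
    r"^(has|have|had|is|are|was|were|do|does|did)\s+", re.IGNORECASE
)


def parse_question(header: str) -> Optional[ParsedQuestion]:
    """Return the parsed question, or None if the header has no '?'."""
    if "?" not in header:
        return None
    question, _, tail = header.partition("?")
    question = _LEADING_AUX.sub("", question.strip())
    if not question:
        return None
    name = question[0].upper() + question[1:]
    options = [opt.strip().lower() for opt in tail.split("|") if opt.strip()]
    return ParsedQuestion(name, options)


result = parse_question("Has smoked in the last year? Yes | No")
print(result.name, result.options)
```

Headers that do not match (for example a plain column name such as `age`) return `None`, so they could be passed through unchanged and flagged for manual review.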
Version 1:
A small form that enables a user to export vocabularies from a Scan Report.
The form asks for the vocabulary_id, which effectively names the vocabulary; the vocabulary is then generated and returned to the user.
Carrot will build the vocabulary in a similar way to how it currently exports the mapping_rules to .csv.
This all depends on mapping rules existing.
This approach leaves the user to complete the concept_name column of the template.
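A rough sketch of what the Version 1 export could produce is shown below. It assumes the template follows the column layout of the OMOP CONCEPT table, and that the source values have already been collected from the mapping rules as a list of strings. The function name `build_concept_csv` and its parameters are hypothetical, not part of Carrot. Athena downloads are tab-delimited, so the delimiter is left configurable.

```python
import csv
import io
from typing import Iterable

# Column layout of the OMOP CONCEPT table (assumed to be the template).
CONCEPT_COLUMNS = [
    "concept_id", "concept_name", "domain_id", "vocabulary_id",
    "concept_class_id", "standard_concept", "concept_code",
    "valid_start_date", "valid_end_date", "invalid_reason",
]


def build_concept_csv(
    vocabulary_id: str, concept_codes: Iterable[str], delimiter: str = ","
) -> str:
    """Build a Concept template with one row per distinct source value.

    concept_name (and every other unknown column) is left blank for the
    user to complete.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=CONCEPT_COLUMNS, delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    seen = set()
    for code in concept_codes:
        code = str(code).strip()
        if not code or code in seen:
            continue  # skip blanks and duplicates
        seen.add(code)
        # Columns not supplied here are written as empty strings.
        writer.writerow({"vocabulary_id": vocabulary_id, "concept_code": code})
    return buffer.getvalue()


print(build_concept_csv("MY_VOCAB", ["Y", "N", "Y"]))
```

Passing `delimiter="\t"` would give a tab-separated file closer to what Athena itself provides.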
Version 2:
As above.
The user supplies a source_dictionary in the form. It should include at least 2 columns that map each concept_code to its concept_name, which is effectively a description of the vocabulary term.
Carrot uses this to populate the concept_name column of the template.
We can either define a template for this form, or allow a user to upload a spreadsheet, and let them select which columns contain the code/name.
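The second option (an uploaded spreadsheet with user-selected columns) could be sketched roughly as follows. It assumes both the template and the dictionary have already been read into lists of dicts, for example with `csv.DictReader`. The function name `fill_concept_names` and the `code_column` / `name_column` parameters are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def fill_concept_names(
    concept_rows: List[Row],
    dictionary_rows: List[Row],
    code_column: str,
    name_column: str,
) -> Tuple[List[Row], List[str]]:
    """Populate concept_name from a user-supplied source dictionary.

    Returns the filled rows plus the concept codes that had no match,
    so they can be reported back to the user for QA.
    """
    # Build a code -> name lookup from the columns the user selected.
    lookup = {
        str(row[code_column]).strip(): str(row[name_column]).strip()
        for row in dictionary_rows
        if row.get(code_column) not in (None, "")
    }
    filled: List[Row] = []
    missing: List[str] = []
    for row in concept_rows:
        code = row["concept_code"].strip()
        name = lookup.get(code, "")
        if not name:
            missing.append(code)
        filled.append({**row, "concept_name": name})
    return filled, missing


template = [
    {"concept_code": "Y", "concept_name": ""},
    {"concept_code": "N", "concept_name": ""},
    {"concept_code": "U", "concept_name": ""},
]
dictionary = [
    {"code": "Y", "label": "Yes"},
    {"code": "N", "label": "No"},
]
filled, missing = fill_concept_names(template, dictionary, "code", "label")
print(filled, missing)
```

Returning the unmatched codes separately keeps the export usable even when the dictionary is incomplete, and gives the user a concrete list to check.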
Caveats
We will need to make clear to the user that this export has limitations and depends on:
The minimum cell count (truncation) that White Rabbit applied to the Scan Report. Data might have been lost, so the vocabulary cannot be guaranteed to be complete.
The need for QA checking of the export
Tasks
[ ] Backend - export vocabulary service (I anticipate this being an Azure function)
[ ] Frontend - a form to support exporting the vocabulary.
Acceptance Criteria
[ ] A way for the user to export vocabulary information from a Scan Report.