National-COVID-Cohort-Collaborative / Data-Ingestion-and-Harmonization

Data Ingestion and Harmonization

Support string type for the NLP datasets ids #91

Closed stephanieshong closed 1 year ago

stephanieshong commented 1 year ago

Some sites use a string data type for note ids and note_nlp ids. When we convert the string type to the long data type, those ids become null and we lose all the data from the site. We need to support the string type at the local schema level and do the conversion to the long type during the domain mapping step. During the mapping step, use the string ids to build the new N3C ids for the NOTE and NOTE_NLP domains.
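To illustrate the failure mode, here is a minimal pure-Python sketch of a lenient string-to-long cast, where any value that cannot be parsed comes back as null. The helper `cast_to_long` and the sample ids are hypothetical and only mimic the behaviour described above; this is not the pipeline's code.

```python
from typing import Optional


def cast_to_long(value: Optional[str]) -> Optional[int]:
    """Mimic a lenient string-to-long cast: unparseable input becomes None."""
    if value is None:
        return None
    try:
        parsed = int(value.strip())
    except ValueError:
        return None
    # Values outside the signed 64-bit range cannot be stored as a long either.
    if not -(2**63) <= parsed < 2**63:
        return None
    return parsed


# Hypothetical ids as a site might submit them.
site_note_ids = ["1001", "NOTE-2023-0001", "a3f9c2e1", None]
converted = [cast_to_long(v) for v in site_note_ids]

# Count ids that were present in the source but became null after the cast.
lost = sum(
    1 for src, out in zip(site_note_ids, converted)
    if src is not None and out is None
)
print(converted, lost)
```

Only the purely numeric id survives the cast. Once the id is null, the row can no longer be keyed or joined, which is why the string value has to be kept until the mapping step.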

Note: if a site is not submitting new NOTE datasets, be sure to use the cached datasets.

stephanieshong commented 1 year ago

Allow the string type in local_schema.py:

    "note": {
        "NOTE_ID": T.StringType(),
        "PERSON_ID": T.StringType(),
        "NOTE_DATE": T.DateType(),
        "NOTE_DATETIME": T.TimestampType(),
        "NOTE_TYPE_CONCEPT_ID": T.IntegerType(),
        "NOTE_CLASS_CONCEPT_ID": T.IntegerType(),
        "NOTE_TITLE": T.StringType(),
        "NOTE_TEXT": T.StringType(),
        "ENCODING_CONCEPT_ID": T.IntegerType(),
        "LANGUAGE_CONCEPT_ID": T.IntegerType(),
        "PROVIDER_ID": T.IntegerType(),
        "VISIT_OCCURRENCE_ID": T.LongType(),
        "VISIT_DETAIL_ID": T.LongType(),
        "NOTE_SOURCE_VALUE": T.StringType(),
    },
    "note_nlp": {
        "NOTE_NLP_ID": T.StringType(),
        "NOTE_ID": T.StringType(),
        "SECTION_CONCEPT_ID": T.IntegerType(),
        "SNIPPET": T.StringType(),
        "OFFSET": T.StringType(),
        "LEXICAL_VARIANT": T.StringType(),
        "NOTE_NLP_CONCEPT_ID": T.IntegerType(),
        "NOTE_NLP_SOURCE_CONCEPT_ID": T.IntegerType(),
        "NLP_SYSTEM": T.StringType(),
        "NLP_DATE": T.DateType(),
        "NLP_DATETIME": T.TimestampType(),
        "TERM_EXISTS": T.BooleanType(),
        "TERM_TEMPORAL": T.StringType(),
        "TERM_MODIFIERS": T.StringType(),
    },

Then update the step 4 id generation code in step04_domain_mapping/note.sql and step04_domain_mapping/note_nlp.sql.
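As a rough illustration of what building long ids from string ids can look like, here is a pure-Python sketch that hashes the site's string id into a value that fits a signed 64-bit long. The function `n3c_long_id`, the `SITE` value, the key format and the choice of a truncated MD5 digest are all assumptions made for this example; the actual logic lives in the SQL files named above and may differ.

```python
import hashlib


def n3c_long_id(site_id: str, domain: str, source_id: str) -> int:
    """Derive a deterministic long-sized id from a site's string id (hypothetical scheme)."""
    key = "|".join((site_id, domain, source_id))
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # 15 hex digits = 60 bits, which always fits in a signed 64-bit long.
    return int(digest[:15], 16)


SITE = "site_123"  # hypothetical site identifier

# Hypothetical source rows that keep their string ids from the local schema.
notes = [{"NOTE_ID": "NOTE-A1"}, {"NOTE_ID": "NOTE-B2"}]
note_nlp = [
    {"NOTE_NLP_ID": "NLP-0001", "NOTE_ID": "NOTE-A1"},
    {"NOTE_NLP_ID": "NLP-0002", "NOTE_ID": "NOTE-B2"},
]

mapped_notes = [
    {"note_id": n3c_long_id(SITE, "NOTE", n["NOTE_ID"]), "site_note_id": n["NOTE_ID"]}
    for n in notes
]

# NOTE_NLP rows reuse the NOTE mapping for their foreign key, so joins still line up.
mapped_nlp = [
    {
        "note_nlp_id": n3c_long_id(SITE, "NOTE_NLP", r["NOTE_NLP_ID"]),
        "note_id": n3c_long_id(SITE, "NOTE", r["NOTE_ID"]),
    }
    for r in note_nlp
]
```

The point of the sketch is that NOTE_NLP.NOTE_ID has to go through the same mapping as NOTE.NOTE_ID, otherwise the two domains stop joining. A truncated hash can collide in principle, so a real implementation would need to check for or handle collisions.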

stephanieshong commented 1 year ago

PR : https://unite.nih.gov/workspace/data-integration/code/repos/ri.stemma.main.repository.22bfb25d-c63f-4ada-b14f-e9593979ec8b/pulls/ri.pull-request.main.pull-request.807428ed-4648-41c1-ab7a-70418f080e1f/files