OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel

NOTE NLP table #85

Closed clairblacketer closed 7 years ago

clairblacketer commented 7 years ago

Addition of NOTE NLP table and new fields in NOTE table


Proposal

Relevant table: NOTE

NOTE table additions

Field Required Type Description
note_id Yes integer A unique identifier for each note.
person_id Yes integer A foreign key identifier to the Person about whom the Note was recorded. The demographic details of that Person are stored in the PERSON table.
note_date Yes date The date the note was recorded.
note_datetime No datetime The date and time the note was recorded.
note_type_concept_id Yes integer A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the type, origin or provenance of the Note.
note_class_concept_id Yes integer A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the HL7 LOINC Document Type Vocabulary classification of the note.
note_title No varchar(250) The title of the Note as it appears in the source.
note_text No RDBMS dependent text The content of the Note.
encoding_concept_id Yes integer A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the note character encoding type
language_concept_id Yes integer A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the language of the note
provider_id No integer A foreign key to the Provider in the PROVIDER table who took the Note.
note_source_value No varchar(50) The source value associated with the origin of the note
visit_occurrence_id No integer Foreign key to the Visit in the VISIT_OCCURRENCE table when the Note was taken.
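
As a rough illustration of how the field list above could translate into DDL, here is a minimal sketch using Python's built-in sqlite3 module. SQLite and its type names are assumptions made for illustration only (the actual CDM DDLs are RDBMS-specific), and the sample row uses placeholder values.

```python
import sqlite3

# Sketch only: SQLite stands in for the target RDBMS, so the type names
# (INTEGER, TEXT) are an assumption, not the official CDM DDL.
NOTE_DDL = """
CREATE TABLE note (
    note_id               INTEGER NOT NULL PRIMARY KEY,
    person_id             INTEGER NOT NULL,
    note_date             TEXT    NOT NULL,  -- date
    note_datetime         TEXT,              -- datetime
    note_type_concept_id  INTEGER NOT NULL,
    note_class_concept_id INTEGER NOT NULL,
    note_title            TEXT,              -- varchar(250)
    note_text             TEXT,              -- RDBMS dependent text
    encoding_concept_id   INTEGER NOT NULL,
    language_concept_id   INTEGER NOT NULL,
    provider_id           INTEGER,
    note_source_value     TEXT,              -- varchar(50)
    visit_occurrence_id   INTEGER
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(NOTE_DDL)

# Hypothetical row: every concept id here is a placeholder (0), not a real concept.
conn.execute(
    "INSERT INTO note (note_id, person_id, note_date, note_type_concept_id,"
    " note_class_concept_id, encoding_concept_id, language_concept_id)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, 42, "2016-03-30", 0, 0, 0, 0),
)


def missing_required_field_is_rejected() -> bool:
    """Return True if an insert that omits the required person_id fails."""
    try:
        conn.execute(
            "INSERT INTO note (note_id, note_date, note_type_concept_id,"
            " note_class_concept_id, encoding_concept_id, language_concept_id)"
            " VALUES (2, '2016-03-30', 0, 0, 0, 0)"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

Fields marked Required = Yes become NOT NULL columns, so an insert that leaves one out is rejected by the database rather than by application code.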

New Fields

Field Changes

The note_text type depends on the RDBMS, because not all engines support CLOB; in MS SQL Server, for example, this will be VARCHAR(MAX).

Outstanding issues

note_id - convert to BIGINT due to the large table size. Changing identifier fields from INT to BIGINT should be a larger group discussion/decision, as it would significantly affect all existing implementations. We should consider whether to change all the identifier fields or only a subset; CONDITION_OCCURRENCE and PROCEDURE_OCCURRENCE are likely to be even larger tables.
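
To make the INT versus BIGINT trade-off concrete, the sketch below checks which signed integer type an identifier fits in. It assumes INT is 32 bits and BIGINT is 64 bits, which should be verified for the target RDBMS; the helper name is hypothetical.

```python
INT32_MAX = 2**31 - 1   # 2,147,483,647
INT64_MAX = 2**63 - 1


def smallest_integer_type(identifier: int) -> str:
    """Return the smallest signed type (under the assumption above) that holds identifier."""
    if identifier < 0:
        raise ValueError("identifiers are expected to be non-negative")
    if identifier <= INT32_MAX:
        return "INT"
    if identifier <= INT64_MAX:
        return "BIGINT"
    raise OverflowError("identifier does not fit in a 64-bit integer")
```

For example, `smallest_integer_type(2_147_483_647)` returns `"INT"`, while one more row (`2_147_483_648`) already requires `"BIGINT"`.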

NOTE_NLP table

This table will encode all output of NLP on clinical notes. Each row represents a single extracted term from a note.

Field Required Type Description
note_nlp_id Yes Big Integer A unique identifier for each term extracted from a note.
note_id Yes integer A foreign key to the note in the NOTE table that the term was extracted from.
section_concept_id No integer A foreign key to the predefined Concept in the Standardized Vocabularies representing the section of the extracted term.
snippet No varchar(250) A small window of text surrounding the term.
offset No varchar(50) Character offset of the extracted term in the input note.
lexical_variant Yes varchar(250) Raw text extracted from the NLP tool.
note_nlp_concept_id No integer A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the normalized concept for the extracted term. Domain of the term is represented as part of the Concept table.
note_nlp_source_concept_id No integer A foreign key to a Concept that refers to the code in the source vocabulary used by the NLP system.
nlp_system No varchar(250) Name and version of the NLP system that extracted the term. Useful for data provenance.
nlp_date Yes date The date of the note processing. Useful for data provenance.
nlp_datetime No datetime The date and time of the note processing. Useful for data provenance.
term_exists No varchar(1) A summary modifier that signifies presence or absence of the term for a given patient. Useful for quick querying. *
term_temporal No varchar(50) An optional time modifier associated with the extracted term. (for now “past” or “present” only). Standardize it later.
term_modifiers No varchar(2000) A compact description of all the modifiers of the specific term extracted by the NLP system. (e.g. “son has rash” → “negated=no,subject=family,certainty=undef,conditional=false,general=false”).
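
Since term_modifiers packs several modifiers into one string, consumers need a parsing step. The sketch below is one possible way to do that, assuming the comma-separated key=value layout shown in the example above and that neither keys nor values contain "," or "="; the function name is hypothetical and the proposal does not prescribe a parser.

```python
def parse_term_modifiers(term_modifiers):
    """Split a 'key=value,key=value' string into a dict.

    Assumes the layout from the example in the table; real NLP output
    may need stricter parsing or escaping rules.
    """
    if not term_modifiers:
        return {}
    modifiers = {}
    for pair in term_modifiers.split(","):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that have no '='
            modifiers[key.strip()] = value.strip()
    return modifiers
```

Called on the "son has rash" string from the table, this yields a dict in which `negated` is `"no"` and `subject` is `"family"`; a NULL or empty field yields an empty dict.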

Term_exists

Term_exists is defined as a flag that indicates if the patient actually has or had the condition. Any of the following modifiers would make Term_exists false:

A complete lack of modifiers would make Term_exists true.

For the modifiers that are there, they would have to have these values:

Term_temporal

Term_temporal indicates whether a condition is “present” or only in the “past”.

The following would be past:

Term_modifiers

Term_modifiers will concatenate all modifiers for different types of entities (conditions, drugs, labs, etc.) into one string. Lab values will be saved as one of the modifiers. A list of allowable modifiers (e.g., signature for medications) and their possible values will be standardized later.

Mapping of clinical documents to Clinical Document Ontology (CDO) and standard terminology

HL7/LOINC CDO is a standard for consistent naming of documents to support a range of use cases: retrieval, organization, display, and exchange. It guides the creation of LOINC codes for clinical notes. CDO annotates each document with 5 dimensions: Kind of Document, Type of Service, Setting, Subject Matter Domain, and Role.

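Each combination of these 5 dimensions should roll up to a unique LOINC code. For example, Dentistry Hygienist Outpatient Progress note (LOINC code 34127-1) has the following dimensions: Kind of Document = Note, Type of Service = Progress, Setting = Outpatient, Subject Matter Domain = Dentistry, Role = Hygienist.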

Automation of mapping of clinical notes to a standard terminology based on the note title is possible when it is driven by ontology (aka CDO). Mapping to individual LOINC codes which may or may not exist for a particular note type cannot be fully automated. To support mapping of clinical notes to CDO in OMOP CDM, we propose the following approach:

1. Add all LOINC concepts representing 5 CDO dimensions to the Concept table. For example:

Field Record 1 Record 2
concept_id 55443322132 55443322175
concept_name Administrative note Against medical advice note
concept_code LP173418-7 LP173388-2
vocabulary_id LOINC LOINC

2. Represent CDO hierarchy in the Concept_Relationship table using the “Subsumes” – “Is a” relationship pair. For example:

Field Record 1 Record 2
concept_id_1 55443322132 55443322175
concept_id_2 55443322175 55443322132
relationship_id Subsumes Is a

3. Add LOINC document codes to the Concept table (e.g. Dentistry Hygienist Outpatient Progress note, LOINC code 34127-1). For example:

Field Record 1 Record 2
concept_id 193240 193241
concept_name Dentistry Hygienist Outpatient Progress note Consult note
concept_code 34127-1 11488-4
vocabulary_id LOINC LOINC

4. Represent dimensions of each document concept in Concept_Relationship table by its relationships to the respective concepts from CDO. Use the “Member Of” – “Has Member” (new) relationship pair. Using example from the Dentistry Hygienist Outpatient Progress note (LOINC code 34127-1):

concept_id_1 concept_id_2 relationship_id
193240 55443322132 Member Of
55443322132 193240 Has Member
193240 55443322175 Member Of
55443322175 193240 Has Member
193240 55443322166 Member Of
55443322166 193240 Has Member
193240 55443322107 Member Of
55443322107 193240 Has Member
193240 55443322146 Member Of
55443322146 193240 Has Member

Where the concept IDs represent the following concepts:

concept_id Description
193240 Corresponds to LOINC 34127-1, Dentistry Hygienist Outpatient Progress note
55443322132 Corresponds to LOINC LP173418-7, Kind of Document = Note
55443322175 Corresponds to LOINC LP173213-2, Type of Service = Progress
55443322166 Corresponds to LOINC LP173051-6, Setting = Outpatient
55443322107 Corresponds to LOINC LP172934-4, Subject Matter Domain = Dentistry
55443322146 Corresponds to LOINC LP173071-4, Role = Hygienist

Most of the codes will not have all 5 dimensions. Therefore, they may be represented by 2-5 relationship pairs.
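
The relationship rows in step 4 follow a regular pattern, so they can be generated rather than written by hand. The sketch below is one way to do it; the function name is hypothetical, and the concept IDs are the placeholder values used in the tables above.

```python
def relationship_pairs(document_concept_id, dimension_concept_ids):
    """Return (concept_id_1, concept_id_2, relationship_id) rows for one document.

    Each dimension contributes a 'Member Of' row and its reverse 'Has Member' row.
    """
    rows = []
    for dimension_id in dimension_concept_ids:
        rows.append((document_concept_id, dimension_id, "Member Of"))
        rows.append((dimension_id, document_concept_id, "Has Member"))
    return rows


# The five dimensions of the Dentistry Hygienist Outpatient Progress note.
example_rows = relationship_pairs(
    193240,
    [55443322132, 55443322175, 55443322166, 55443322107, 55443322146],
)
```

A document with only two known dimensions would produce two pairs (four rows) from the same function.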

5. If LOINC does not have a code corresponding to a combination of the 5 CDO dimensions encountered in the source, this code will be generated as an OMOP vocabulary code. Its relationships to the CDO dimensions will be represented exactly as those of existing LOINC concepts (as described above). If/when a proper LOINC code for this combination is released, the old code should be deprecated. The transition between the old and new codes should be represented by “Concept replaces” – “Concept replaced by” pairs.

6. Mapping from the source data will be performed to the 2-5 CDO dimensions.

The query below finds the LOINC code for the Dentistry Hygienist Outpatient Progress note (see example above), which has all 5 dimensions:

SELECT concept_id_2
FROM Concept_Relationship
WHERE relationship_id = 'Has Member'
  AND concept_id_1 IN (55443322132, 55443322175, 55443322166, 55443322107, 55443322146)
GROUP BY concept_id_2
HAVING COUNT(*) = 5

If fewer than 5 dimensions are available, set n in the HAVING COUNT(*) = n clause to the number of dimensions available, so that the query returns the record at the intersection of these dimensions:

SELECT concept_id_2
FROM Concept_Relationship
WHERE relationship_id = 'Has Member'
  AND concept_id_1 IN (55443322132, 55443322175, 55443322146)
GROUP BY concept_id_2
HAVING COUNT(*) = 3
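
The intersection lookup can be tried against a small in-memory table. The sketch below uses Python's sqlite3 module as a stand-in for the real database and the placeholder concept IDs from the tables above; the second document (193241, sharing only two dimensions) is a made-up test case. It always applies HAVING COUNT(*) = n so that documents matching only some of the requested dimensions are excluded, and it assumes each dimension/document pair is stored once.

```python
import sqlite3

DOCUMENT = 193240  # placeholder id for LOINC 34127-1, as in the tables above
DIMENSIONS = [55443322132, 55443322175, 55443322166, 55443322107, 55443322146]
OTHER_DOCUMENT = 193241  # hypothetical document sharing only two dimensions

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept_relationship ("
    " concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT)"
)
rows = [(dim, DOCUMENT, "Has Member") for dim in DIMENSIONS]
rows += [(dim, OTHER_DOCUMENT, "Has Member") for dim in DIMENSIONS[:2]]
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", rows)


def documents_matching(dimension_ids):
    """Return document concept ids related to every id in dimension_ids."""
    placeholders = ", ".join("?" for _ in dimension_ids)
    query = (
        "SELECT concept_id_2 FROM concept_relationship"
        " WHERE relationship_id = 'Has Member'"
        f" AND concept_id_1 IN ({placeholders})"
        " GROUP BY concept_id_2"
        " HAVING COUNT(*) = ?"
        " ORDER BY concept_id_2"
    )
    params = list(dimension_ids) + [len(dimension_ids)]
    return [row[0] for row in conn.execute(query, params)]
```

With all five dimensions only 193240 is returned. With just the first two dimensions both documents match, which shows that a partial set of dimensions does not always identify a single concept.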

To identify appropriate dimension while mapping source documents, use the following concept classes:

The proposed approach will ensure that any combination of the 5 CDO dimensions encountered in the source data has a corresponding concept in the vocabulary. It will also support consistent approach to the OMOP CDM/Vocabulary conventions:

A similar mapping approach can be applied to labs.

Use Cases

Example 1 - Left ventricular ejection fraction

Left ventricular ejection fraction is an important indicator of heart health. It is measured during echocardiogram procedures but also during a range of various procedures. The value is frequently reported in clinical reports and has to be extracted using natural language processing.

Name Value
Note_NLP_id 123456
note_id 123446425
section_concept_id <foreign key to "Echocardiogram Report">
snippet ejection fraction was estimated at 60%
lexical_variant ejection fraction
Note_NLP_concept_id <foreign key to "Left Ventricular Ejection Fraction" concept>
NLP_system EchoExtractor_EF(v.2016)
NLP_date 3/30/16
Term_exists TRUE
Value_as_concept_id null
Value_as_number 60.0
Unit_concept_id <foreign key to "percent">
Term_temporal present
Term_modifiers null

Example 2 - eMERGE Phenotypes

Existence of specific report or specific note section

  1. Presence of a Pathology Report [Appendicitis].
  2. Must contain at least two Past Medical History sections and Medication lists (could substitute two non-acute clinic visits or requirement for annual physical) [Hypothyroidism].
  3. At least 1 abdominal CT or colonoscopy [Diverticulosis].
  4. Patients have to have had a colonoscopy [colonPolyp].
  5. Must have at least a problem list and/or note containing non-empty (can say “none”) medication list and past medical history before or immediately after the time of the ECG [QRS].

Term/Concept mentioning in notes or specific sections

  1. Positive result of inflammation and non-inflammation concept (CUI) in post-surgical biopsy report [Appendicitis].
  2. Reported History of Appendicitis [Appendicitis].
  3. Individual’s patient chart includes one or more mentions of ADHD or hyperkinesia [ADHD].
  4. SSTI cases must have the following or similar keywords in the text results of a bacterial culture lab test, such as skin, wound, boil, abscess, but also recognizing that anatomic sites (e.g. foot/hand/leg/buttock, etc.) [caMRSA].
  5. At least one diagnosis code for C. diff and at least one affirmative mention of C. diff infection (unqualified by negation, uncertainty, or historical reference) in progress notes [CDiff].
  6. Retrieve DSM-IV Symptom criteria (Social Interaction/Communication/Behavior, Interests and Activities) terms from notes to confirm Autism [Autism].
  7. Patient has colonoscopy without positive mention of diverticulosis as control [Diverticulosis].
  8. Positive mention of HF in the problem list through either NLP or structured problem list [HeartFailure].
  9. Cases are those that have polyps in any of their colonoscopy or associated pathology reports [colonPolyp].
  10. Notes contain no evidence of heart disease concepts (NLP for notes, Problem Lists at or near ECG time, ignoring Family Medical History and Allergy sections (using section tagger), ICD9 and CPT codes at or near ECG time describing heart disease) before ECG time or within one month following [QRS].

Related terms mentioning in the same line or adjacent lines

  1. Potential cases were identified if they contained at least one term from List 1 (terms identifying an ace-inhibitor, see below) AND List 2 (terms identifying cough, see below) on the same line (e.g., sentence) within the “Allergy section”, “Medication section” or within the entire “Patient summary section” of the EMR [ACEIcough].
  2. At least one non-negated “Disorder related terms” mention and “Anatomical site related terms” mention either in the SAME or adjacent sentences in a ‘section of interest’ [VTE].

Numeric values with/without temporal constraints

  1. Exclude all patients with an Ejection Fraction (EF or LVEF) <35% within 1 year before or after meeting the CASE 1 definition [ResHTN].
  2. Have evidence from a carotid imaging study of >50% carotid artery stenosis (at least unilaterally) [CAAD].
  3. Classify the type of HF using the numeric EF results (use the lowest EF recorded in the time window) [HeartFailure].
  4. In defining “Normal” ECG, QRSd between 65-120ms, ECG designated as “NORMAL”, Heart Rate between 50-100, ECG Impression must not contain evidence of heart disease concepts [QRS].

schuemie commented 7 years ago

Too late now I guess, but me and some other folks have requested the following (but not in the right places apparently, just in some e-mail communication):

Please split up note_nlp.term_modifiers into several predefined flags (as already mentioned above):

Others can go into an other_term_modifiers bucket, but these are common to almost all NLP systems, and it makes no sense to have to parse a field you'll be using quite often.

clairblacketer commented 7 years ago

@schuemie Just to be sure I'm clear, would the new table look something like this:

NOTE_NLP

Field Required Type Description
note_nlp_id Yes Big Integer A unique identifier for each term extracted from a note.
note_id Yes integer A foreign key to the note in the NOTE table that the term was extracted from.
section_concept_id No integer A foreign key to the predefined Concept in the Standardized Vocabularies representing the section of the extracted term.
snippet No varchar(250) A small window of text surrounding the term.
offset No varchar(50) Character offset of the extracted term in the input note.
lexical_variant Yes varchar(250) Raw text extracted from the NLP tool.
note_nlp_concept_id No integer A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the normalized concept for the extracted term. Domain of the term is represented as part of the Concept table.
note_nlp_source_concept_id No integer A foreign key to a Concept that refers to the code in the source vocabulary used by the NLP system.
nlp_system No varchar(250) Name and version of the NLP system that extracted the term. Useful for data provenance.
nlp_date Yes date The date of the note processing. Useful for data provenance.
nlp_datetime No datetime The date and time of the note processing. Useful for data provenance.
term_exists No varchar(1) A summary modifier that signifies presence or absence of the term for a given patient. Useful for quick querying. *
term_temporal No varchar(50) An optional time modifier associated with the extracted term. (for now “past” or “present” only). Standardize it later.
term_negated No varchar(50)
term_subject No varchar(50)
term_certainty No varchar(50)
other_term_modifiers No varchar(2000)

cgreich commented 7 years ago

@schuemie: Have you talked to Hua? Because he was running the entire subgroup that came up with the definition.

pbr6cornell commented 7 years ago

I don't have data with freetext, so I don't consider myself a knowledgeable voice, but perhaps someone can clarify a basic point for me: The NOTE_NLP table appears that it'll contain a row for every token from every note. In that case, this table will be very long (probably billions of records). So, just from a performance perspective, this dataset will get really big if it has a lot of required fields. Some of these fields, like NLP_SYSTEM, NLP_DATE, and NLP_DATETIME would be data that could be contained on the NOTE table, and therefore could potentially be considered redundant and could be all Required = No. NOTE_NLP_ID is following the CDM standard of having a unique record id, but I wonder if that's necessary in this case (is there analytical use case the workgroup had in mind?).

hripcsa commented 7 years ago

Hey, guys. Hold on here. The NLP workgroup, under Hua and with this task led by Noemie, spent a year reviewing research NLP systems and commercial NLP systems and also reviewing phenotyping from groups like eMERGE to see how they used NLP, taking the union of all modifiers, looking at their values, and voted to move forward with the planned table. The strategy was that a couple of fields could actually be agreed upon, and the rest was chaos. So those two fields that were universal would be coded (there was disagreement on a third “value” column). The rest would go into term_modifiers. We would use it for one year or so, and then decide what could be reasonably pulled out.

Adding three new columns based on an email after all the work that was put into this doesn’t make sense. If we are adding those three, then I certainly want my value column back (I was the proponent on that one, and let this email be my request). And of course several datetime columns were suggested just this week that could also go in. I remember several other columns that people requested, too.

If we think the Note_NLP table should not go forward, then let’s make that decision at the CDM workgroup level, but it already voted yes on this. Then these proposals would go back to the NLP workgroup, led by NLP experts, and decide what to do with the columns.

More specifically, I don’t see substantial value to a term_certainty column that has no agreement on values. It avoids a parse, yes, but then there is nothing but misunderstanding. It would only be useful to the site that filled it. That’s the kind of thing that the NLP workgroup decided to put into term_modifiers.

George

clairblacketer commented 7 years ago

@schuemie, @cgreich, @pbr6cornell, @hripcsa I did contact Hua, Noemie and Rimma about this issue with these additional columns and I will post their response once I hear back from them.

huaxu7000 commented 7 years ago

As George mentioned, we have gone through extensive discussion with the modifier fields for the NLP table. We have voted and decided on the current version. So the plan is to go forward with the current table and we start implementing it. After we learn more, we can discuss and make changes in the future versions. thanks.

clairblacketer commented 7 years ago

So we will keep the table as proposed with the one modifier field.

schuemie commented 7 years ago

Sorry, didn't mean to upset the decision making process.

I don't agree with the decision (I can make a good case at least for a bit-field term_negated), but I'm not a member of the NLP working group so I will respect its decision.

hripcsa commented 7 years ago

The fact that we should go forward with what we have doesn't mean we should stop discussion. It will help us with the next iteration a year from now.

What is the semantics of term negated and how does it relate to presence and certainty? Presence is an aggregate measure that summarizes what most phenotype measures want to know. Is it actually present? Uncertainty feeds into presence. In many systems uncertainty effectively goes from -1 through 0 to +1. Usually using words instead of numbers, but going from definitely absent to uncertain to definitely present. Are you suggesting that that is better split between uncertainty that goes 0 to 1 and term negated that is effectively a minus sign? I assume we shouldn't have both options.

Real phenotypes I see use presence or its equivalent all the time (presence includes other stuff like not rule out, not conditional, not another subject, not future, etc). I cannot think of any phenotypes that look for negation in clinical notes other than to skip it. I.e., you don't usually call for negation. Usually because negated in one note is not helpful as it depends on all the other notes. That's why it seemed there was time to sort this out.

George

schuemie commented 7 years ago

The context in which I was thinking of using the note_nlp table and negation flag is specifically the construction of features which would subsequently be used in (for example) prediction models. It seems to me that although the most important features would be things that are present, there would be considerable information in things that a doctor (or whoever wrote the note) took the trouble to negate, so I would additionally create negated features. The semantics of negation being a negative statement about something, so "... observed no rash..." or "... rule out pneumonia ..." would be examples of things (rash and pneumonia, resp.) that are negated. Whether those features are informative I don't know, I would have to see if the prediction algorithm selects them into the model. But my hypothesis is that they might be. But with the current structure of note_nlp, where negation isn't standardized, I cannot create these features.
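
The feature construction described here could look roughly like the sketch below. It is only an illustration under stated assumptions: the `negated` key and its `yes`/`true` values are not standardized by the proposal, the function names are hypothetical, and the input rows are assumed to be already joined to NOTE to obtain person_id (NOTE_NLP itself only carries note_id).

```python
from collections import Counter


def is_negated(term_modifiers):
    """True if the modifier string contains negated=yes or negated=true.

    The 'negated' key and its values are assumptions; modifier names and
    value sets are not standardized in this proposal.
    """
    for pair in (term_modifiers or "").split(","):
        key, _, value = pair.partition("=")
        if key.strip().lower() == "negated":
            return value.strip().lower() in {"yes", "true"}
    return False


def build_features(note_nlp_rows):
    """Count (concept_id, polarity) features per person.

    note_nlp_rows: iterable of (person_id, note_nlp_concept_id, term_modifiers).
    polarity is 'negated' or 'not_negated'.
    """
    features = {}
    for person_id, concept_id, term_modifiers in note_nlp_rows:
        polarity = "negated" if is_negated(term_modifiers) else "not_negated"
        features.setdefault(person_id, Counter())[(concept_id, polarity)] += 1
    return features
```

Each person then has a sparse count vector keyed by concept and polarity, which a prediction algorithm can select from. Whether the negated features carry signal remains an empirical question.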

hripcsa commented 7 years ago

(Sorry I haven’t figured out how to quote from within email as opposed to on the forum.)

First, I do agree that machine learning is a little different because it takes advantage of the information whatever it is. As opposed to hand-written rules where you want it to mean what it says. So fields that are less useful for hand-written rules might still be useful for machine learning. So the use case is good.

But second, I am glad you gave those examples. We would absolutely NOT want to call "rule out pneumonia" a negated feature. Quite the opposite, it means possible pneumonia, because someone suspects it. Some systems may call that low certainty, and others literally call it “rule out”, and others call it conditional or something (actually not conditional, but some other concept; conditional is “if the patient gets pneumonia, go to the ED"). That’s very different from "the patient has no pneumonia," which IS a negation. My point in bringing it up is that it is not simple. We want something like negation, but it is actually a lot more complicated than that. Unless you meant “RULED out pneumonia,” which could be negation.

Even “observed no rash” is not so simple. Sometimes when a clinician says something was NOT OBSERVED, they mean they did not look for it. That is, they are warning you that they have no information. As opposed to “the patient had no rash,” which would be negation. Depending on the context you might not want to mix those two in a negation flag. Again, it comes down to each system and each researcher interpreting “negation” differently, which undoes its usefulness until we can come to some agreement.

George

schuemie commented 7 years ago

I think the issues you mention are generic to NLP: we have no NLP that can figure out the full semantics of natural language, and we are likely to get the 'present' label wrong many times for the same reasons. Anyone using NLP output has to consider its noisy nature.

Despite my poor attempt at defining negation (I guess I meant 'ruled out'), it is a common concept in NLP, for example as implemented in NegEx. And although the boundaries of what negation means are perhaps vague, it suggests quite different semantics than non-negated things, and that distinction may be informative for example for a machine learning algorithm.

cgreich commented 7 years ago

Friends:

Usually it helps in these debates when you do concrete use cases. Then it is much easier to vote on adding the feature or not.

huaxu7000 commented 7 years ago

All modifiers (including negation) will be stored in the term_modifiers field. Therefore it is possible to conduct machine learning studies that use these modifiers. You just need an extra step to parse the needed modifiers from the field. The reason for storing all modifiers in one field is that we are not sure how these modifiers will be used. We want to see more actual use cases before we decide on the next version. For example, one way to use NLP outputs is to export concepts together with modifiers to corresponding tables (e.g., lab tests and value modifiers go to MEASURES) and label their source as text document.

Keep in mind that another issue with modifiers is that we do not have a set of modifiers and allowable value sets that we all agree on. As people use different NLP systems, we expect that different modifiers will be stored in the term_modifiers field at this time. We are working on a recommendation of common modifiers and their allowable values (probably leveraging existing work such as that from Wendy Chapman's lab).

The current Note_NLP table is just a start. We expect more changes will be suggested based on use cases. But for now, let's fix this version so that we can conduct some studies using it and identify potential improvements. Thx

Hua

schuemie commented 7 years ago

@cgreich, the specific use case I have is this: We want to use NLP features in predictive models. More specifically, right now we want to fit propensity models in a Dutch GP EHR system. We have an algorithm for identifying negations, and I want to implement a covariate builder in the FeatureExtraction package that creates separate features for negated and non-negated terms, because I hypothesize there may be value in that (better predictions). I can then plug the covariate builder into CohortMethod.

Right now we would have to come up with a string we would put in the term_modifiers field, and FeatureExtraction would have to look for that string when creating features. But since that string is not standardized, another site will probably use a different string, so we can't create a covariate builder that automatically runs everywhere.
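A minimal sketch of the portability problem described above, not the FeatureExtraction implementation. It assumes NOTE_NLP rows have been reduced to hypothetical `(concept_id, term_modifiers)` pairs, and that each site has picked its own marker string for negation. The function name `negation_features`, the marker strings, and the concept ID are all illustrative.

```python
from collections import Counter


def negation_features(rows, negated_marker):
    """Count (concept_id, polarity) pairs.

    rows: iterable of (concept_id, term_modifiers) pairs (hypothetical shape).
    negated_marker: the site-specific substring that signals negation.
    """
    features = Counter()
    for concept_id, term_modifiers in rows:
        # Plain substring match: this is exactly the fragile, unstandardized step.
        negated = negated_marker in (term_modifiers or "")
        polarity = "negated" if negated else "non_negated"
        features[(concept_id, polarity)] += 1
    return features


# Illustrative data: two sites encode the same negation differently.
site_a = [(201826, "negation: negated"), (201826, "")]
site_b = [(201826, "neg=true"), (201826, "")]

# The marker chosen at site A finds the negated mention at site A ...
features_a = negation_features(site_a, "negation: negated")
# ... but silently treats the negated mention at site B as non-negated.
features_b = negation_features(site_b, "negation: negated")
```

With site A's marker, the negated mention at site B is counted as non-negated, which is why a shared convention is needed before one covariate builder can run everywhere.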

hripcsa commented 7 years ago

Two things.

  1. Hua, what did we decide was the format for the term_modifiers string? There is a modifier delimiter to separate modifiers (semicolon ; or something) and a delimiter after the modifier name (colon : or something). So “negation: negated; uncertainty: certain” would be the kind of syntax. Once you pick a modifier name like “negation”, there is no difference between a dedicated column and the term_modifiers string. Both are equally specified. More problematically, neither has semantics or a value syntax associated with it, so sharing is unlikely with or without the column. If you planned to define a syntax for the column, then we can also define it for the string.

  2. You could also just use term_exists as input to ML. It’s mostly negation. Rarely it could include a disease in a different person (family history), but the bulk will be negation of the patient’s conditions. You are using machine learning, so a precise definition is not needed. It will learn the field's value to the prediction.

George

huaxu7000 commented 7 years ago

Yes, we suggested a format like “negation: negated; uncertainty: certain”. You can query negation from this concatenated field. I agree the more problematic issue is standardizing the modifiers and their values. As different sites may use different NLP systems, the modifier outputs could differ, which makes it challenging to run studies across sites. It will take us some time to get everyone to agree on a standard for modifiers and their values.
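As a sketch of how negation could be queried from the concatenated field, assuming the suggested syntax (semicolon between modifiers, colon after the modifier name): the helper names `parse_term_modifiers` and `is_negated` are hypothetical, and the value "negated" is taken only from the example string in this thread, not from an agreed standard.

```python
def parse_term_modifiers(raw, pair_sep=";", name_sep=":"):
    """Parse a string like 'negation: negated; uncertainty: certain' into a dict."""
    modifiers = {}
    for part in (raw or "").split(pair_sep):
        if name_sep not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split(name_sep, 1)
        # Normalize whitespace and case so 'Negation : Negated' also matches.
        modifiers[name.strip().lower()] = value.strip().lower()
    return modifiers


def is_negated(raw):
    # Assumption: the site writes the value 'negated' for the 'negation' modifier.
    return parse_term_modifiers(raw).get("negation") == "negated"


example = parse_term_modifiers("negation: negated; uncertainty: certain")
```

The parser itself is independent of which modifier names a site uses; only `is_negated` depends on the unstandardized name and value.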

clairblacketer commented 7 years ago

Closing this issue, as the NOTE_NLP table was added to CDM v5.2 as it appears at the top, though the discussion is still open.

clairblacketer commented 6 years ago

For my reference: the document ontology referred to is in DocumentOntology.xlsx.