blairlearn commented 3 years ago

As an API Developer, I want to retrieve dynamic listing data from Elasticsearch, so that I don't have to implement the search logic

ESTIMATE 20

Acceptance Criteria

Scenario: A single thesaurus entry, no CTRP name, no display name and no override
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
       ]  
     }
  When processing occurs 
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
       "label": "Test Term",
       "normalized": "test term"
     },
     "pretty_url_name": "test-term" 
  }

Scenario: A single thesaurus entry, no CTRP name, display name and no override
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": "Colorectal (Colon or Rectal) Cancer",
       "synonyms": [
       ]  
     }
  When processing occurs 
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
       "label": "Colorectal (Colon or Rectal) Cancer",
       "normalized": "colorectal (colon or rectal) cancer"
     },
     "pretty_url_name": "colorectal-colon-or-rectal-cancer" 
  }

Scenario: A single thesaurus entry, CTRP name, no display name and no override
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "Colorectal (Colon or Rectal) Cancer",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
     "label": "Colorectal (Colon or Rectal) Cancer",
     "normalized": "colorectal (colon or rectal) cancer"
    },
    "pretty_url_name": "colorectal-colon-or-rectal-cancer" 
  }

Scenario: Multiple thesaurus entries, same CTRP name, no display name and no override
  Given the thesaurus entries
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     },
     {
       "code": "C4567",
       "label": "Test Term 2",
       "preferredName": "Test Term 2",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234", "C4567"],
    "name": {
     "label": "CTRP Name",
     "normalized": "ctrp name"
    },
    "pretty_url_name": "ctrp-name" 
  }

Scenario: Multiple thesaurus entries, one CTRP name, matching display name and no override
  Given the thesaurus entries
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "Matching Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     },
     {
       "code": "C4567",
       "label": "Test Term 2",
       "preferredName": "Test Term 2",
       "displayName": "Matching Name",
       "synonyms": [
       ]  
     }
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234", "C4567"],
    "name": {
     "label": "Matching Name",
     "normalized": "matching name"
    },
    "pretty_url_name": "matching-name" 
  }

Scenario: Multiple thesaurus entries, same CTRP name, no display name and a full override
  Given the thesaurus entries
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     },
     {
       "code": "C4567",
       "label": "Test Term 2",
       "preferredName": "Test Term 2",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
     And an override exists:
        ["C1234,C4567", "Colorectal Cancer", "colorectal-cancer"]
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234", "C4567"],
    "name": {
     "label": "Colorectal Cancer",
     "normalized": "colorectal cancer"
    },
    "pretty_url_name": "colorectal-cancer"
  }

Scenario: Multiple thesaurus entries, same CTRP name, no display name and a partial override
  Given the thesaurus entries
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     },
     {
       "code": "C4567",
       "label": "Test Term 2",
       "preferredName": "Test Term 2",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
     And an override exists:
        ["C4567", "Colorectal Cancer", "colorectal-cancer"]
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234", "C4567"],
    "name": {
     "label": "Colorectal Cancer",
     "normalized": "colorectal cancer"
    },
    "pretty_url_name": "colorectal-cancer"
  }

Scenario: A single thesaurus entry, CTRP name, no display name with an override
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "Colorectal (Colon or Rectal) Cancer",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
     And an override exists:
       ['C1234', 'Colorectal Cancer', 'colorectal-cancer']       
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
     "label": "Colorectal Cancer",
     "normalized": "colorectal cancer"
    },
    "pretty_url_name": "colorectal-cancer" 
  }

Scenario: A single thesaurus entry, CTRP name with a stage, no display name and no override
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "Stage IIIB test term",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
     "label": "Stage IIIB test term",
     "normalized": "stage IIIB test term"
    },
    "pretty_url_name": "stage-iiib-test-term" 
  }

Scenario: A single thesaurus entry, CTRP name with a stage, no display name with an override with no stage
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "Stage IIIB test term",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
    And an override exists:
       ["C1234", "Colorectal Cancer", "colorectal-cancer"]
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
     "label": "Colorectal Cancer",
     "normalized": "colorectal cancer"
    },
    "pretty_url_name": "colorectal-cancer" 
  }

Scenario: A single thesaurus entry, CTRP name with a proper noun, no display name and no override
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "Ewing Sarcoma of Bone",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
     "label": "Ewing Sarcoma of Bone",
     "normalized": "Ewing sarcoma of bone"
    },
    "pretty_url_name": "ewing-sarcoma-of-bone" 
  }

Scenario: A single thesaurus entry, CTRP name, no display name with an override inheriting the label
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
     And an override exists:
       ['C1234', '<INHERIT>', 'colorectal-cancer']
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
     "label": "CTRP Name",
     "normalized": "ctrp name"
    },
    "pretty_url_name": "colorectal-cancer"
  }

Scenario: A single thesaurus entry, CTRP name, no display name with an override inheriting the pretty url
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
     And an override exists:
       ['C1234', 'Colorectal Cancer', '<INHERIT>']
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
     "label": "Colorectal Cancer",
     "normalized": "colorectal cancer"
    },
    "pretty_url_name": "ctrp-name" 
  }

Scenario: A single thesaurus entry, CTRP name, no display name with an override blanking the pretty url
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": null,
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]  
     }
     And an override exists:
       ['C1234', 'Colorectal Cancer', '<BLANK>']
  When processing occurs
  Then the resulting Elasticsearch index will only contain the following records
  {
    "concept_id": ["C1234"],
    "name": {
     "label": "Colorectal Cancer",
     "normalized": "colorectal cancer"
    },
    "pretty_url_name": null
  }

Scenario: A single thesaurus entry with a CTRP name, a display name and no override
  Given a thesaurus entry
     {
       "code": "C1234",
       "label": "Test Term",
       "preferredName": "Test Term",
       "displayName": "Different Name",
       "synonyms": [
         {
           "termName": "CTRP Name",
           "termGroup": "DN",
           "termSource": "CTRP",
           "sourceCode": null,
           "subsourceName": null
         }
       ]
     }
  When processing occurs 
  Then the resulting Elasticsearch index will only contain the following record
      {
       "concept_id": ["C1234"],
       "name": {
          "label": "CTRP Name",
          "normalized": "ctrp name"
        },
        "pretty_url_name": "ctrp-name" 
      }

TODO: Special URLs

Solution

Technical Notes

Data Ingestion Process

On a recurring basis (Nightly? Weekly? More than once every three years....), a data-processing run produces a set of documents mapping concept IDs to concept names and pretty-url name segments. These documents are used to look up the information needed to create one of the trial listing pages under www.cancer.gov/about-cancer/treatment/clinical-trials/disease/ or www.cancer.gov/about-cancer/treatment/clinical-trials/intervention/

Two document types are produced:

Listing Info - Contains the metadata from combining EVS records with override data (see use case scenarios).
Label Information - Contains "identifier" and label for specific pretty-url names (typically - but not always? - trial types.)

Box markdown is awful for embedding images, but here's a link to a pretty picture showing the data flow.

Processing

What does the Dynamic Listing Loader do? NOTE: See the processing scenarios for a high-level view.

On some TBD schedule (Every night? Or weekly? Or....?):

From some document store, retrieve the following documents:
- The list of text-replacement (label) mappings (label-information.txt -- see below).
- The list of override mappings (override-mappings.txt -- see below).
- The list of unmodifiable tokens (token-list.txt -- see below).
(See the "Data Files" section below for further details)
Create a new "listing info" index in Elasticsearch named after the loader execution's date and time (e.g., listinginfo_20200903_1525).
Scrape the EVS's RESTful API for a collection of Disease and Intervention value sets (groups of concept IDs, or "C-Codes" with the same display label).
1. Recursively fetch all concepts which are descendants of disease [C7057] or intervention [C1908].
2. Remove C138195 and C131913 because they're "bad" (no further detail available).
3. Find the label for each EVS concept record by using these values in order of decreasing priority:
  1. First synonym with source "CTRP" and group "DN".
  2. EVS record's display name.
  3. EVS record's preferred name.
4. Collect the EVS concept records into "value sets" having the same normalized version of the label (see "Name Normalization Rules" below).
5. Convert each value set's normalized display name to a pretty url using the name-to-prettyurl rules.
  - If the pretty url is more than 75 characters long, discard it. This value set will have no pretty URL.
  - If more than one value set has the same (non-NULL) pretty URL, abort the job.
Reference: Term map generator
Load and validate the list of override records. If these conditions are not met, log an error and abort processing:
- All override records MUST include one or more C-Codes.
- All override records MUST contain a label containing EITHER text OR the special value <INHERIT>.
- All override records MUST contain a pretty-url containing either a valid, non-empty pretty-url segment, or one of the special values <BLANK> or <INHERIT>.
  - Valid pretty-url segments contain only the letters a-z (lowercase), numbers 0-9, and hyphens.
- A given C-Code may only appear in a single override record.
- An override record's set of C-Codes MUST match those of at most one value set.
  - If an override's set of C-Codes intersects with those from a value set, it is considered a match.
  - If an override record includes additional C-Codes which are not a match for any value set, a warning should be logged instead of an error. This does NOT abort the job.
  - If an override does not match any value set, a warning should be logged instead of an error. This does NOT abort the job.
- An override record's pretty-url name MUST NOT duplicate another override record's pretty-url name.
- An override record's pretty-url name MUST be no more than 75 characters long.
Apply overrides as appropriate. For each of the value sets from step 3:
1. Create a ListingInfo object (it's on the lower half of the page).
  - Set ConceptId to an array containing the value set's list of C-Code strings.
  - Set Name.Label to the display name of the EVS concept record in the value set with the numerically lowest C-Code.
  - Set Name.Normalized to the normalized display name shared by the EVS concept records in the value set (see "Name Normalization Rules" below).
  - Set PrettyUrlName to the pretty URL determined above for the value set (NULL if the URL was discarded because it exceeded the allowable length limit).
2. If there exists an override record with a set of C-Codes which intersects with the value set's set of C-Codes:
  1. If the override record contains C-Codes which are not contained in the value set, log a warning, but continue.
  2. Set the override values:
    - Unless the override record's label is <INHERIT>, set Name.Label to the label from the override record.
    - Set Name.Normalized to the normalized version of Name.Label.
    - Set PrettyUrlName:
      - If the override record's PrettyUrlName is set to <INHERIT>, make no change.
      - If the override record's PrettyUrlName is set to <BLANK>, set the value to NULL.
      - Otherwise, set the value to the override record's pretty-url.
3. Review the ListingInfo objects:
  - If there are any duplicate PrettyUrlName values, log an error and abort the job.
Save the collection of ListingInfo objects to the index.
Create TrialTypeInfo documents.
1. Create a new "trial type" index in Elasticsearch named after the loader execution's date and time (e.g., listingtrialtype_20200903_1525).
2. For each record in label-information.txt:
  1. Set the PrettyUrlName field to the value of the "URL-Friendly" field.
  2. Set the IdString field to the value of the "Identifier" field.
  3. Set the Label field to the value of the "Label text" field.
  4. Store the object in elasticsearch as a LabelInformation document in the index created in step 5.1.
If all entries are safely stored, update the listinginfov1 and listingtrialtypev1 aliases.
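
The label-priority and value-set grouping steps described above can be sketched roughly as follows. This is only an illustration, not the loader's actual code: the function names `pick_label` and `group_into_value_sets` are hypothetical, the record shape is taken from the acceptance-criteria scenarios, and plain lowercasing stands in for the real normalization rules.

```python
from collections import defaultdict


def pick_label(concept):
    """Return the label for one EVS concept record, by decreasing priority."""
    # 1. First synonym with source "CTRP" and group "DN".
    for synonym in concept.get("synonyms") or []:
        if synonym.get("termSource") == "CTRP" and synonym.get("termGroup") == "DN":
            return synonym["termName"]
    # 2. The display name, falling back to 3. the preferred name.
    return concept.get("displayName") or concept["preferredName"]


def group_into_value_sets(concepts, normalize=str.lower):
    """Group C-Codes by the normalized form of their label.

    `normalize` is a placeholder (assumption); the real loader would apply
    the name normalization rules, including the unmodifiable token list.
    """
    value_sets = defaultdict(list)
    for concept in concepts:
        value_sets[normalize(pick_label(concept))].append(concept["code"])
    return dict(value_sets)


concepts = [
    {
        "code": "C1234",
        "preferredName": "Test Term",
        "displayName": None,
        "synonyms": [
            {"termName": "Matching Name", "termGroup": "DN", "termSource": "CTRP"}
        ],
    },
    {
        "code": "C4567",
        "preferredName": "Test Term 2",
        "displayName": "Matching Name",
        "synonyms": [],
    },
]

print(group_into_value_sets(concepts))
# {'matching name': ['C1234', 'C4567']}
```

The sample data mirrors the "one CTRP name, matching display name" scenario: both concepts end up in a single value set because their labels normalize to the same string.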

Name Normalization Rules

This is how we convert a display name into a normalized form.

All words are converted to lowercase.
Except: Words which appear in the list of unmodifiable tokens (token-list.txt) do NOT have their case altered.
Word boundaries are identified by the presence of either a space or a hyphen.
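
As a rough sketch of these rules (the function name `normalize_name` and the `protected_tokens` parameter are made up for illustration), the label can be split on spaces and hyphens, with each word lowercased unless it appears in the token list:

```python
import re


def normalize_name(label, protected_tokens):
    """Lowercase every word in `label` except those in `protected_tokens`."""
    # Split on spaces and hyphens; the capture group keeps the separators
    # in the result so the label can be re-joined unchanged.
    parts = re.split(r"([ -])", label)
    return "".join(
        part if part in protected_tokens else part.lower()
        for part in parts
    )


tokens = {"IIIB", "Ewing"}
print(normalize_name("Stage IIIB test term", tokens))   # stage IIIB test term
print(normalize_name("Ewing Sarcoma of Bone", tokens))  # Ewing sarcoma of bone
```

Because only spaces and hyphens count as word boundaries, a protected token attached to other punctuation (for example inside parentheses) would still be lowercased under this sketch.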

Data Files

Files are a convenient metaphor; these don't necessarily need to be implemented as literal text files.

override-mappings.txt

A list of C-Codes with labels and URLs to use as replacements for the ones provided by EVS.

Each record consists of these fields:

CCodes A list of one or more C-Codes. A given C-Code may exist in no more than one record.
Label The label to replace the one from EVS. If there is no override, specify <INHERIT> (with the angle brackets) to use the label from the EVS value set.
Pretty URL The pretty-url name segment for the label, replacing the one which would be derived from the EVS value set. May be <BLANK> to force use of C-Codes in the URL, or <INHERIT> to use the pretty url derived from the value set's label.

NOTE: Override mapping entries should only exist for value sets whose pretty urls (and optional labels) are to be overridden. If nothing is being overridden, there should be no override record.
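
To illustrate how the <INHERIT> and <BLANK> values might be handled, here is a minimal sketch. The function name `apply_override`, the tuple layout of the override record, and the use of plain lowercasing for normalization are all assumptions made for the example; the output field names follow the records shown in the acceptance criteria.

```python
INHERIT = "<INHERIT>"
BLANK = "<BLANK>"


def apply_override(listing_info, override, normalize=str.lower):
    """Return a copy of `listing_info` with one override record applied.

    `override` is assumed to be a (c_codes, label, pretty_url) tuple.
    `normalize` is a placeholder for the real name normalization rules.
    """
    _codes, label, pretty_url = override
    result = {
        "concept_id": list(listing_info["concept_id"]),
        "name": dict(listing_info["name"]),
        "pretty_url_name": listing_info["pretty_url_name"],
    }
    # <INHERIT> keeps the label derived from the EVS value set.
    if label != INHERIT:
        result["name"]["label"] = label
    result["name"]["normalized"] = normalize(result["name"]["label"])
    # <BLANK> removes the pretty URL; <INHERIT> leaves it untouched.
    if pretty_url == BLANK:
        result["pretty_url_name"] = None
    elif pretty_url != INHERIT:
        result["pretty_url_name"] = pretty_url
    return result


base = {
    "concept_id": ["C1234"],
    "name": {"label": "CTRP Name", "normalized": "ctrp name"},
    "pretty_url_name": "ctrp-name",
}
print(apply_override(base, (["C1234"], "Colorectal Cancer", BLANK)))
```

Returning a copy rather than mutating the input keeps the EVS-derived record available for logging or comparison, though the real loader may well choose to update in place.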

label-information.txt

A list of text strings to be substituted for other strings.

Each record consists of three fields:

URL-Friendly The string in a URL-friendly format (e.g. basic-science). These values must be unique. (i.e. There may not be two entries for 'basic-science'.)
Identifier The string's value for use as an identifier (e.g. basic_science). These are generally single tokens with no spaces or other punctuation.
Label text The string as a label (e.g. Basic Science)
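
A possible way to turn these records into documents while enforcing the uniqueness rule is sketched below. The function name `build_trial_type_docs` and the tuple-based input are assumptions (the storage format is deliberately left open above); the output keys match the fields in the trial type mapping file.

```python
def build_trial_type_docs(records):
    """Convert (url_friendly, identifier, label_text) tuples into documents.

    Raises ValueError if a URL-Friendly value appears more than once.
    """
    docs = []
    seen = set()
    for url_friendly, identifier, label_text in records:
        if url_friendly in seen:
            raise ValueError(f"duplicate URL-Friendly value: {url_friendly!r}")
        seen.add(url_friendly)
        docs.append(
            {
                "pretty_url_name": url_friendly,
                "id_string": identifier,
                "label": label_text,
            }
        )
    return docs


print(build_trial_type_docs([("basic-science", "basic_science", "Basic Science")]))
# [{'pretty_url_name': 'basic-science', 'id_string': 'basic_science', 'label': 'Basic Science'}]
```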

token-list.txt

A list of string tokens which, when they appear in a label, will not be altered when the label is normalized.

Each line consists of exactly one token, containing no spaces or punctuation. Non-English letters and letters with diacritics are allowed.

example

I
IA
IB
IC
IA1
IA2
IB1
IB2
II
IIA
IIB
III
IIIA
IIIB
IIIC
IV
IVA
IVB
IVC
IIB
Kaposi
Hodgkin
Sézary
Ewing
Langerhans
Merkel
Wilms
Burkitt

blairlearn commented 3 years ago

Mapping file for trial types (trial-type-es-mapping.json)

{
    "settings": {
        "index": {
            "number_of_shards": "1",
            "analysis": {
                "normalizer": {
                    "caseinsensitive_normalizer": {
                        "type": "custom",
                        "char_filter": [],
                        "filter": [
                            "lowercase",
                            "asciifolding"
                        ]
                    }
                }
            }
        }
    },
    "mappings": {
        "TrialTypeInformation": {
            "_all":         { "enabled": false },
            "properties":   {
                "pretty_url_name":  {
                                        "type": "keyword",
                                        "normalizer": "caseinsensitive_normalizer"
                                    },
                "id_string":        {
                                        "type": "keyword",
                                        "normalizer": "caseinsensitive_normalizer"
                                    },
                "label":            { "type": "keyword"    }
            }
        }
    }
}

blairlearn commented 3 years ago

Mapping file for ListingInfo documents (listing-info-es-mapping.json)

{
    "settings": {
        "index": {
            "number_of_shards": "1",
            "analysis": {
                "normalizer": {
                    "caseinsensitive_normalizer": {
                        "type": "custom",
                        "char_filter": [],
                        "filter": [
                            "lowercase",
                            "asciifolding"
                        ]
                    }
                }
            }
        }
    },
    "mappings": {
        "ListingInfo": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "concept_id": {
                    "type": "keyword",
                    "normalizer": "caseinsensitive_normalizer"
                },
                "name": {
                    "properties": {
                        "label": {
                                "type": "keyword",
                                "normalizer": "caseinsensitive_normalizer"
                            },
                        "normalized": {
                                "type": "keyword",
                                "normalizer": "caseinsensitive_normalizer"
                            }
                    }
                },
                "pretty_url_name": {
                    "type": "keyword",
                    "normalizer": "caseinsensitive_normalizer"
                }
            }
        }
    }
}

blairlearn commented 1 year ago

Addendum: After a successful run, the loader should delete any indices (both listing info and trial type) which are more than five days old, except that the last three successfully loaded indices of each type must always be retained.

So, if it's been seven days since the last successful load, the last three successfully loaded indices are retained even though they're more than five days old. Additionally, if whatever is preventing the loader from running is fixed on day eight, only the oldest of those last three indices is deleted.
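
The retention rule could be expressed as a small pure function along these lines. This is a sketch under assumptions: the name `indices_to_delete` and its parameters are hypothetical, and it expects the caller to pass only the successfully loaded indices of a single type together with their creation times.

```python
from datetime import datetime, timedelta


def indices_to_delete(indices, now, max_age=timedelta(days=5), keep_latest=3):
    """Pick the indices of one type that may be deleted after a successful run.

    `indices` maps index name -> creation datetime (successful loads only).
    The newest `keep_latest` indices are always retained, whatever their age.
    """
    newest_first = sorted(indices, key=indices.get, reverse=True)
    protected = set(newest_first[:keep_latest])
    return sorted(
        name
        for name in newest_first
        if name not in protected and now - indices[name] > max_age
    )


# The loader was down for a week and succeeds again on 2020-09-11.
indices = {
    "listinginfo_20200901_1525": datetime(2020, 9, 1, 15, 25),
    "listinginfo_20200902_1525": datetime(2020, 9, 2, 15, 25),
    "listinginfo_20200903_1525": datetime(2020, 9, 3, 15, 25),
    "listinginfo_20200911_1525": datetime(2020, 9, 11, 15, 25),
}
print(indices_to_delete(indices, now=datetime(2020, 9, 11, 15, 30)))
# ['listinginfo_20200901_1525']
```

In the example, the new index plus the two most recent older ones form the protected set, so only the oldest index is removed even though three of them exceed the five-day limit.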

(Entered in JIRA as [OCECDR-5232])

NCIOCPL / clinical-trials-listing-api

Story: Create Dynamic Listing Data Loader #2
