Closed blairlearn closed 2 years ago
Mapping file for trial types (trial-type-es-mapping.json)
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "normalizer": {
          "caseinsensitive_normalizer": {
            "type": "custom",
            "char_filter": [],
            "filter": [
              "lowercase",
              "asciifolding"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "TrialTypeInformation": {
      "_all": { "enabled": false },
      "properties": {
        "pretty_url_name": {
          "type": "keyword",
          "normalizer": "caseinsensitive_normalizer"
        },
        "id_string": {
          "type": "keyword",
          "normalizer": "caseinsensitive_normalizer"
        },
        "label": { "type": "keyword" }
      }
    }
  }
}
Mapping file for ListingInfo documents (listing-info-es-mapping.json)
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "normalizer": {
          "caseinsensitive_normalizer": {
            "type": "custom",
            "char_filter": [],
            "filter": [
              "lowercase",
              "asciifolding"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "ListingInfo": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "concept_id": {
          "type": "keyword",
          "normalizer": "caseinsensitive_normalizer"
        },
        "name": {
          "properties": {
            "label": {
              "type": "keyword",
              "normalizer": "caseinsensitive_normalizer"
            },
            "normalized": {
              "type": "keyword",
              "normalizer": "caseinsensitive_normalizer"
            }
          }
        },
        "pretty_url_name": {
          "type": "keyword",
          "normalizer": "caseinsensitive_normalizer"
        }
      }
    }
  }
}
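Both mappings route their keyword fields through caseinsensitive_normalizer, which chains the lowercase and asciifolding filters. As a rough illustration of what that means for lookups, here is a Python approximation. It is only a sketch: the function name is hypothetical, and Unicode decomposition is a stand-in for asciifolding, so characters without a canonical decomposition are left unchanged here even where the real filter may fold them.

```python
import unicodedata


def approximate_normalizer(value: str) -> str:
    """Approximate lowercase + asciifolding (assumption: NFKD is close enough)."""
    lowered = value.lower()
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Different spellings collapse to one stored keyword, so an exact-match
# query on either form would be expected to find the same document.
print(approximate_normalizer("Basic-Science"))
print(approximate_normalizer("Sjögren"))
```

The practical consequence is that a field such as pretty_url_name can be matched exactly without the caller worrying about case or diacritics.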
Addendum: After a successful run, the loader should delete any indices (both listing info and trial type) that are more than five days old, except that the last three successfully loaded indices of each type must always be retained.
So, if it's been seven days since the last successful load, the last three successfully loaded indices are retained even though they're more than five days old. If whatever is preventing the loader from running is fixed on day eight, only the oldest of those three indices is deleted.
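The retention rule can be sketched as a small selection function. This is a minimal sketch, not the loader's actual code: the function name, the (name, created) tuple shape, and the idea of calling it once per index type are assumptions, and the input is assumed to contain only successfully loaded indices.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

MAX_AGE = timedelta(days=5)   # indices older than this are eligible for deletion
MIN_RETAINED = 3              # always keep this many of the newest indices


def indices_to_delete(successful: List[Tuple[str, datetime]], now: datetime) -> List[str]:
    """Pick indices of ONE type to delete, given (name, created) pairs."""
    newest_first = sorted(successful, key=lambda item: item[1], reverse=True)
    # The newest MIN_RETAINED indices are never candidates, whatever their age.
    candidates = newest_first[MIN_RETAINED:]
    return [name for name, created in candidates if now - created > MAX_AGE]


# Day seven: the last successful loads are 7, 8 and 9 days old; all are kept.
day7 = datetime(2020, 9, 10, 12, 0)
stale = [
    ("a", day7 - timedelta(days=9)),
    ("b", day7 - timedelta(days=8)),
    ("c", day7 - timedelta(days=7)),
]
print(indices_to_delete(stale, day7))

# Day eight: a new load succeeds, so only the oldest of the three is deleted.
day8 = day7 + timedelta(days=1)
print(indices_to_delete(stale + [("d", day8)], day8))
```

Running the selection only after a successful load, as the addendum requires, means a failing loader never shrinks the set of usable indices.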
(Entered in JIRA as [OCECDR-5232])
As an API Developer, I want to retrieve dynamic listing data from Elasticsearch, so that I don't have to implement the search logic
Acceptance Criteria
TODO: Special URLs
Solution
Technical Notes
Data Ingestion Process
On some recurring basis (Nightly? Weekly? More than once every three years....), a data-processing run occurs to produce a set of documents mapping concept IDs to concept names and pretty-url name segments. These documents are used for looking up information to create one of the trial listing pages under www.cancer.gov/about-cancer/treatment/clinical-trials/disease/ OR www.cancer.gov/about-cancer/treatment/clinical-trials/intervention/
Two document types are produced:
Box markdown is awful for embedding images, but here's a link to a pretty picture showing the data flow.
Processing
What does the Dynamic Listing Loader do? NOTE: See the Processing scenarios for a high-level view.
On some TBD schedule (Every night? Or weekly? Or....?):
From some document store, retrieve the following documents:
label-information.txt (see below)
override-mappings.txt (see below)
token-list.txt (see below)
(See the "Data Files" section below for further details)
Create a new "listing info" index in Elasticsearch based on the loader execution's date and time (e.g., listinginfo_20200903_1525).
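The example name suggests a listinginfo_YYYYMMDD_HHMM pattern. A minimal sketch of generating such a name follows; the function name is hypothetical, and whether the timestamp is local time or UTC is not stated in this issue, so the caller decides.

```python
from datetime import datetime


def listing_info_index_name(run_started: datetime) -> str:
    """Build an index name from the loader run's start time (assumed pattern)."""
    return run_started.strftime("listinginfo_%Y%m%d_%H%M")


print(listing_info_index_name(datetime(2020, 9, 3, 15, 25)))
```

Because the name sorts lexically in creation order, listing the indices by name also lists them oldest to newest.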
Scrape the EVS's RESTful API for a collection of Disease and Intervention value sets (groups of concept IDs, or "C-Codes" with the same display label).
Reference: Term map generator
Load and validate the list of override records. If these conditions are not met, log an error and abort processing:
<INHERIT>.
<BLANK> or <INHERIT>.
Apply overrides as appropriate. For each of the value sets from step 3:
Create a ListingInfo object (it's on the lower half of the page).
Set ConceptId to an array containing the value set's list of C-Code strings.
Set Name.Label to the display name of the EVS concept record in the value set with the numerically lowest C-Code.
Set Name.Normalized to the normalized display name shared by the EVS concept records in the value set (see "Name Normalization Rules" below).
Set PrettyUrlName to the pretty URL determined above for the value set (NULL if the URL was discarded because it exceeded the allowable length limit).
If there exists an override record with a set of C-Codes which intersects with the value set's set of C-Codes:
Unless the override label is <INHERIT>, set Name.Label to the label from the override record.
Set Name.Normalized to the normalized version of Name.Label.
PrettyUrlName:
If PrettyUrlName is set to <INHERIT>, make no change.
If PrettyUrlName is set to <BLANK>, set the value to NULL.
Review the ListingInfo objects:
PrettyUrlName values, log an error and abort the job.
Save the collection of ListingInfo objects to the index.
Create TrialTypeInfo documents.
For each entry in label-information.txt:
Set the PrettyUrlName field to the value of the "URL-Friendly" field.
Set the IdString field to the value of the "Identifier" field.
Set the Label field to the value of the Label text.
Store the LabelInformation document in the index created in step 5.1.
If all entries are safely stored, update the listinginfov1 and listingtrialtypev1 aliases.
Name Normalization Rules
This is how we convert a display name into a normalized form.
Data Files
Files are a convenient metaphor; these don't necessarily need to be implemented as literal text files.
override-mappings.txt
A list of C-Codes with labels and URLs to use as replacements for the ones provided by EVS.
Each record consists of these fields:
<INHERIT> (with the angle brackets) to use the label from the EVS value set.
<BLANK> to force use of C-Codes in the URL, or <INHERIT> to use the pretty url derived from the value set's label.
NOTE: Override mapping entries should only exist for value sets whose pretty urls (and optional labels) are to be overridden. If nothing is being overridden, there should be no override record.
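To make the <INHERIT> and <BLANK> semantics concrete, here is a minimal sketch of applying one override record to one value set's listing. The Override and Listing classes, their field names, and the sample C-Codes are hypothetical stand-ins, since the record layout is not fixed by this issue. Re-deriving the normalized name is left out because the normalization rules are not spelled out here.

```python
from dataclasses import dataclass
from typing import List, Optional

INHERIT = "<INHERIT>"
BLANK = "<BLANK>"


@dataclass
class Override:  # hypothetical in-memory form of one override record
    c_codes: List[str]
    label: str
    pretty_url: str


@dataclass
class Listing:  # hypothetical stand-in for a ListingInfo object
    concept_ids: List[str]
    label: str
    pretty_url_name: Optional[str]


def apply_override(listing: Listing, override: Override) -> Listing:
    # An override applies only when its C-Codes intersect the value set's.
    if not set(override.c_codes) & set(listing.concept_ids):
        return listing
    # <INHERIT> keeps the EVS label; anything else replaces it.
    label = listing.label if override.label == INHERIT else override.label
    if override.pretty_url == INHERIT:
        pretty_url = listing.pretty_url_name   # keep the derived pretty URL
    elif override.pretty_url == BLANK:
        pretty_url = None                      # force C-Codes in the URL
    else:
        pretty_url = override.pretty_url       # use the replacement URL
    return Listing(listing.concept_ids, label, pretty_url)


# Illustrative data only; the C-Codes and names are made up.
evs = Listing(["C1111"], "Example Disease", "example-disease")
print(apply_override(evs, Override(["C1111"], INHERIT, BLANK)))
print(apply_override(evs, Override(["C1111", "C2222"], "Example", INHERIT)))
```

Returning a new object rather than mutating the input keeps the EVS-derived values available for logging when an override changes them.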
label-information.txt
A list of text strings to be substituted for other strings.
Each record consists of three fields:
URL-Friendly name (e.g., basic-science). These values must be unique (i.e., there may not be two entries for 'basic-science').
Identifier (e.g., basic_science). These are generally single tokens with no spaces or other punctuation.
Label (e.g., Basic Science)
token-list.txt
A list of string tokens which, when they appear in a label, will not be altered when the label is normalized.
Each line consists of exactly one token, containing no spaces or punctuation. Non-English letters and letters with diacritics are allowed.
example
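A loader would probably want to validate the token list before using it. The following is a minimal sketch of one way to check the "exactly one token, no spaces or punctuation" rule; the function name and sample tokens are hypothetical, and treating digits as acceptable is an assumption, since the text above only mentions letters.

```python
import unicodedata


def invalid_tokens(lines):
    """Return (line_number, text) pairs for lines that are not a single token."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        # Compose accents so "o" + combining diaeresis counts as one letter.
        token = unicodedata.normalize("NFC", raw.rstrip("\r\n"))
        # isalnum() accepts Unicode letters and digits, and rejects spaces,
        # punctuation, and empty lines.
        if not token.isalnum():
            problems.append((number, token))
    return problems


# Made-up sample input; the second and third lines break the rule.
print(invalid_tokens(["Sjögren\n", "non small\n", "B-cell\n", "HER2\n"]))
```

Reporting line numbers rather than failing on the first bad line lets whoever maintains the file fix every problem in one pass.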