Open theathorn opened 6 years ago
This issue is essentially about mapping an Azul facet like disease to a human readable facet name like Known diseases and deriving that mapping from the metadata schema. The disease facet, for example, is extracted from the metadata field specimen_from_organism.diseases and the schema for specimen_from_organism defines a user-friendly name for that property.
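To make the idea concrete, here is a minimal sketch of deriving a facet label from a schema. The schema excerpt, the `user_friendly` key, the `FACET_FIELDS` table and the `facet_label` helper are all assumptions made for illustration, not the actual Azul or schema layout.

```python
from typing import Any, Mapping

# Hypothetical excerpt of the schema for specimen_from_organism. The
# "user_friendly" key is an assumption about where the schema keeps the label.
SPECIMEN_SCHEMA = {
    "name": "specimen_from_organism",
    "properties": {
        "diseases": {"type": "array", "user_friendly": "Known diseases"},
    },
}

# Hypothetical table of facet name -> metadata field the facet is extracted from.
FACET_FIELDS = {"disease": "specimen_from_organism.diseases"}


def facet_label(facet: str,
                schemas: Mapping[str, Mapping[str, Any]],
                label_key: str = "user_friendly") -> str:
    """Return the schema-defined label for a facet, or the facet name itself
    if the schema or the property cannot be found."""
    entity, _, prop = FACET_FIELDS[facet].partition(".")
    definition = schemas.get(entity, {}).get("properties", {}).get(prop, {})
    return definition.get(label_key, facet)


schemas = {"specimen_from_organism": SPECIMEN_SCHEMA}
labels = {facet: facet_label(facet, schemas) for facet in FACET_FIELDS}
```

Falling back to the facet name keeps the response usable when a schema is missing or does not define a label for the property.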
This is complicated by the following issues:
1) scraping the published schemas each time /repository/… is hit would increase response time by an order of magnitude or more. The load on the S3 bucket that hosts the published schemas would also increase dramatically.
2) the human friendly name as defined in the schema may change over time
3) the property for which we need a human friendly name may be renamed or moved over time
1 and 2 are related. The definitions don't actually change that often, which enables us to solve 1 by caching the mapping. The question is where we cache and when we update the cache.
2 and 3 can be solved by using the latest schema that defines the property at the location the indexer uses. If a property is moved but the indexer isn't using the new location yet, we must use the latest version of the schema that still defines the property in the old location, the one used by the indexer. As you can see, this whole problem is really more the domain of the indexer than of the service.
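The version selection rule above could be sketched roughly as follows. The shape of `versions` (version string mapped to a parsed schema), the dotted numeric version format and the helper names are assumptions for illustration only.

```python
from typing import Any, Mapping, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '5.1.0' into (5, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def latest_defining(versions: Mapping[str, Mapping[str, Any]],
                    prop: str) -> Optional[str]:
    """Return the newest version whose schema still defines `prop`,
    or None if no version defines it."""
    defining = [version for version, schema in versions.items()
                if prop in schema.get("properties", {})]
    return max(defining, key=parse_version, default=None)


# Hypothetical history: the label changes in 5.1.0 and the property is
# moved out of this schema in 10.0.0.
versions = {
    "5.0.0": {"properties": {"diseases": {"user_friendly": "Diseases"}}},
    "5.1.0": {"properties": {"diseases": {"user_friendly": "Known diseases"}}},
    "10.0.0": {"properties": {}},
}
chosen = latest_defining(versions, "diseases")
```

With this data the sketch picks 5.1.0: the newest label for the location the indexer still reads, even though a newer schema version exists.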
I think step one is to define the proper place for this in the service API and hard-code the mapping for now, then gradually move from a hard-coded mapping (stage 1), through scraping the schemas at deploy time (stage 2), to scraping the schemas periodically at run time (stage 3). In stage 2 we'd generate a Python module to be included in the Chalice distribution, similar to what we do for the changelog. In stage 3 we'd write the mapping to ES so it can be read by the service. Stage 4 could add another layer of caching by stashing the mapping in a global variable so it survives multiple Lambda invocations.
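As a rough illustration of stages 2 and 4, the sketch below renders the mapping as Python source at deploy time and caches a loaded mapping in a module-level variable at run time. The names `render_module` and `cached_labels`, the `facet_labels` variable and the `max_age` refresh interval are hypothetical, and the `load` callable stands in for whatever would read the mapping (for example from ES in stage 3).

```python
import time
from typing import Callable, Dict, Optional, Tuple


def render_module(mapping: Dict[str, str]) -> str:
    """Stage 2: render the mapping as Python source that can be written to a
    file and shipped with the deployment package."""
    return ("# Generated at deploy time, do not edit.\n"
            "facet_labels = " + repr(mapping) + "\n")


# Stage 4: module-level state is kept for as long as the Lambda execution
# environment stays warm, so it can be reused across invocations.
_cache: Optional[Tuple[float, Dict[str, str]]] = None


def cached_labels(load: Callable[[], Dict[str, str]],
                  max_age: float = 300.0,
                  now: Callable[[], float] = time.monotonic) -> Dict[str, str]:
    """Return the cached mapping, calling `load` only when the cache is
    empty or older than `max_age` seconds."""
    global _cache
    current = now()
    if _cache is None or current - _cache[0] > max_age:
        _cache = (current, load())
    return _cache[1]
```

The expiry is there because the definitions can still change occasionally; without it a long-lived warm container would keep serving a stale mapping.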
The upcoming ontology expansion feature (https://github.com/HumanCellAtlas/data-browser/issues/352) will have to battle similar concerns.
The big question for now is whether we add a dedicated /facets service endpoint to return the mapping (similar to /order) or if we work the mapping into the /repository/… responses. We could also combine the /order and /facets endpoints into a single /about endpoint.
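For discussion, a combined /about response might look something like the sketch below. The payload shape, the assumption that /order returns a list of facet names, and the `about_payload` helper are hypothetical and not an existing part of the service API.

```python
import json
from typing import Dict, List


def about_payload(order: List[str], labels: Dict[str, str]) -> str:
    """Build a JSON body that carries both the facet order and the facet
    labels. Facets without a known label fall back to their own name."""
    facets = [{"name": name, "label": labels.get(name, name)}
              for name in order]
    return json.dumps({"order": order, "facets": facets})


body = about_payload(["disease", "organ"], {"disease": "Known diseases"})
```

A single endpoint would let the UI fetch order and labels in one request, at the cost of coupling two pieces of information that might change at different rates.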
What are your thoughts @NoopDog?
The Azul web service assigns its own hard-coded facet labels to the metadata entities that it provides to the UI. In some cases the UI then overrides the facet label with its own version (e.g. "Diseases" => "Known Diseases"). Instead of hard-coding these facet labels, Azul should derive them from the metadata schema, and all UI overrides should be removed.
┆Issue is synchronized with this Jira Story ┆Project Name: azul ┆Issue Number: AZUL-354