Make facet labels be driven by the metadata schema

This issue is essentially about mapping an Azul facet like disease to a human-readable facet name like Known diseases, and deriving that mapping from the metadata schema. The disease facet, for example, is extracted from the metadata field specimen_from_organism.diseases, and the schema for specimen_from_organism defines a user-friendly name for that property.
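To make the idea concrete, here is a minimal sketch of deriving labels from a schema. The schema dict is a toy stand-in, and the `user_friendly` key, the `FACET_FIELDS` table and the `facet_label` helper are assumptions for illustration, not the actual Azul code or the exact layout of the published schemas.

```python
# Toy stand-in for a published schema. The structure and the
# "user_friendly" key are assumptions made for this sketch.
SPECIMEN_SCHEMA = {
    "properties": {
        "diseases": {
            "type": "array",
            "user_friendly": "Known diseases",
        },
        # A property without a label, to exercise the fallback.
        "organ": {"type": "object"},
    }
}

# Hypothetical table: facet name -> (schema name, property name).
FACET_FIELDS = {
    "disease": ("specimen_from_organism", "diseases"),
    "organ": ("specimen_from_organism", "organ"),
}


def facet_label(facet, schemas):
    """Return the schema-defined label for a facet, or the facet name."""
    schema_name, prop = FACET_FIELDS[facet]
    definition = schemas[schema_name]["properties"][prop]
    # Fall back to the raw facet name when the schema offers no label.
    return definition.get("user_friendly", facet)


schemas = {"specimen_from_organism": SPECIMEN_SCHEMA}
labels = {facet: facet_label(facet, schemas) for facet in FACET_FIELDS}
```

Falling back to the raw facet name keeps the UI usable when a schema has no label for a property.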

This is complicated by the following issues:

1) Scraping the published schemas each time /repository/… is hit would increase response time by an order of magnitude or more. The load on the S3 bucket that hosts the published schemas would also increase dramatically.

2) The human-friendly name as defined in the schema may change over time.

3) The property for which we need a human-friendly name may be renamed or moved over time.

Issues 1 and 2 are related. The definitions don't actually change that often, which enables us to solve 1 by caching the mapping. The question is where we cache and when we update the cache.

Issues 2 and 3 can be solved by using the latest schema version that defines the property at the location the indexer uses. If a property is moved but the indexer isn't using the new location yet, we must use the latest version of the schema that still defines the property in the old location, the one used by the indexer. As you can see, this whole problem is really more the domain of the indexer than of the service.
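A rough sketch of that selection rule, assuming we already have the schema versions in hand as a dict keyed by version string. The version numbers, the renamed property and the helper names below are made up for illustration.

```python
def parse_version(version):
    """Turn '5.10.0' into (5, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_defining(versions, prop):
    """Return the newest version string whose schema still defines prop.

    versions maps a version string to a schema dict (hypothetical layout).
    """
    candidates = [
        version
        for version, schema in versions.items()
        if prop in schema.get("properties", {})
    ]
    if not candidates:
        raise KeyError(prop)
    return max(candidates, key=parse_version)


# Invented history: the property is renamed in 6.0.0, but the indexer
# still reads it under the old name.
versions = {
    "5.9.0": {"properties": {"diseases": {"user_friendly": "Disease"}}},
    "5.10.0": {"properties": {"diseases": {"user_friendly": "Known diseases"}}},
    "6.0.0": {"properties": {"disease_list": {"user_friendly": "Diseases"}}},
}

chosen = latest_defining(versions, "diseases")
label = versions[chosen]["properties"]["diseases"]["user_friendly"]
```

Comparing parsed tuples rather than raw strings matters here, since "5.9.0" sorts after "5.10.0" as a string.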

I think step one is to define the proper place for this in the service API and hard-code the mapping for now. We would then gradually move from the hard-coded mapping (stage 1), via scraping the schemas at deploy time (stage 2), to scraping the schemas periodically at run time (stage 3). In stage 2 we'd be creating a Python module to be included in the Chalice distribution, similar to what we do for the changelog. In stage 3 we'd write the mapping to ES so it can be read by the service. Stage 4 could add another layer of caching by stashing the mapping in a global variable so it survives multiple Lambda invocations.

The upcoming ontology expansion feature (https://github.com/HumanCellAtlas/data-browser/issues/352) will have to battle similar concerns.

The big question for now is whether we add a dedicated /facets service endpoint to return the mapping (similar to /order) or work the mapping into the /repository/… responses. We could also combine the /order and /facets endpoints into a single /about endpoint.

What are your thoughts @NoopDog?

DataBiosphere / azul

Make facet labels be driven by the metadata schema #552