DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Can't sort on fields of `nested` type #2621

Open achave11-ucsc opened 3 years ago

achave11-ucsc commented 3 years ago

Two fields of `nested` type stored in Elasticsearch cause requests to fail when they are used for sorting:

https://service.dev.singlecell.gi.ucsc.edu/index/samples?sort=organismAgeRange

Traceback (most recent call last):
  File "/var/task/chalice/app.py", line 1135, in _get_view_function_response
    response = view_function(**function_args)
  File "/var/task/app.py", line 1157, in get_sample_data
    return repository_search('samples', sample_id)
  File "/var/task/app.py", line 912, in repository_search
    return service.get_data(catalog=catalog,
  File "/var/task/azul/service/index_query_service.py", line 69, in get_data
    response = self.transform_request(catalog=catalog,
  File "/var/task/azul/service/elasticsearch_service.py", line 621, in transform_request
    raise e
  File "/var/task/azul/service/elasticsearch_service.py", line 606, in transform_request
    es_response = es_search.execute(ignore_cache=True)
  File "/opt/python/elasticsearch_dsl/search.py", line 702, in execute
    es.search(
  File "/opt/python/elasticsearch/client/utils.py", line 84, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/opt/python/elasticsearch/client/__init__.py", line 851, in search
    return self.transport.perform_request(
  File "/opt/python/elasticsearch/transport.py", line 351, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/opt/python/elasticsearch/connection/http_requests.py", line 161, in perform_request
    self._raise_error(response.status_code, raw_data)
  File "/opt/python/elasticsearch/connection/base.py", line 229, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'No mapping found for [contents.donors.organism_age_range.keyword] in order to sort on')

https://service.dev.singlecell.gi.ucsc.edu/index/samples?sort=assayType

Traceback (most recent call last):
  File "/var/task/chalice/app.py", line 1135, in _get_view_function_response
    response = view_function(**function_args)
  File "/var/task/app.py", line 1157, in get_sample_data
    return repository_search('samples', sample_id)
  File "/var/task/app.py", line 912, in repository_search
    return service.get_data(catalog=catalog,
  File "/var/task/azul/service/index_query_service.py", line 69, in get_data
    response = self.transform_request(catalog=catalog,
  File "/var/task/azul/service/elasticsearch_service.py", line 621, in transform_request
    raise e
  File "/var/task/azul/service/elasticsearch_service.py", line 606, in transform_request
    es_response = es_search.execute(ignore_cache=True)
  File "/opt/python/elasticsearch_dsl/search.py", line 702, in execute
    es.search(
  File "/opt/python/elasticsearch/client/utils.py", line 84, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/opt/python/elasticsearch/client/__init__.py", line 851, in search
    return self.transport.perform_request(
  File "/opt/python/elasticsearch/transport.py", line 351, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/opt/python/elasticsearch/connection/http_requests.py", line 161, in perform_request
    self._raise_error(response.status_code, raw_data)
  File "/opt/python/elasticsearch/connection/base.py", line 229, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'No mapping found for [contents.imaging_protocols.assay_type.keyword] in order to sort on')

In both cases, the `.keyword` field that the service tries to sort on has no mapping in the index, so there is no defined field to sort against when using these facets.

Sorting by organismAge orders results lexicographically on the value field alone and does not take the unit field into account. As a result, the broken behavior can yield 8 weeks > 7 years.

https://service.azul.data.humancellatlas.org/index/files?catalog=dcp2&filters=%7B%22genusSpecies%22%3A+%7B%22is%22%3A+%5B%22Homo+sapiens%22%5D%7D%7D&sort=organismAge&order=asc&size=1000

http 'https://service.azul.data.humancellatlas.org/index/projects?catalog=dcp2&sort=organismAge&order=asc&size=1000' | jq -r '.hits[].donorOrganisms[]|{value: .organismAge[], unit: .organismAgeUnit[]}' | tail -n 76 | head -n 40
{
  "value": "74",
  "unit": null
}
…
{
  "value": "65",
  "unit": "year"
}
…
{
  "value": "7-8",
  "unit": "week"
}
…
{
  "value": "73",
  "unit": "year"
}
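
The ordering in the output above is consistent with a plain string comparison of the value field. The following is a minimal sketch of the difference between that and a unit-aware ordering. The conversion table, the `age_key` helper and the choice to key a range such as `7-8` by its lower bound are assumptions made for illustration, not Azul's implementation, and the sketch does not handle a `null` unit.

```python
# Hypothetical unit lengths in seconds, for illustration only
SECONDS_PER_UNIT = {
    'week': 7 * 24 * 3600,
    'year': 365 * 24 * 3600,
}


def age_key(value: str, unit: str) -> float:
    # Assumption: a range such as '7-8' is keyed by its lower bound
    lower = value.split('-')[0]
    return float(lower) * SECONDS_PER_UNIT[unit]


# (value, unit) pairs taken from the output above
ages = [('73', 'year'), ('7-8', 'week'), ('65', 'year')]

# String comparison on the value alone, ignoring the unit
lexicographic = sorted(ages, key=lambda age: age[0])

# Comparison on the age normalized to seconds
unit_aware = sorted(ages, key=lambda age: age_key(*age))

print(lexicographic)
print(unit_aware)
```

With the string comparison, `7-8` weeks lands between `65` and `73` years because `'6' < '7'` and `'-' < '3'`. With the normalized key it sorts first.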

Expected:

  1. Sorting by organismAge takes both the value field and the unit field into account.

  2. Every facet exposed in the schema of the sort parameter returns a response without error.

achave11-ucsc commented 3 years ago

A possible solution is to offload the sorting of organism age fields to organismAgeRange since it is the most authoritative source for age.

amarjandu commented 3 years ago

Note that we will likely need a browser ticket so that they can switch to the correct facet to sort on for organismAge.

hannes-ucsc commented 3 years ago

The solution should make use of https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-sort.html#nested-sorting
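
As a rough sketch of what the linked documentation describes, a sort clause on a nested field in the Elasticsearch 6.x syntax carries a `nested` object that names the nested path. The `nested_sort` helper below is hypothetical, and so is the `gte` sub-field. The path is taken from the error message above on the assumption that it is the nested path; the real mapping may differ.

```python
from typing import Any, Dict


def nested_sort(path: str, field: str, order: str = 'asc') -> Dict[str, Any]:
    # When ascending, compare documents by their smallest nested value,
    # when descending by their largest.
    mode = 'min' if order == 'asc' else 'max'
    return {
        f'{path}.{field}': {
            'order': order,
            'mode': mode,
            # Tells Elasticsearch which nested object the field lives in
            'nested': {'path': path}
        }
    }


# Assumption: the nested objects have a 'gte' sub-field for the lower bound
body = {
    'sort': [nested_sort('contents.donors.organism_age_range', 'gte')]
}
```

A dictionary like `body` would be passed as part of the search request. Whether a `filter` inside `nested` is also needed depends on the mapping and is not covered here.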

nadove-ucsc commented 5 months ago

The blocking relationship is erroneous. Implementing the ability to sort by nested fields will not facilitate aggregating them.