AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Advanced Search Support #3186

Open davidsmejia opened 1 year ago

davidsmejia commented 1 year ago

Context

We want to add more advanced search features to the Search API endpoint. These features will let API users construct queries with nested logical operators. There will be some restrictions on the client side for how these are applied, but I think we should try to find a general solution to this problem.

Problem or idea

Currently we support basic text search plus selected faceted search parameters / filters that can be applied to the results. Different facets are combined with a logical "AND", and multiple values for the same facet are combined with a logical "OR".

Ex: Given the following query:

search?organism=DANIO_RERIO&organism=HOMO_SAPIENS&technology=microarray

This will be interpreted as:

technology=MICROARRAY AND (organism=DANIO_RERIO OR organism=HOMO_SAPIENS)

Now consider a user who wants to look at (DANIO_RERIO AND microarray) OR (HOMO_SAPIENS AND rna-seq).
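To make the gap concrete, here is a small in-memory sketch in plain Python (not the refine.bio code) that applies the facet semantics described above to four made-up samples and compares the result with the expression the user wants. The sample records and the function names `faceted_match` and `wanted_match` are illustrative assumptions.

```python
from urllib.parse import parse_qs

# Four made-up samples covering every organism/technology combination.
SAMPLES = [
    {"organism": "DANIO_RERIO", "technology": "microarray"},
    {"organism": "DANIO_RERIO", "technology": "rna-seq"},
    {"organism": "HOMO_SAPIENS", "technology": "microarray"},
    {"organism": "HOMO_SAPIENS", "technology": "rna-seq"},
]


def faceted_match(sample, query_string):
    """Facet semantics: AND across facets, OR across values of one facet."""
    facets = parse_qs(query_string)
    return all(sample[facet] in values for facet, values in facets.items())


def wanted_match(sample):
    """(DANIO_RERIO AND microarray) OR (HOMO_SAPIENS AND rna-seq)."""
    pair = (sample["organism"], sample["technology"])
    return pair in {("DANIO_RERIO", "microarray"), ("HOMO_SAPIENS", "rna-seq")}


# The closest faceted query has to list both organisms and both technologies.
closest = (
    "organism=DANIO_RERIO&organism=HOMO_SAPIENS"
    "&technology=microarray&technology=rna-seq"
)
faceted_hits = [s for s in SAMPLES if faceted_match(s, closest)]
wanted_hits = [s for s in SAMPLES if wanted_match(s)]

print(len(faceted_hits), len(wanted_hits))  # prints: 4 2
```

The faceted query also returns DANIO_RERIO with rna-seq and HOMO_SAPIENS with microarray, which the user did not ask for.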

This is not supported out of the box. However, we do have access to the Search object from the search_dsl module, and we can override the default self.search in the Search endpoint controller.

In order to support this feature we will want to be able to answer the following questions:

Solution or next step

I am currently leaning toward a POST request with a JSON body. Encoding the query would be trivial given the freedom that JSON allows.

The idea is that a parent node defines the operator and each child is either a non-logical value (a key/value pair) or another logical node, nested to arbitrary depth. This doesn't solve how we would generate the structure (how to choose between different but equivalent ways of grouping the same logical expression), but a simple and expressive pattern like this is what I am hoping to accomplish.

{
  "query": {
    "operator": "OR",
    "value": [{
      "operator": "AND",
      "value": [{
        "key": "organism",
        "value": "DANIO_RERIO"
      },{
        "key": "technology",
        "value": "microarray"
      }]
    },{
      "operator": "AND",
      "value": [{
        "key": "organism",
        "value": "HOMO_SAPIENS"
      },{
        "key": "technology",
        "value": "rna-seq"
      }]
    }]
  }
}
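Since the structure is arbitrarily nestable, the endpoint would probably want to check a payload before converting it. The following is a minimal sketch of such a check, assuming only the two node shapes described above (a logical node with "operator" and a list "value", or a leaf with "key" and "value"). The name `validate_node`, the allowed operator set, and the depth limit are hypothetical and not part of the proposal.

```python
ALLOWED_OPERATORS = {"AND", "OR"}
MAX_DEPTH = 5  # hypothetical limit; the issue does not specify one


def validate_node(node, depth=0):
    """Raise ValueError unless `node` is a well-formed logical or leaf node."""
    if depth > MAX_DEPTH:
        raise ValueError("query is nested too deeply")
    if not isinstance(node, dict):
        raise ValueError("each node must be a JSON object")
    if "operator" in node:
        # Logical node: a known operator and a non-empty list of children.
        if node["operator"] not in ALLOWED_OPERATORS:
            raise ValueError(f"unknown operator: {node['operator']!r}")
        children = node.get("value")
        if not isinstance(children, list) or not children:
            raise ValueError("a logical node needs a non-empty list of children")
        for child in children:
            validate_node(child, depth + 1)
    elif not (isinstance(node.get("key"), str) and isinstance(node.get("value"), str)):
        # Leaf node: a single facet name and the value to match.
        raise ValueError("a leaf node needs string 'key' and 'value' fields")


example = {
    "operator": "OR",
    "value": [
        {"key": "organism", "value": "DANIO_RERIO"},
        {"operator": "AND", "value": [{"key": "technology", "value": "rna-seq"}]},
    ],
}
validate_node(example)  # returns None when the payload is well formed
```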

Then we would convert this using the Search query objects and their logical operators. Example from elasticsearch-dsl:

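from elasticsearch_dsl import Q

# OR: at least one of the clauses should match
Q("match", title="python") | Q("match", title="django")
# {"bool": {"should": [...]}}

# AND: every clause must match
Q("match", title="python") & Q("match", title="django")
# {"bool": {"must": [...]}}

# NOT: the clause must not match
~Q("match", title="python")
# {"bool": {"must_not": [...]}}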
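The conversion itself can be sketched as a short recursive function. The version below builds the raw Elasticsearch bool query as plain dicts, so it runs without any dependencies. It maps AND to "must" and OR to "should", which is the same shape the `&` and `|` operators produce. The function name `to_es_query` and the choice of a `term` query for leaves are assumptions; the real field mappings may call for a different leaf query.

```python
import json

# AND maps to "must" and OR maps to "should", mirroring the & and | operators.
CLAUSE_FOR_OPERATOR = {"AND": "must", "OR": "should"}


def to_es_query(node):
    """Recursively convert a proposed query node into an Elasticsearch query dict."""
    if "operator" in node:
        clause = CLAUSE_FOR_OPERATOR[node["operator"]]
        children = [to_es_query(child) for child in node["value"]]
        bool_query = {clause: children}
        if clause == "should":
            # Make "at least one child must match" explicit for OR nodes.
            bool_query["minimum_should_match"] = 1
        return {"bool": bool_query}
    # Leaf node. Using a term query here is an assumption.
    return {"term": {node["key"]: node["value"]}}


payload = {
    "query": {
        "operator": "OR",
        "value": [
            {"operator": "AND", "value": [
                {"key": "organism", "value": "DANIO_RERIO"},
                {"key": "technology", "value": "microarray"},
            ]},
            {"operator": "AND", "value": [
                {"key": "organism", "value": "HOMO_SAPIENS"},
                {"key": "technology", "value": "rna-seq"},
            ]},
        ],
    }
}

es_query = to_es_query(payload["query"])
print(json.dumps(es_query, indent=2))
```

elasticsearch-dsl documents that Q also accepts a raw dict, so the result could presumably be wrapped in Q(...) and passed to the Search object, though that wiring is untested here.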

Sources:

The next steps are to discuss this proposal, surface any further questions that need answering before implementation can start, and answer the existing ones.

Inquiries and comments are welcome below.

nozomione commented 1 year ago

Some ideas 💡:
https://restdb.io/docs/querying-with-the-api
https://autotask.net/help/developerhelp/content/apis/rest/API_Calls/REST_Basic_Query_Calls.htm
https://www.baeldung.com/openapi-json-query-parameters

davidsmejia commented 1 year ago

URL-encoding the advanced options might be a good idea because we can "append" the advanced query on top of the existing URL parameters that get converted to the Elasticsearch DSL. We would also be able to simply add another parameter to the documentation.
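A rough sketch of what that could look like, using only the standard library: the nested query is serialized to JSON and carried in a single extra query parameter next to the existing facet parameters. The parameter name `advanced` and both helper names are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode

ADVANCED_PARAM = "advanced"  # hypothetical parameter name


def build_query_string(facets, advanced_query):
    """Append the JSON-encoded advanced query to the existing facet parameters."""
    params = list(facets)
    compact = json.dumps(advanced_query, separators=(",", ":"))
    params.append((ADVANCED_PARAM, compact))
    return urlencode(params)


def split_query_string(query_string):
    """Separate the ordinary facet parameters from the decoded advanced query."""
    params = parse_qs(query_string)
    raw = params.pop(ADVANCED_PARAM, [None])[0]
    advanced = json.loads(raw) if raw is not None else None
    return params, advanced


advanced_query = {
    "operator": "OR",
    "value": [
        {"key": "organism", "value": "DANIO_RERIO"},
        {"key": "organism", "value": "HOMO_SAPIENS"},
    ],
}
query_string = build_query_string([("technology", "microarray")], advanced_query)
facets, decoded = split_query_string(query_string)
print(query_string)
```

One trade-off to keep in mind is that deeply nested queries make for long URLs, which is part of what the POST option avoids.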

I think it would be nice to look for existing modules or libraries to help convert the JSON object to the logic that modifies the Search object.