Sage-Bionetworks / sage-monorepo

Where OpenChallenges, Schematic, and other Sage open source apps are built
https://sage-bionetworks.github.io/sage-monorepo/
Apache License 2.0

[Feature] create endpoint to filter EDAM concepts by concept types (operation, data, topic) #2633

Closed rrchai closed 2 months ago

rrchai commented 2 months ago

What product(s) is this feature for?

OpenChallenges

Description

Create an endpoint to list EDAM concepts by specified concept types.

The purpose of this endpoint is to support the creation of an input data type filter for the web app.

Anything else?

No response

tschaffter commented 2 months ago

EDAM includes 4 main sections of concepts (sub-ontologies):

Source

A suitable name for the query parameter may be sections or subOntologies. I will go with "sections".
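As a rough illustration of how a "sections" parameter could be validated on the server side, here is a minimal sketch. Everything in it is hypothetical: the class name `SectionsParamSketch`, the enum `EdamSection`, and the comma-separated format are assumptions for illustration, not the project's code. The issue title names operation, data and topic; `FORMAT` is included on the assumption that it is the fourth EDAM section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical parser for the "sections" query parameter; not the project's code. */
class SectionsParamSketch {

  // Assumption: one constant per EDAM sub-ontology. The issue title names
  // operation, data and topic; FORMAT is assumed to be the fourth section.
  enum EdamSection { DATA, FORMAT, OPERATION, TOPIC }

  /** Parses a comma-separated value such as "data,operation" (case-insensitive). */
  static List<EdamSection> parseSections(String raw) {
    List<EdamSection> sections = new ArrayList<>();
    for (String part : raw.split(",")) {
      String name = part.trim();
      if (!name.isEmpty()) {
        // valueOf throws IllegalArgumentException for an unknown section.
        sections.add(EdamSection.valueOf(name.toUpperCase(Locale.ROOT)));
      }
    }
    return sections;
  }

  public static void main(String[] args) {
    System.out.println(parseSections("data,operation")); // prints "[DATA, OPERATION]"
  }
}
```

Rejecting unknown values early would keep a typo such as `sections=dta` from silently reaching the search query.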

tschaffter commented 2 months ago

Elasticsearch filters EDAM concepts by section correctly when the query is sent directly as a URL. For example,

http://localhost:9200/openchallenges-edam-concept-000001/_search?q=section:data

Elasticsearch returns the error below when the challenge service sends the query:

GET {{basePath}}/edamConcepts?sections=data

Error:

Elasticsearch response indicates a failure.
Request: POST /openchallenges-edam-concept-read/_search with parameters {from=0, size=100, track_total_hits=true}
Response: 400 'Bad Request' from 'http://openchallenges-elasticsearch:9200' with body 
{
  "error": {
    "root_cause": [
      {
        "type": "parsing_exception",
        "reason": "[match] unknown token [VALUE_NULL] after [query]",
        "line": 1,
        "col": 91
      }
    ],
    "type": "x_content_parse_exception",
    "reason": "[1:91] [bool] failed to parse field [must]",
    "caused_by": {
      "type": "x_content_parse_exception",
      "reason": "[1:91] [bool] failed to parse field [should]",
      "caused_by": {
        "type": "parsing_exception",
        "reason": "[match] unknown token [VALUE_NULL] after [query]",
        "line": 1,
        "col": 91
      }
    }
  },
  "status": 400
}
tschaffter commented 2 months ago

Troubleshooting

There are two concepts that we can't associate with a concept section, which may be the cause of the issue. After replacing the null value with "plop", ES shows two results:

http://localhost:9200/openchallenges-edam-concept-000001/_search?q=section:plop

Output:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 7.2367706,
    "hits": [
      {
        "_index": "openchallenges-edam-concept-000001",
        "_type": "_doc",
        "_id": "3473",
        "_score": 7.2367706,
        "_source": {
          "class_id": "http://www.w3.org/2002/07/owl#DeprecatedClass",
          "section": "plop",
          "preferred_label": "DeprecatedClass",
          "_entity_type": "EdamConceptEntity"
        }
      },
      {
        "_index": "openchallenges-edam-concept-000001",
        "_type": "_doc",
        "_id": "3472",
        "_score": 7.2367706,
        "_source": {
          "class_id": "http://www.geneontology.org/formats/oboInOwl#ObsoleteClass",
          "section": "plop",
          "preferred_label": "Obsolete concept (EDAM)",
          "_entity_type": "EdamConceptEntity"
        }
      }
    ]
  }
}

With "plop" instead of null as the bridge value, the REST API no longer fails when filtering by section, but it returns the two concepts whose section is "plop" even though I'm requesting "data".

This does not make sense yet.

GET {{basePath}}/edamConcepts?sections=data

{
  "number": 0,
  "size": 100,
  "totalElements": 2,
  "totalPages": 1,
  "hasNext": false,
  "hasPrevious": false,
  "edamConcepts": [
    {
      "id": 3473,
      "classId": "http://www.w3.org/2002/07/owl#DeprecatedClass",
      "preferredLabel": "DeprecatedClass"
    },
    {
      "id": 3472,
      "classId": "http://www.geneontology.org/formats/oboInOwl#ObsoleteClass",
      "preferredLabel": "Obsolete concept (EDAM)"
    }
  ]
}

The above output shows that there is another issue besides using null for outlier concepts.

When setting the section to null again for these two deprecated concepts, ES shows that the property is correctly set to null. But then the REST API returns a 500 error as before.

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 18.267937,
    "hits": [
      {
        "_index": "openchallenges-edam-concept-000001",
        "_type": "_doc",
        "_id": "3472",
        "_score": 18.267937,
        "_source": {
          "class_id": "http://www.geneontology.org/formats/oboInOwl#ObsoleteClass",
          "section": null,
          "preferred_label": "Obsolete concept (EDAM)",
          "_entity_type": "EdamConceptEntity"
        }
      }
    ]
  }
}
tschaffter commented 2 months ago

@rrchai One implementation that works is to add a column named "section" to the EDAM concepts table. The REST API then filters the concepts properly using the following entity property:

  @Column(nullable = true)
  @KeywordField()
  private String section;

The first approach I tested was to generate this property dynamically from the concept class ID. The benefit is that we don't need to add (duplicated) data to the SQL table. I've used this approach for another property in the project, but I don't understand why it's not working in this case. I'll time-box further exploration to about 30 minutes; otherwise I will push the second solution, which involves adding the column "section" to the EDAM SQL table.

tschaffter commented 2 months ago

Query sent to ES when querying the REST API:

GET {{basePath}}/edamConcepts?sections=data
2024-04-17 23:41:35 TRACE [http-nio-8085-exec-2] org.hibernate.search.query - HSEARCH400053: Executing Elasticsearch query on '/openchallenges-edam-concept-read/_search' with parameters '{from=0, size=100, track_total_hits=true}': <{"query":{"bool":{"must":[{"match_all":{}},{"bool":{"should":{"match":{"section":{"query":"banana"}}}}}],"minimum_should_match":"0"}},"_source":false}>

If I search without the query parameter, "section" is not included in the query:

2024-04-17 23:47:11 [http-nio-8085-exec-3] TRACE org.hibernate.search.query - HSEARCH400053: Executing Elasticsearch query on '/openchallenges-edam-concept-read/_search' with parameters '{from=0, size=100, track_total_hits=true}': <{"query":{"bool":{"must":{"match_all":{}},"minimum_should_match":"0"}},"_source":false}>

The two concepts whose section is set to null (or temporarily to "banana") are returned because the generated query is incorrect: it matches on section "banana" instead of the requested "data".
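A possible explanation, not confirmed in this thread: Hibernate Search by default converts predicate arguments with the same value bridge it uses at indexing time, so the bridge may be receiving the requested section ("data") as if it were a class ID. The sketch below is a toy model of that hypothesis in plain Java. The class `SectionBridgeSketch`, the regular expression, and the `fallback` parameter are assumptions for illustration; only the class ID format and the match clause are taken from the logs above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of the section bridge and of the query it would produce; not the project's code. */
class SectionBridgeSketch {

  // Assumption: EDAM class IDs look like http://edamontology.org/<section>_<digits>.
  private static final Pattern EDAM_ID =
      Pattern.compile("^http://edamontology\\.org/([a-z]+)_\\d+$");

  /** Derives the section from a class ID, or returns the fallback when it cannot. */
  static String toIndexedValue(String value, String fallback) {
    Matcher m = EDAM_ID.matcher(value);
    return m.matches() ? m.group(1) : fallback;
  }

  /** Renders the match clause seen in the trace for an already converted value. */
  static String matchClause(String converted) {
    String json = converted == null ? "null" : "\"" + converted + "\"";
    return "{\"match\":{\"section\":{\"query\":" + json + "}}}";
  }

  public static void main(String[] args) {
    // Indexing side: the input is a class ID, so the section is derived correctly.
    System.out.println(toIndexedValue("http://edamontology.org/data_0884", null));
    // Query side: the input is the requested section, which is not a class ID,
    // so the bridge falls back to "banana" (or to null).
    System.out.println(matchClause(toIndexedValue("data", "banana")));
    System.out.println(matchClause(toIndexedValue("data", null)));
  }
}
```

Under this hypothesis, a "banana" fallback would produce the match clause seen in the trace, and a null fallback would produce a match clause with a null query value, which is consistent with the `VALUE_NULL` parsing error reported earlier. It would also explain why the approach that stores "section" in a plain column works, since no bridge is involved there.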

tschaffter commented 2 months ago

This shows that the value bridge is used during mass indexing as expected:

2024-04-18 00:52:28 [Hibernate Search - Mass indexing - EdamConceptEntity - ID loading - 0] INFO  o.h.s.m.p.m.i.PojoMassIndexingLoggingMonitor - HSEARCH000027: Mass indexing is going to index 3473 entities.
2024-04-18 00:52:28 [Hibernate Search - Mass indexing - EdamConceptEntity - Entity loading - 2] INFO  o.s.o.c.s.m.s.EdamSectionValueBridge - toIndexedValue value: http://edamontology.org/data_0884
2024-04-18 00:52:28 [Hibernate Search - Mass indexing - EdamConceptEntity - Entity loading - 0] INFO  o.s.o.c.s.m.s.EdamSectionValueBridge - toIndexedValue value: http://edamontology.org/data_0005