GeoNode / geonode

GeoNode is an open source platform that facilitates the creation, sharing, and collaborative use of geospatial data.
https://geonode.org/
Other
1.43k stars 1.12k forks source link

CSW search by keyword (`dc:subject`) is unsupported #11207

Open fvicent opened 1 year ago

fvicent commented 1 year ago

Expected Behavior

You should be able to filter by dc:subject (=keywords in pycsw and geonode) within a CSW query.

Actual Behavior

A simple query filtering by keyword returns "Invalid query syntax".

Steps to Reproduce the Problem

Make a GetRecords request against the CSW endpoint:

curl -d "@payload.xml" -X POST "http://localhost/catalogue/csw?service=CSW&version=2.0.2"

With the following payload.xml:

<csw:GetRecords
    xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ogc="http://www.opengis.net/ogc"
    xmlns:gml="http://www.opengis.net/gml"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco"
    xmlns:gmi="http://www.isotc211.org/2005/gmi"
    xmlns:ows="http://www.opengis.net/ows" service="CSW" version="2.0.2" resultType="results" startPosition="1" maxRecords="4">
    <csw:Query typeNames="csw:Record">
        <csw:ElementSetName>full</csw:ElementSetName>
        <csw:Constraint version="1.1.0">
            <ogc:Filter>
                <ogc:PropertyIsEqualTo>
                    <ogc:PropertyName>dc:subject</ogc:PropertyName>
                    <ogc:Literal>something</ogc:Literal>
                </ogc:PropertyIsEqualTo>
            </ogc:Filter>
        </csw:Constraint>
    </csw:Query>
</csw:GetRecords>

Response:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- pycsw 2.6.1 -->
<ows:ExceptionReport xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:gmd="http://www.isotc211.org/2005/gmd" xmlns:gml="http://www.opengis.net/gml" xmlns:ows="http://www.opengis.net/ows" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.2.0" language="en" xsi:schemaLocation="http://www.opengis.net/ows http://schemas.opengis.net/ows/1.0.0/owsExceptionReport.xsd">
  <ows:Exception exceptionCode="InvalidParameterValue" locator="constraint">
    <ows:ExceptionText>Invalid query syntax</ows:ExceptionText>
  </ows:Exception>
</ows:ExceptionReport>

But the syntax is valid, since replacing dc:subject by dc:title works as expected.

This is the exception reported by pycsw:

Invalid query syntax.  Query: {'type': 'filter', 'where': 'keyword_csv = %s', 'values': ['asd'], '_dict': {'ogc:Filter': {'ogc:PropertyIsEqualTo': {'ogc:PropertyName': 'dc:subject', 'ogc:Literal': 'asd'}}}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UndefinedColumn: column "keyword_csv" does not exist
LINE 1: ...et') AND (uuid IS NOT NULL AND id IN (1, 2)) AND (keyword_cs...
                                                             ^

The above exception was the direct cause of the following exception:

Invalid query syntax.  Query: {'type': 'filter', 'where': 'keyword_csv = %s', 'values': ['asd'], '_dict': {'ogc:Filter': {'ogc:PropertyIsEqualTo': {'ogc:PropertyName': 'dc:subject', 'ogc:Literal': 'asd'}}}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
psycopg2.errors.UndefinedColumn: column "keyword_csv" does not exist
LINE 1: ...et') AND (uuid IS NOT NULL AND id IN (1, 2)) AND (keyword_cs...
                                                             ^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pycsw/ogc/csw/csw2.py", line 821, in getrecords
    matched, results = self.parent.repository.query(
  File "/usr/src/geonode/./geonode/catalogue/backends/pycsw_plugin.py", line 165, in query
    total = query.count()
  File "/usr/local/lib/python3.10/dist-packages/django/db/models/query.py", line 412, in count
    return self.query.get_count(using=self.db)
  File "/usr/local/lib/python3.10/dist-packages/django/db/models/sql/query.py", line 528, in get_count
    number = obj.get_aggregation(using, ['__count'])['__count']
  File "/usr/local/lib/python3.10/dist-packages/django/db/models/sql/query.py", line 513, in get_aggregation
    result = compiler.execute_sql(SINGLE)
  File "/usr/local/lib/python3.10/dist-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.10/dist-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/usr/local/lib/python3.10/dist-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.10/dist-packages/django/db/backends/utils.py", line 79, in _execute
    with self.db.wrap_database_errors:
  File "/usr/local/lib/python3.10/dist-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.10/dist-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: column "keyword_csv" does not exist
LINE 1: ...et') AND (uuid IS NOT NULL AND id IN (1, 2)) AND (keyword_cs...

Which makes sense since keyword_csv is no table field but a model property:

https://github.com/GeoNode/geonode/blob/f3a490a2c2b38975351a887af720e7b7282b385e/geonode/base/models.py#L1339-L1348

Specifications

giohappy commented 1 year ago

@fvicent as you already noticed the dc:subject filter is not backed by a real keyword field, so this is a feature request. What is wrong is the returned error "invalid query syntax". "Unsupported query syntax" would probably be more appropriate.

Would you mind changing the title to "CSW search by keyword (dc:subject) is unsupported"?

fvicent commented 1 year ago

@giohappy I am willing to work on this if you are ok with this addition. If I am right, this is somewhat similar to the csw:AnyText case, whose corresponding field csw_anytext gets populated whenever a dataset or document is saved, in order to support search via CSW. The same could be done for keywords. Or, maybe better, the repository could build a custom database query when dc:subject is requested for search.

giohappy commented 1 year ago

@fvicent that would be great! I would certainly go for the second option, i.e. manipulating the query executed by the PyCSW backend. I invite you to take a look at the CatalogueBackend since there are already parts related to keywords (and links).

etj commented 1 year ago

I would move back the label from feature to bug since dc:subject is one of the core CSW queriables.

Pls note that GeoNode declares dc:subject as a queryable; having GeoNode/pyCSW not accepting such filter is indeed a bug.

giohappy commented 1 year ago

Good point @etj, you're right.

fvicent commented 1 year ago

I've managed to get this working with Postgres by tweaking the SQL filter (keyword_csv = %s) generated by pycsw and executed in GeoNodeRepository.query():

index 7c951b235..d3eadf485 100644
--- a/geonode/catalogue/backends/pycsw_plugin.py
+++ b/geonode/catalogue/backends/pycsw_plugin.py
@@ -41,6 +41,15 @@ GEONODE_SERVICE_TYPES = {
     "urn:x-esri:serviceType:ArcGIS:ImageServer": "ESRI:ArcGIS:ImageServer",
 }

+SELECT_KEYWORD_CSV_QUERY = """\
+(SELECT STRING_AGG("base_hierarchicalkeyword"."slug", ',')
+FROM "base_taggedcontentitem"
+INNER JOIN "base_hierarchicalkeyword"
+ON ("base_taggedcontentitem"."tag_id" = "base_hierarchicalkeyword"."id")
+WHERE "base_taggedcontentitem"."content_object_id" = "base_resourcebase"."id"
+GROUP BY "base_resourcebase"."id")\
+"""
+

 class GeoNodeRepository(Repository):
     """
@@ -156,6 +165,10 @@ class GeoNodeRepository(Repository):
         pycsw_filters = settings.PYCSW.get("FILTER", {"resource_type__in": ["dataset"]})

         if "where" in constraint:  # GetRecords with constraint
+            constraint["where"] = (
+                constraint["where"]
+                .replace("keyword_csv", SELECT_KEYWORD_CSV_QUERY)
+            )
             query = self._get_repo_filter(ResourceBase.objects.filter(**pycsw_filters)).extra(
                 where=[constraint["where"]], params=constraint["values"]
             )

It would be much simpler to use Django's ORM instead of plain SQL, but having filters built and passed by pycsw as SQL code is a serious impediment. I was thinking in something like this, which unfortunately won't work:

index 7c951b235..c48eeb120 100644
--- a/geonode/catalogue/backends/pycsw_plugin.py
+++ b/geonode/catalogue/backends/pycsw_plugin.py
@@ -19,6 +19,7 @@

 import logging

+from django.contrib.postgres.aggregates import StringAgg
 from django.db import connection
 from django.db.models import Max, Min, Count
 from django.conf import settings
@@ -156,8 +157,19 @@ class GeoNodeRepository(Repository):
         pycsw_filters = settings.PYCSW.get("FILTER", {"resource_type__in": ["dataset"]})

         if "where" in constraint:  # GetRecords with constraint
-            query = self._get_repo_filter(ResourceBase.objects.filter(**pycsw_filters)).extra(
-                where=[constraint["where"]], params=constraint["values"]
+            query = (
+                self._get_repo_filter(
+                    ResourceBase.objects.filter(**pycsw_filters)
+                )
+                .annotate(
+                    keyword_csv=StringAgg(
+                        "keywords__slug",
+                        delimiter=","
+                    )
+                )
+                .extra(
+                    where=[constraint["where"]], params=constraint["values"]
+                )
             )
         else:  # GetRecords sans constraint
             query = self._get_repo_filter(ResourceBase.objects.filter(**pycsw_filters))

It seems that annotated fields can't be used within .extra().

Another option would be to rewrite the query() method and get rid of Django's ORM stuff, using plain SQL instead. Then building a query like this one would be much easier:

SELECT * FROM (
    SELECT "base_resourcebase"."id",
           "base_resourcebase"."title",
           "base_resourcebase"."resource_type",
           STRING_AGG("base_hierarchicalkeyword"."slug", ',') AS keyword_csv
    FROM "base_resourcebase"
    INNER JOIN "base_taggedcontentitem" ON ("base_resourcebase"."id" = "base_taggedcontentitem"."content_object_id")
    INNER JOIN "base_hierarchicalkeyword" ON ("base_taggedcontentitem"."tag_id" = "base_hierarchicalkeyword"."id")
    AND "base_resourcebase"."resource_type" = 'layer'
    GROUP BY "base_resourcebase"."id"
) q
WHERE keyword_csv = 'something';   -- <--- Filters passed by pycsw

I couldn't manage to get Django's ORM generate an SQL alike.

@giohappy @etj Thoughts on this?