epifanio / adc-pycsw

Setup and deployment of PyCSW for the Arctic Data Centre (ADC) project

handling OR query #11

Open epifanio opened 10 months ago

epifanio commented 10 months ago

We need to add support for multiple queries combined with OR, like:

epifanio commented 10 months ago

@ferrighi @magnarem

Assuming I want to execute the following query:

field_to_query = text_a in bbox_1

OR

field_to_query = text_b in bbox_2

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<csw:GetRecords xmlns:apiso="http://www.opengis.net/cat/csw/apiso/1.0" xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:ogc="http://www.opengis.net/ogc" service="CSW" version="2.0.2" resultType="results" startPosition="1" maxRecords="5" outputFormat="application/xml" outputSchema="http://www.isotc211.org/2005/gmd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd" xmlns:gml="http://www.opengis.net/gml" xmlns:gmd="http://www.isotc211.org/2005/gmd">
    <csw:Query typeNames="gmd:MD_Metadata">
        <csw:ElementSetName>brief</csw:ElementSetName>
        <csw:Constraint version="1.1.0">
            <ogc:Filter>
                <ogc:Or>
                    <ogc:And>
                        <ogc:PropertyIsLike wildCard="*" singleChar="?" escapeChar="\\" matchCase="false">
                            <ogc:PropertyName>dc:{field_to_query}</ogc:PropertyName>
                            <ogc:Literal>text_a</ogc:Literal>
                        </ogc:PropertyIsLike>
                        <ogc:BBOX>
                            <ogc:PropertyName>apiso:BoundingBox</ogc:PropertyName>
                                <gml:Envelope>
                                    <gml:lowerCorner>{lowerCorner_1}</gml:lowerCorner>
                                    <gml:upperCorner>{upperCorner_1}</gml:upperCorner>
                                </gml:Envelope>
                        </ogc:BBOX>
                    </ogc:And>
                    <ogc:And>
                        <ogc:PropertyIsLike wildCard="*" singleChar="?" escapeChar="\\" matchCase="false">
                            <ogc:PropertyName>dc:{field_to_query}</ogc:PropertyName>
                            <ogc:Literal>text_b</ogc:Literal>
                        </ogc:PropertyIsLike>
                        <ogc:BBOX>
                            <ogc:PropertyName>apiso:BoundingBox</ogc:PropertyName>
                                <gml:Envelope>
                                    <gml:lowerCorner>{lowerCorner_2}</gml:lowerCorner>
                                    <gml:upperCorner>{upperCorner_2}</gml:upperCorner>
                                </gml:Envelope>
                        </ogc:BBOX>
                    </ogc:And>
                </ogc:Or>
            </ogc:Filter>
        </csw:Constraint>
    </csw:Query>
</csw:GetRecords>

What will the equivalent Solr syntax look like? Would it be something like:

{
     "q": "*:*",
     "q.op": "OR",
     "start": 0,
     "rows": "5",
     "fq": [
             "metadata_status:Active",
             "collection:ADC",
             "field_to_query:(text_a text_b)",
             "{!field f=bbox score=overlapRatio}Within(ENVELOPE_1)",
             "{!field f=bbox score=overlapRatio}Within(ENVELOPE_2)"
           ]
}

magnarem commented 10 months ago

To search for keywords in specific fields, the syntax is fieldname:querystring. For example, "q": "title:ice abstract:core" has one clause matching ice in the title and one matching core in the abstract. The q.op parameter then decides whether these two clauses are combined with OR or with AND.

So it will be something like this:

{
     "q": "field_to_query:(text_a text_b)",
     "q.op": "OR",
     "start": 0,
     "rows": "5",
     "fq": [
             "metadata_status:Active",
             "collection:ADC",
             "{!field f=bbox score=overlapRatio}Within(ENVELOPE_1)",
             "{!field f=bbox score=overlapRatio}Within(ENVELOPE_2)"
           ]
}

I am a bit unsure how the bbox filters will work. I will look into it.

magnarem commented 10 months ago

After investigating a bit more, the correct Solr query for the CSW query in this issue is:

{
     "q": "(title:wind && _query_:\"{!field f=bbox}Within(ENVELOPE(13.50,20.24,78.03,76.48))\") || (abstract:ice && _query_:\"{!field f=bbox}Within(ENVELOPE(17.45,28.63,80.92,78.32))\")",
     "q.op": "OR",
     "start": 0,
     "rows": "5",
     "fq": [
             "metadata_status:Active",
             "collection:ADC"
           ]
}

This returns all documents that have the word wind in the title field and are inside the bounding box ENVELOPE(13.50,20.24,78.03,76.48), together with all documents that have the word ice in the abstract field and are inside the bounding box ENVELOPE(17.45,28.63,80.92,78.32).

Or, as pseudocode:

{
     "q": "(<FIELD_TO_QUERY>:<TEXT_A> && _query_:\"{!field f=bbox}Within(<ENVELOPE_1>)\") || (<FIELD_TO_QUERY>:<TEXT_B> && _query_:\"{!field f=bbox}Within(<ENVELOPE_2>)\")",
     "q.op": "OR",
     "start": 0,
     "rows": "5",
     "fq": [
             "metadata_status:Active",
             "collection:ADC"
           ]
}

The && can be replaced by AND, and || by OR. It is a matter of taste.
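
As a rough illustration of the pattern above, the following Python sketch assembles the payload from a list of (field, text, envelope) clauses. The helper names bbox_clause and build_or_query are hypothetical and not part of adc-pycsw; the fq values are copied from the examples in this thread. It assumes the search text needs no Solr escaping, and it lets json.dumps take care of escaping the inner double quotes.

```python
import json


def bbox_clause(envelope, field="bbox"):
    """Wrap a spatial filter as a nested query so it can be used inside q."""
    return '_query_:"{{!field f={f}}}Within({env})"'.format(f=field, env=envelope)


def build_or_query(clauses, rows=5, start=0):
    """Build a Solr JSON payload from (field, text, envelope) tuples.

    Each tuple becomes "(field:text AND bbox)"; the groups are joined by OR.
    Assumption: text contains no characters that need escaping in Solr.
    """
    groups = [
        "({field}:{text} AND {bbox})".format(
            field=field, text=text, bbox=bbox_clause(envelope)
        )
        for field, text, envelope in clauses
    ]
    return {
        "q": " OR ".join(groups),
        "q.op": "OR",
        "start": start,
        "rows": rows,
        # Fixed filters taken from the examples in this thread.
        "fq": ["metadata_status:Active", "collection:ADC"],
    }


payload = build_or_query([
    ("title", "wind", "ENVELOPE(13.50,20.24,78.03,76.48)"),
    ("abstract", "ice", "ENVELOPE(17.45,28.63,80.92,78.32)"),
])
print(json.dumps(payload, indent=2))
```

Building the payload as a dict and serialising it at the end avoids hand-escaping the quotes around the nested _query_ clauses.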

epifanio commented 9 months ago

@magnarem

I have tested both queries:

{
    "q": "(title:protected && _query_:\"{!field f=bbox score=overlapRatio}Within(ENVELOPE(60.0,90.0,180.0,0.0))\") OR (title:leaf && _query_:\"{!field f=bbox score=overlapRatio}Within(ENVELOPE(65.0,90.0,180.0,0.0))\")",
    "q.op": "OR",
    "start": 0,
    "rows": "5",
    "fq": [
        "metadata_status:Active",
        "collection:(ADC)"
    ]
}

and:

{
    "q": "*:*",
    "q.op": "OR",
    "start": 0,
    "rows": "5",
    "fq": [
        "metadata_status:Active",
        "collection:ADC",
        "title:(protected leaf)",
        "{!field f=bbox score=overlapRatio}Within(ENVELOPE(65.0,90.0,180.0,0.0))",
        "{!field f=bbox score=overlapRatio}Within(ENVELOPE(60.0,90.0,180.0,0.0))"
    ]
}

They both return the same result (1 record). Can you confirm that the two queries above are equivalent?

magnarem commented 9 months ago

@epifanio. The queries give the same result here, but they are not necessarily equivalent. See here for the difference between the q parameter and the fq parameter.

So the first query, where the conditions are placed in the q parameter, is logically the closest match to the CSW XML query.

However, this example is not ideal, because no document in the index matches the second part of the query: (title:leaf && _query_:"{!field f=bbox score=overlapRatio}Within(ENVELOPE(65.0,90.0,180.0,0.0))") (http://SOLR/solr/adc/select?debugQuery=true&fl=id%2Ctitle&fq=collection%3A(ADC)&fq=metadata_status%3AActive&indent=true&q.op=OR&q=(title%3Aleaf%20%26%26%20_query_%3A%22%7B!field%20f%3Dbbox%20score%3DoverlapRatio%7DWithin(ENVELOPE(65.0%2C90.0%2C180.0%2C0.0))%22)&rows=5)

So this test cannot really show the difference, since only the first part of the query matches anything.
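
To illustrate why the two forms can diverge, here is a small Python sketch that runs on a made-up in-memory index rather than on Solr. The four documents and the in_bbox_1 / in_bbox_2 flags are invented for the illustration. It relies on the fact that every fq entry restricts the result set, so several fq entries are intersected, while the q form keeps each text and bbox pair together.

```python
# Invented toy index: a title plus one flag per bounding box.
docs = [
    {"id": 1, "title": "protected", "in_bbox_1": True, "in_bbox_2": True},
    {"id": 2, "title": "protected", "in_bbox_1": True, "in_bbox_2": False},
    {"id": 3, "title": "leaf", "in_bbox_1": False, "in_bbox_2": True},
    {"id": 4, "title": "leaf", "in_bbox_1": True, "in_bbox_2": False},
]


def q_form(doc):
    """(title:protected AND bbox_1) OR (title:leaf AND bbox_2)."""
    return (doc["title"] == "protected" and doc["in_bbox_1"]) or (
        doc["title"] == "leaf" and doc["in_bbox_2"]
    )


def fq_form(doc):
    """title:(protected leaf) as one fq, plus one fq per bbox: all must match."""
    return (
        doc["title"] in ("protected", "leaf")
        and doc["in_bbox_1"]
        and doc["in_bbox_2"]
    )


q_ids = [d["id"] for d in docs if q_form(d)]
fq_ids = [d["id"] for d in docs if fq_form(d)]
print("q form: ", q_ids)
print("fq form:", fq_ids)
```

On this toy data the q form matches documents 1, 2 and 3, while the fq form matches only document 1, because a document has to fall inside both boxes to pass both filters.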

epifanio commented 9 months ago

I've implemented the code for both. I will prioritize the form where each field AND bbox pair is joined by OR in the main q parameter, as in:

{
    "q": "(title:protected && _query_:\"{!field f=bbox score=overlapRatio}Within(ENVELOPE(60.0,90.0,180.0,0.0))\") OR (title:leaf && _query_:\"{!field f=bbox score=overlapRatio}Within(ENVELOPE(65.0,90.0,180.0,0.0))\")",
    "q.op": "OR",
    "start": 0,
    "rows": "5",
    "fq": [
        "metadata_status:Active",
        "collection:(ADC)"
    ]
}
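
For reference, a minimal sketch of how an OR of (PropertyIsLike AND BBOX) groups could be translated into such a q string. This is not the code implemented in adc-pycsw. The function names are hypothetical, and it assumes that the dc: prefix maps directly to a Solr field of the same name, that the gml corners are given as "x y" (longitude latitude), and that the filter has exactly the Or/And shape from the CSW request in this issue. The embedded filter is a simplified version of that request with concrete values.

```python
import xml.etree.ElementTree as ET

NS = {
    "ogc": "http://www.opengis.net/ogc",
    "gml": "http://www.opengis.net/gml",
}

# Simplified filter with the same Or/And structure as the CSW request.
FILTER_XML = """
<ogc:Filter xmlns:ogc="http://www.opengis.net/ogc" xmlns:gml="http://www.opengis.net/gml">
  <ogc:Or>
    <ogc:And>
      <ogc:PropertyIsLike>
        <ogc:PropertyName>dc:title</ogc:PropertyName>
        <ogc:Literal>wind</ogc:Literal>
      </ogc:PropertyIsLike>
      <ogc:BBOX>
        <ogc:PropertyName>apiso:BoundingBox</ogc:PropertyName>
        <gml:Envelope>
          <gml:lowerCorner>13.50 76.48</gml:lowerCorner>
          <gml:upperCorner>20.24 78.03</gml:upperCorner>
        </gml:Envelope>
      </ogc:BBOX>
    </ogc:And>
    <ogc:And>
      <ogc:PropertyIsLike>
        <ogc:PropertyName>dc:abstract</ogc:PropertyName>
        <ogc:Literal>ice</ogc:Literal>
      </ogc:PropertyIsLike>
      <ogc:BBOX>
        <ogc:PropertyName>apiso:BoundingBox</ogc:PropertyName>
        <gml:Envelope>
          <gml:lowerCorner>17.45 78.32</gml:lowerCorner>
          <gml:upperCorner>28.63 80.92</gml:upperCorner>
        </gml:Envelope>
      </ogc:BBOX>
    </ogc:And>
  </ogc:Or>
</ogc:Filter>
"""


def envelope(bbox):
    """Convert a gml:Envelope to Solr's ENVELOPE(minX, maxX, maxY, minY)."""
    # Assumption: corners are "x y", i.e. longitude first, then latitude.
    min_x, min_y = bbox.find("gml:Envelope/gml:lowerCorner", NS).text.split()
    max_x, max_y = bbox.find("gml:Envelope/gml:upperCorner", NS).text.split()
    return "ENVELOPE({},{},{},{})".format(min_x, max_x, max_y, min_y)


def filter_to_q(filter_xml):
    """Turn each ogc:And group into "(field:text AND bbox)" and join with OR."""
    root = ET.fromstring(filter_xml)
    groups = []
    for and_node in root.findall("ogc:Or/ogc:And", NS):
        like = and_node.find("ogc:PropertyIsLike", NS)
        # Assumption: "dc:title" maps to the Solr field "title".
        field = like.find("ogc:PropertyName", NS).text.split(":")[-1]
        text = like.find("ogc:Literal", NS).text
        env = envelope(and_node.find("ogc:BBOX", NS))
        groups.append(
            '({}:{} AND _query_:"{{!field f=bbox}}Within({})")'.format(field, text, env)
        )
    return " OR ".join(groups)


print(filter_to_q(FILTER_XML))
```

Axis order in GML depends on the coordinate reference system, so the corner handling in envelope would need checking against what the CSW clients actually send.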