georchestra / ansible

Ansible playbooks to deploy a fullblown geOrchestra instance
ISC License

Introducing geonetwork-cloud-searching webservice #90

pmauduit closed 2 years ago

pmauduit commented 3 years ago

This service provides a custom endpoint to get a georss representation of the GeoNetwork index in Elasticsearch, as this is not provided anymore with the v4 of GeoNetwork.

Note: this PR was created on top of the bullseye branch and has been tested mainly on that version of Debian.

landryb commented 3 years ago

sorry for lagging a bit on testing this, but after deploying the PR and visiting /geonetwork/srv/fre/rss.search in a browser, i got this:

2021-09-23 12:03:06.574 ERROR 36754 --- [nio-8580-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is java.lang.Exception: Failed to connect to index at URL http://localhost:9200/gn-records/_search?. No response processor configured for 'text/html'. Use one of rss, application/rss+xml.] with root cause

java.lang.UnsupportedOperationException: No response processor configured for 'text/html'. Use one of rss, application/rss+xml.

as a comparison point, gn3.8 doesn't blow up when queried from a browser, and offers the rss file for download (e.g. https://ids.craig.fr/geocat/srv/fre/rss.search)

force-setting the mimetype in curl seems to work and return some xml content that might be valid rss:

curl -H 'Accept: rss' https://georchestra.dev.craig.fr/geonetwork/srv/fre/rss.search

alas, our target usecase is for the rss to be consumed by the aggregator module of drupal, which seems to set no Accept header at all (or a wrong one). if i test the rss feed in the drupal module i get this in the service log:

java.lang.UnsupportedOperationException: No response processor configured for 'null'. Use one of rss, application/rss+xml.

the ua set by the rss client is "Drupal/8.9.18 (+https://www.drupal.org/) GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.3.29-1~deb10u1"

maybe the apache config should also send an Accept header instead of, or in addition to, the Content-Type header? i've tried with both headers set in /var/www/georchestra/conf/gn-cloud-searching.conf and that didn't work... or maybe the spring config should accept anything for the Accept/Content-Type headers and return rss content?

defaultMimeType in /etc/georchestra/geonetwork/microservices/searching/application.yml doesn't seem to be used either?

pmauduit commented 3 years ago

curl -H 'Accept: rss' https://georchestra.dev.craig.fr/geonetwork/srv/fre/rss.search

you need the '?f=rss' param / query string in the url, the accept header should not be necessary
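For anyone scripting against the feed, the point above boils down to putting `f=rss` in the query string rather than relying on an `Accept` header. A minimal sketch, assuming only the Python standard library; the helper name `with_rss_format` is hypothetical and not part of the service:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_rss_format(url: str) -> str:
    """Return url with the f=rss query parameter set (hypothetical helper).

    Any existing 'f' parameter is replaced; other parameters are kept.
    """
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "f"
    ]
    params.append(("f", "rss"))
    return urlunsplit(parts._replace(query=urlencode(params)))


feed_url = with_rss_format(
    "https://georchestra.dev.craig.fr/geonetwork/srv/fre/rss.search"
)
print(feed_url)
```

The resulting URL can then be handed to any client that sends no usable `Accept` header, for example `urllib.request.urlopen(feed_url)`.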

landryb commented 3 years ago

curl -H 'Accept: rss' https://georchestra.dev.craig.fr/geonetwork/srv/fre/rss.search

you need the '?f=rss' param / query string in the url, the accept header should not be necessary

ah thanks! That wasn't obvious :)

https://FQDN/geonetwork/srv/fre/rss.search?f=rss is properly rendered by drupal aggregator, which so far only renders title/description/pubDate in our usecase, cf https://www.craig.fr/aggregator/sources/1. So our usecase seems covered, afaict.

now, for the content of the rss itself (yes, i'm trying to think of usecases for the others) .. with this service we have (for a sample MD generated by datafeeder):

    <item>
      <title>Mes photos RTGE</title>
      <link>https://georchestra.dev.craig.fr/geonetwork/srv/metadata/69c1cd43-45b7-4b27-b0e8-02b978ef1764</link>
      <description>Ceci est un resume du dataset</description>
      <author>psc+testadmin@georchestra.org</author>
      <guid isPermaLink="false">69c1cd43-45b7-4b27-b0e8-02b978ef1764</guid>
      <pubDate>Wed, 12 May 2021 00:00:00 GMT</pubDate>
    </item>
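As an illustration of what a title/description/pubDate consumer gets out of such an item, here is a small sketch that parses the sample above with the Python standard library. It is not how the drupal aggregator is implemented; the function name `summarize_item` is made up for the example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# The sample item from this comment, copied as-is.
ITEM = """<item>
  <title>Mes photos RTGE</title>
  <link>https://georchestra.dev.craig.fr/geonetwork/srv/metadata/69c1cd43-45b7-4b27-b0e8-02b978ef1764</link>
  <description>Ceci est un resume du dataset</description>
  <author>psc+testadmin@georchestra.org</author>
  <guid isPermaLink="false">69c1cd43-45b7-4b27-b0e8-02b978ef1764</guid>
  <pubDate>Wed, 12 May 2021 00:00:00 GMT</pubDate>
</item>"""


def summarize_item(item_xml: str) -> dict:
    """Extract the three fields mentioned above (hypothetical helper)."""
    item = ET.fromstring(item_xml)
    return {
        "title": item.findtext("title"),
        "description": item.findtext("description"),
        # RSS 2.0 dates use the RFC 822 style, which email.utils can parse.
        "pubDate": parsedate_to_datetime(item.findtext("pubDate")),
    }


summary = summarize_item(ITEM)
print(summary["title"], summary["pubDate"].isoformat())
```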

previously with gn3.8 here's all the content that was returned for a fully populated MD:

<item>
<title>
Orthophotographie infrarouge - Département de l'Isère - PVA 2018
</title>
<link>
https://ids.craig.fr/geocat/srv/metadata/27c6a914-954c-4b00-a6d7-8d03f10d399c
</link>
<link href="http://www.craig.fr" type="text/html" rel="alternate" title="Site internet du CRAIG"/>
<link href="http://wms.craig.fr/ortho?" type="application/vnd.ogc.wms_xml" rel="alternate" title="Orthophotographie IRC 25cm 2018"/>
<category>Geographic metadata catalog</category>
<description>
<p><a href="https://ids.craig.fr/geocat/srv/metadata/27c6a914-954c-4b00-a6d7-8d03f10d399c"><img src="https://ids.craig.fr/geocat/srv/api/records/27c6a914-954c-4b00-a6d7-8d03f10d399c/attachments/vigete_ortho_irc38.jpg" align="left" alt="" border="0" width="100"/></a>Le produit "Orthophotographie infrarouge - Département de l'Isère" est une orthophotographie numérique en infrarouge, rectifiées dans la projection associée au système géodésique RGF93. La résolution au sol est de 0,25 par pixel, la précision planimétrique est de 0,50m et les dévers sont &lt; à 34%. La longueur d'onde IR est comprise entre 650 et 960 nm. L'image IRC est composée de canal IR (650 - 960 nm) + Rouge (580 - 700 nm) et Vert (480 - 640nm). es prises de vue aérienne ont été réalisées entre le 7 juillet 2018 et le 28 août 2018. La caméra utilisée est l’une des caméras IGN dites « V2 huit têtes ». La taille des images est d’environ 14000X10400 pixels. La focale utilisée pour les prises de vues départementales est la focale 125 mm.<br/></p><br clear="all"/>
</description>
<pubDate>07 Dec 2020 09:02:29 EST</pubDate>
<guid isPermaLink="true">
https://ids.craig.fr/geocat/srv/metadata/27c6a914-954c-4b00-a6d7-8d03f10d399c
</guid>
<media:content url="https://ids.craig.fr/geocat/srv/api/records/27c6a914-954c-4b00-a6d7-8d03f10d399c/attachments/vigete_ortho_irc38.jpg"/>
<!--
Bounding box in georss GML format (http://georss.org)
-->
<georss:where>
<gml:Envelope>
<gml:lowerCorner>44.6958696017857 4.74204035901312</gml:lowerCorner>
<gml:upperCorner>45.8833927311594 6.35930313759244</gml:upperCorner>
</gml:Envelope>
</georss:where>
</item>

from all those items, i think the additional links from the MD and the MD envelope could be valuable information, if they can be easily fetched from elasticsearch...
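To make the two pieces of information singled out above concrete, this sketch extracts the alternate links and the bounding box from a trimmed copy of the gn3.8 item. The namespace URIs for `georss` and `gml` are assumptions (the sample does not show its namespace declarations), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the original feed's declarations are not shown.
NS = {
    "georss": "http://www.georss.org/georss",
    "gml": "http://www.opengis.net/gml",
}

# Trimmed copy of the gn3.8 item, with the assumed namespaces declared.
ITEM = """<item xmlns:georss="http://www.georss.org/georss"
      xmlns:gml="http://www.opengis.net/gml">
  <link>https://ids.craig.fr/geocat/srv/metadata/27c6a914-954c-4b00-a6d7-8d03f10d399c</link>
  <link href="http://www.craig.fr" type="text/html" rel="alternate" title="Site internet du CRAIG"/>
  <link href="http://wms.craig.fr/ortho?" type="application/vnd.ogc.wms_xml" rel="alternate" title="Orthophotographie IRC 25cm 2018"/>
  <georss:where>
    <gml:Envelope>
      <gml:lowerCorner>44.6958696017857 4.74204035901312</gml:lowerCorner>
      <gml:upperCorner>45.8833927311594 6.35930313759244</gml:upperCorner>
    </gml:Envelope>
  </georss:where>
</item>"""


def alternate_links(item):
    """Return (title, type, href) for each link marked rel="alternate"."""
    return [
        (link.get("title"), link.get("type"), link.get("href"))
        for link in item.findall("link")
        if link.get("rel") == "alternate"
    ]


def envelope(item):
    """Return ((lat, lon), (lat, lon)) for the lower and upper corners, or None."""
    env = item.find("georss:where/gml:Envelope", NS)
    if env is None:
        return None
    lower = tuple(float(v) for v in env.findtext("gml:lowerCorner", namespaces=NS).split())
    upper = tuple(float(v) for v in env.findtext("gml:upperCorner", namespaces=NS).split())
    return lower, upper


item = ET.fromstring(ITEM)
print(alternate_links(item))
print(envelope(item))
```

The corners are read as latitude then longitude, which matches the values in the sample.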

other than that the PR integrating this in the playbook looks fine (minus the comments i already made about the template and the task name)

landryb commented 3 years ago

rebase needed so that it can be merged ?

pmauduit commented 2 years ago

rebase needed so that it can be merged ?

I'd expect git to figure out that the other branch has been merged, but I can also rebase, indeed.

pmauduit commented 2 years ago

done

fvanderbiest commented 2 years ago

Merge ?

landryb commented 2 years ago

iirc i had comments but will fix them in a followup commit

pmauduit commented 2 years ago

Note: the other microservice (ogc-api-records) also provides an RSS output (if configured to do so).

landryb commented 2 years ago

Note: the other microservice (ogc-api-records) also provides an RSS output (if configured to do so).

ugh. so much for simplification :)

landryb commented 2 years ago

Fun fact discovered while integrating this behind nginx: even if nginx sets the accept/content-type headers to application/rss+xml in the query sent to the service, it returns:

Content-Type: application/json;charset=UTF-8

while the returned content is actually XML. I'm pretty sure some clients will choke on that...

edit: bah, disregard, the Header Set stanzas in the apache config are there to set headers in the reply
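The mismatch described above (a JSON `Content-Type` on a body that is actually XML) can be checked for on the client side. A rough sketch using only the Python standard library; the function name is hypothetical and the check is deliberately simple (it only asks whether the body parses as XML):

```python
import xml.etree.ElementTree as ET


def declared_type_matches_body(content_type: str, body: bytes) -> bool:
    """Return True if the header and the body agree on being XML or not.

    Hypothetical helper: the header counts as XML when its media type
    ends in '/xml' or '+xml' (e.g. application/rss+xml).
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    declared_xml = media_type.endswith("/xml") or media_type.endswith("+xml")
    try:
        ET.fromstring(body)
        body_is_xml = True
    except ET.ParseError:
        body_is_xml = False
    return declared_xml == body_is_xml


# The situation reported above: JSON header, XML body.
print(declared_type_matches_body(
    "application/json;charset=UTF-8",
    b"<rss version='2.0'><channel/></rss>",
))
```

A strict client would reject such a response, while a lenient one would sniff the body and carry on, which is why the behaviour differs from one consumer to the next.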

fvanderbiest commented 2 years ago

while the returned content is actually XML. I'm pretty sure some clients will choke on that...

cc @fgravin for upstream report / change. Thanks !