Ouranosinc / pavics-vdb

Store virtual netCDF file aggregations and metadata fixes

Import NcML service metadata to SOLR table #2

Closed tlogan2000 closed 1 year ago

tlogan2000 commented 4 years ago

More or less linked to this in https://github.com/Ouranosinc/PAVICS-DataCatalog/issues/52. But looking specifically at importing to the SOLR database: example NcML service:

https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/nrcan/nrcan_canada_daily/tasmin/nrcan_canada_daily_tasmin_2012.nc?catalog=https%3A%2F%2Fpavics.ouranos.ca%2Ftwitcher%2Fows%2Fproxy%2Fthredds%2Fcatalog%2Fbirdhouse%2Fnrcan%2Fnrcan_canada_daily%2Ftasmin%2Fcatalog.html&dataset=birdhouse%2Fnrcan%2Fnrcan_canada_daily%2Ftasmin%2Fnrcan_canada_daily_tasmin_2012.nc
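The NcML service returns an XML document whose dataset-level `<attribute>` elements carry the metadata we want to import. A minimal sketch of pulling those attributes out with the standard library (the sample NcML below is illustrative, not the real service output):

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS NcML service responses.
NCML_NS = "{http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2}"

# Minimal sample mimicking the NcML service output linked above;
# in practice the document would be fetched from that URL.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <attribute name="title" value="NRCAN ANUSPLIN daily gridded dataset"/>
  <attribute name="frequency" value="day"/>
  <variable name="tasmin" type="float">
    <attribute name="standard_name" value="air_temperature"/>
    <attribute name="units" value="K"/>
  </variable>
</netcdf>"""

def global_attributes(ncml_text):
    """Return dataset-level (global) attributes from an NcML document."""
    root = ET.fromstring(ncml_text)
    return {a.get("name"): a.get("value")
            for a in root.findall(f"{NCML_NS}attribute")}

print(global_attributes(sample))
# {'title': 'NRCAN ANUSPLIN daily gridded dataset', 'frequency': 'day'}
```

Only direct children of `<netcdf>` are collected here, so per-variable attributes are left out; the same `findall` pattern on each `<variable>` element would pick those up.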

tlvu commented 4 years ago

Will start with

tlogan2000 commented 4 years ago

This could help as well https://github.com/Ouranosinc/pyPavics/blob/master/pavics/catalog.py

huard commented 4 years ago

Note the existence of https://github.com/django-haystack/pysolr/, which catalog.py probably should have used.
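With pysolr the indexing side could look like the sketch below; the document fields and core URL are assumptions based on the query results later in this thread, and the indexing call needs a running Solr to actually execute:

```python
def make_solr_doc(resourcename, **fields):
    """Build a Solr document dict keyed on the resource path.

    Field names (variable, units, ...) follow the schema seen in the
    query results in this thread; adjust to the real core schema.
    """
    doc = {"id": resourcename, "resourcename": resourcename}
    doc.update(fields)
    return doc

def index_records(solr_url, records):
    """Index a batch of metadata records (requires a reachable Solr core)."""
    import pysolr  # pip install pysolr
    solr = pysolr.Solr(solr_url, timeout=10)
    solr.add(records, commit=True)

# Example (hypothetical dev core URL, matching the host used below):
# index_records("http://lvupavicspublic.ouranos.ca:8983/solr/birdhouse",
#               [make_solr_doc("birdhouse/testdata/x.nc", variable=["tasmax"])])
```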

tlvu commented 4 years ago

Crawler still works after Magpie upgrade

Launch crawler:

$ curl --include "http://lvupavicspublic.ouranos.ca:8086/pywps?service=WPS&request=execute&version=1.0.0&identifier=pavicrawler&storeExecuteResponse=true&status=true&DataInputs="
HTTP/1.1 200 OK                                                                                                                                                             
Date: Wed, 04 Dec 2019 22:53:28 GMT                                                                                                                                         
Server: Apache/2.4.18 (Ubuntu)                                                                                                                                              
Content-Length: 1024                                                                                                                                                        
Vary: Accept-Encoding                                                                                                                                                       
Content-Type: text/xml; charset=utf-8                                                                                                                                       

<?xml version="1.0" encoding="UTF-8"?>                                                                                                                                      
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2019-12-04T22:53:28Z">
        <wps:ProcessAccepted percentCompleted="0">PyWPS Process pavicrawler accepted</wps:ProcessAccepted>
        </wps:Status>
</wps:ExecuteResponse>
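The `statusLocation` attribute in the accepted response is what you poll for the final result. A small sketch of extracting it and detecting completion (the trimmed sample response below is illustrative):

```python
import xml.etree.ElementTree as ET

WPS_NS = "{http://www.opengis.net/wps/1.0.0}"

# Trimmed-down ExecuteResponse, mimicking the output above.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0"
    statusLocation="https://example.invalid/wpsoutputs/catalog/e31a4914.xml">
  <wps:Status creationTime="2019-12-04T22:53:28Z">
    <wps:ProcessAccepted percentCompleted="0">accepted</wps:ProcessAccepted>
  </wps:Status>
</wps:ExecuteResponse>"""

def status_location(execute_response_xml):
    """Pull the statusLocation URL out of a WPS ExecuteResponse."""
    root = ET.fromstring(execute_response_xml)
    return root.get("statusLocation")

def finished(execute_response_xml):
    """True once the response contains a ProcessSucceeded status."""
    root = ET.fromstring(execute_response_xml)
    return root.find(f"{WPS_NS}Status/{WPS_NS}ProcessSucceeded") is not None

# Polling loop (commented out: needs the live WPS endpoint):
# import time
# from urllib.request import urlopen
# url = status_location(initial_response)
# while not finished(urlopen(url).read()):
#     time.sleep(30)
```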

Crawler result:

$ curl --include https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014.xml
HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Wed, 04 Dec 2019 22:53:42 GMT
Content-Type: text/xml
Content-Length: 1489
Last-Modified: Wed, 04 Dec 2019 22:53:33 GMT
Connection: keep-alive
ETag: "5de838ed-5d1"
Accept-Ranges: bytes

<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;amp;service=WPS" statusLocation="https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014.xml">
    <wps:Process wps:processVersion="0.1">
        <ows:Identifier>pavicrawler</ows:Identifier>
        <ows:Title>PAVICS Crawler</ows:Title>
        <ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
        </wps:Process>
    <wps:Status creationTime="2019-12-04T22:53:33Z">
        <wps:ProcessSucceeded>PyWPS Process PAVICS Crawler finished</wps:ProcessSucceeded>
        </wps:Status>
        <wps:ProcessOutputs>
                <wps:Output>
            <ows:Identifier>crawler_result</ows:Identifier>
            <ows:Title>PAVICS Crawler Result</ows:Title>
            <ows:Abstract>Crawler result as a json.</ows:Abstract>
            <wps:Reference href="https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014/solr_result_2019-12-04T22:53:33Z_.json" mimeType="application/json" encoding="" schema=""/>
                </wps:Output>
        </wps:ProcessOutputs>
</wps:ExecuteResponse>

Solr query result (note "numFound":19):

$ curl --include "http://lvupavicspublic.ouranos.ca:8983/solr/birdhouse/select?q=*%3A*&rows=100&wt=json&indent=true" --silent | head -60
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"*:*",
      "indent":"true",
      "rows":"100",
      "wt":"json"}},
  "response":{"numFound":19,"start":0,"docs":[
      {
        "catalog_url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/secure/catalog.xml?dataset=birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
        "replica":false,
        "wms_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
        "cf_standard_name":["air_temperature"],
        "keywords":["air_temperature",
          "mon",
          "application/netcdf",
          "tasmax",
          "thredds",
          "CMIP5",
          "rcp45",
          "MPI-ESM-MR",
          "MPI-M"],
        "last_modified":"2019-12-04T22:25:22Z",
        "frequency":"mon",
        "content_type":"application/netcdf",
        "variable":["tasmax"],
        "dataset_id":"testdata.secure",
        "datetime_max":"2006-12-16T12:00:00Z",
        "subject":"Birdhouse Thredds Catalog",
        "category":"thredds",
        "opendap_url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
        "title":"tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
        "url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
        "variable_long_name":["Daily Maximum Near-Surface Air Temperature"],
        "project":"CMIP5",
        "source":"http://lvupavicspublic.ouranos.ca:8083//twitcher/ows/proxy/thredds/catalog.xml",
        "datetime_min":"2006-01-16T12:00:00Z",
        "experiment":"rcp45",
        "units":["K"],
        "resourcename":"birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
        "abstract":"birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
        "model":"MPI-ESM-MR",
        "latest":true,
        "type":"File",
        "institute":"MPI-M",
        "fileserver_url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
        "id":"d128240999685f4c",
        "_version_":1652031501670285312},
      {
        "catalog_url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/secure/catalog.xml?dataset=birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc",
        "replica":false,
        "wms_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc",
        "cf_standard_name":["air_temperature"],
        "keywords":["air_temperature",

Which matches the 19 .nc files I have on my Thredds:

$ find -type f | grep -v ncml | wc -l
19

$ find -type f | grep -v ncml 
./testdata/flyingpigeon/tmax.fut.nc
./testdata/flyingpigeon/cmip5/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200601-200612.nc
./testdata/flyingpigeon/cmip5/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc
./testdata/flyingpigeon/cmip5/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc
./testdata/flyingpigeon/spatial_analog/indicators_medium.nc
./testdata/flyingpigeon/spatial_analog/dissimilarity.nc
./testdata/flyingpigeon/spatial_analog/indicators_small.nc
./testdata/flyingpigeon/tmax.cur.nc
./testdata/flyingpigeon/cordex/tasmax_EUR-44_MPI-M-MPI-ESM-LR_rcp45_r1i1p1_MPI-CSC-REMO2009_v1_mon_200701-200712.nc
./testdata/flyingpigeon/cordex/tasmax_EUR-44_MPI-M-MPI-ESM-LR_rcp45_r1i1p1_MPI-CSC-REMO2009_v1_mon_200602-200612.nc
./testdata/flyingpigeon/cmip3/tas.sresa2.miub_echo_g.run1.atm.da.nc
./testdata/flyingpigeon/cmip3/pr.sresa2.miub_echo_g.run1.atm.da.nc
./testdata/flyingpigeon/cmip3/tasmax.sresa2.miub_echo_g.run1.atm.da.nc
./testdata/flyingpigeon/cmip3/tas.sresb1.giss_model_e_r.run1.atm.da.nc
./testdata/flyingpigeon/cmip3/tasmin.sresa2.miub_echo_g.run1.atm.da.nc
./testdata/flyingpigeon/tmax.obs.nc
./testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200601-200612.nc
./testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc
./testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc
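The cross-check above (Solr `numFound` vs. `.nc` files on disk) can be scripted; a minimal sketch, assuming the Solr JSON response shape shown above:

```python
import json
from pathlib import Path

def num_found(solr_response_text):
    """Number of documents Solr reports for the query."""
    return json.loads(solr_response_text)["response"]["numFound"]

def nc_file_count(data_root):
    """Count data files under the THREDDS data root, skipping NcML views,
    mirroring `find -type f | grep -v ncml | wc -l`."""
    return sum(1 for p in Path(data_root).rglob("*")
               if p.is_file() and "ncml" not in p.name)

# Usage (against a live Solr): fetch the select?q=*:*&wt=json response,
# then compare num_found(response_text) == nc_file_count("/data/thredds").
```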
tlvu commented 4 years ago

There is almost zero logging in the Catalog and pyPavics! Enabling debug logging did not help much.

tlvu commented 4 years ago

Solr really does not persist the crawl result! If we ever destroy the Solr container (e.g. during a Solr image upgrade) we'll have to crawl again and wait a few days for the process to finish, and in the meantime there will be no catalog to query!!! Wow!

Edit:

tlvu commented 4 years ago

Crawl a single .nc file:

curl --include "http://lvupavicspublic.ouranos.ca:8086/pywps?service=WPS&request=execute&version=1.0.0&identifier=pavicrawler&storeExecuteResponse=true&status=true&DataInputs=target_files=birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc"
tlvu commented 4 years ago

Crawl a single .ncml file:

curl --include "http://lvupavicspublic.ouranos.ca:8086/pywps?service=WPS&request=execute&version=1.0.0&identifier=pavicrawler&storeExecuteResponse=true&status=true&DataInputs=target_files=birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml"
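The two curl invocations above differ only in the `DataInputs=target_files=...` parameter; building that URL programmatically avoids quoting mistakes. A sketch (the base URL is the dev host from this thread):

```python
from urllib.parse import urlencode

def crawler_request_url(wps_base, target_file=None):
    """Build the pavicrawler execute URL; target_file limits the crawl
    to a single file, as in the curl examples above."""
    params = {
        "service": "WPS",
        "request": "execute",
        "version": "1.0.0",
        "identifier": "pavicrawler",
        "storeExecuteResponse": "true",
        "status": "true",
        # Empty DataInputs crawls the whole catalog.
        "DataInputs": f"target_files={target_file}" if target_file else "",
    }
    return f"{wps_base}/pywps?{urlencode(params)}"

url = crawler_request_url("http://lvupavicspublic.ouranos.ca:8086",
                          "birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml")
```

Note that `urlencode` percent-encodes the `=` and `/` inside the `DataInputs` value; the server decodes them back, so the request is equivalent to the literal curl form.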

Solr query result (from empty) containing only the .ncml file:

$ curl --include "http://lvupavicspublic.ouranos.ca:8983/solr/birdhouse/select?q=*%3A*&rows=100&wt=json&indent=true" --silent 
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"*:*",
      "indent":"true",
      "rows":"100",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "catalog_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/thredds/catalog/birdhouse/nrcan/nrcan_canada_daily/catalog.xml?dataset=birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
        "wms_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/nrcan/nrcan_canada_daily/pr-agg.ncml",
        "cf_standard_name":["lwe_precipitation_rate"],
        "keywords":["lwe_precipitation_rate",
          "application/netcdf",
          "pr",
          "thredds"],
        "last_modified":"2019-09-17T17:29:00Z",
        "content_type":"application/netcdf",
        "variable":["pr"],
        "dataset_id":"nrcan.nrcan_canada_daily",
        "datetime_max":"2013-12-31T00:00:00Z",
        "subject":"Birdhouse Thredds Catalog",
        "category":"thredds",
        "opendap_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/thredds/dodsC/birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
        "title":"pr-agg.ncml",
        "url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/thredds/fileServer/birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
        "variable_long_name":["lwe_precipitation_rate"],
        "source":"https://lvupavicspublic-lvu.pagekite.me//twitcher/ows/proxy/thredds/catalog.xml",
        "datetime_min":"1950-01-01T00:00:00Z",
        "replica":false,
        "units":["mm s-1"],
        "resourcename":"birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
        "abstract":"birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
        "latest":true,
        "type":"File",
        "fileserver_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/thredds/fileServer/birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
        "id":"c4278b4707bfe05f",
        "_version_":1652212033388544000}]
  }}
tlvu commented 4 years ago

@tlogan2000 See my initial investigation with Solr above.

Not sure it's useful for what you need (finding all the missing attributes in all our data).

Right now the catalog uses the opendap link to extract metadata. The crawler also does some attribute exclusion and some attribute remapping.

https://github.com/Ouranosinc/pyPavics/blob/a521e2ba89d8c2b346d605675c35390da02a288c/pavics/catalog.py#L34-L39

https://github.com/Ouranosinc/pyPavics/blob/a521e2ba89d8c2b346d605675c35390da02a288c/pavics/catalog.py#L53-L55

Probably safer to parse the XML NcML view (https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml?catalog=https%3A%2F%2Flvupavicspublic-lvu.pagekite.me%2Ftwitcher%2Fows%2Fproxy%2Fthredds%2Fcatalog%2Fbirdhouse%2Fnrcan%2Fnrcan_canada_daily%2Fcatalog.html&dataset=birdhouse%2Fnrcan%2Fnrcan_canada_daily%2Fpr-agg.ncml), which does not filter anything.
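Finding the missing attributes from the unfiltered NcML view could be a small diff against an expected set; the expected attribute names below are hypothetical placeholders, not an agreed-upon convention:

```python
import xml.etree.ElementTree as ET

NCML_NS = "{http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2}"

# Hypothetical set of attributes every dataset should carry; replace
# with whatever the real metadata conventions require.
EXPECTED = {"project", "frequency", "institute", "model", "experiment"}

def missing_attributes(ncml_text):
    """Compare the global attributes of an NcML view against the
    expected set and report what is absent."""
    root = ET.fromstring(ncml_text)
    present = {a.get("name") for a in root.findall(f"{NCML_NS}attribute")}
    return sorted(EXPECTED - present)

# Illustrative NcML fragment with two of the expected attributes.
sample = """<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <attribute name="project" value="NRCAN"/>
  <attribute name="frequency" value="day"/>
</netcdf>"""

print(missing_attributes(sample))
# ['experiment', 'institute', 'model']
```

Running this over every dataset's NcML view and collecting the results would give the missing-attributes list without involving Solr at all.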

We do not have existing code to parse XML and insert into Solr. This is temporary anyway, just to find missing attributes; we don't need to keep the result, and probably don't need Solr as storage. I'd suggest you insert into whatever DB you are most comfortable with, as long as at the end of the day you are able to produce the list of missing attributes.

That said, if you prefer to insert into Solr, I can spend some time figuring out how to insert XML data into Solr, since long term we still need to understand how Solr works.

huard commented 4 years ago

I think the plan was to eventually replace the current DAP based crawler by simply ingesting the NcML service output into solr. This is not strictly necessary at the moment, but something to think about.

huard commented 1 year ago

Not using SOLR anymore.