Closed by tlogan2000 1 year ago
Will start with
This could help as well https://github.com/Ouranosinc/pyPavics/blob/master/pavics/catalog.py
Note the existence of https://github.com/django-haystack/pysolr/, which catalog.py probably should have used.
Crawler still works after Magpie upgrade
Launch crawler:
$ curl --include "http://lvupavicspublic.ouranos.ca:8086/pywps?service=WPS&request=execute&version=1.0.0&identifier=pavicrawler&storeExecuteResponse=true&status=true&DataInputs="
HTTP/1.1 200 OK
Date: Wed, 04 Dec 2019 22:53:28 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Length: 1024
Vary: Accept-Encoding
Content-Type: text/xml; charset=utf-8
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://ww
w.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;service=WPS" statusLocation="https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014.xml">
<wps:Process wps:processVersion="0.1">
<ows:Identifier>pavicrawler</ows:Identifier>
<ows:Title>PAVICS Crawler</ows:Title>
<ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
</wps:Process>
<wps:Status creationTime="2019-12-04T22:53:28Z">
<wps:ProcessAccepted percentCompleted="0">PyWPS Process pavicrawler accepted</wps:ProcessAccepted>
</wps:Status>
</wps:ExecuteResponse>
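Since the Execute call is asynchronous, the response above only says ProcessAccepted; the actual result has to be fetched later from the statusLocation URL. A minimal sketch of pulling that URL and the current status out of the response (the helper name is made up, and the sample document is a trimmed copy of the response above, not something PyWPS provides):

```python
import xml.etree.ElementTree as ET

WPS_NS = "http://www.opengis.net/wps/1.0.0"

def parse_execute_response(xml_text):
    """Return (statusLocation, status element name) from a WPS 1.0.0
    ExecuteResponse, e.g. ("https://...xml", "ProcessAccepted")."""
    root = ET.fromstring(xml_text)
    status = root.find(f"{{{WPS_NS}}}Status")
    # The single child of wps:Status is ProcessAccepted / ProcessSucceeded / ...
    state = status[0].tag.split("}", 1)[1] if status is not None and len(status) else None
    return root.get("statusLocation"), state

# Trimmed version of the ExecuteResponse shown above.
doc = """<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0"
    statusLocation="https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014.xml">
  <wps:Status creationTime="2019-12-04T22:53:28Z">
    <wps:ProcessAccepted percentCompleted="0">accepted</wps:ProcessAccepted>
  </wps:Status>
</wps:ExecuteResponse>"""

loc, state = parse_execute_response(doc)
print(state, loc)
```

Polling loc until the status flips to ProcessSucceeded gives the same document as the "Crawler result" curl below.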
Crawler result:
$ curl --include https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014.xml
HTTP/1.1 200 OK
Server: nginx/1.13.6
Date: Wed, 04 Dec 2019 22:53:42 GMT
Content-Type: text/xml
Content-Length: 1489
Last-Modified: Wed, 04 Dec 2019 22:53:33 GMT
Connection: keep-alive
ETag: "5de838ed-5d1"
Accept-Ranges: bytes
<?xml version="1.0" encoding="UTF-8"?>
<wps:ExecuteResponse xmlns:wps="http://www.opengis.net/wps/1.0.0" xmlns:ows="http://www.opengis.net/ows/1.1" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wps/1.0.0 ../wpsExecute_response.xsd" service="WPS" version="1.0.0" xml:lang="en-US" serviceInstance="http://localhost/wps?request=GetCapabilities&amp;service=WPS" statusLocation="https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014.xml">
<wps:Process wps:processVersion="0.1">
<ows:Identifier>pavicrawler</ows:Identifier>
<ows:Title>PAVICS Crawler</ows:Title>
<ows:Abstract>Crawl thredds server and write metadata to SOLR database.</ows:Abstract>
</wps:Process>
<wps:Status creationTime="2019-12-04T22:53:33Z">
<wps:ProcessSucceeded>PyWPS Process PAVICS Crawler finished</wps:ProcessSucceeded>
</wps:Status>
<wps:ProcessOutputs>
<wps:Output>
<ows:Identifier>crawler_result</ows:Identifier>
<ows:Title>PAVICS Crawler Result</ows:Title>
<ows:Abstract>Crawler result as a json.</ows:Abstract>
<wps:Reference href="https://lvupavicspublic-lvu.pagekite.me/wpsoutputs/catalog/e31a4914-16e8-11ea-aab9-0242ac130014/solr_result_2019-12-04T22:53:33Z_.json" mimeType="application/json" encoding="" schema=""/>
</wps:Output>
</wps:ProcessOutputs>
</wps:ExecuteResponse>
Solr query result (note "numFound":19):
$ curl --include "http://lvupavicspublic.ouranos.ca:8983/solr/birdhouse/select?q=*%3A*&rows=100&wt=json&indent=true" --silent | head -60
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"*:*",
"indent":"true",
"rows":"100",
"wt":"json"}},
"response":{"numFound":19,"start":0,"docs":[
{
"catalog_url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/secure/catalog.xml?dataset=birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
"replica":false,
"wms_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
"cf_standard_name":["air_temperature"],
"keywords":["air_temperature",
"mon",
"application/netcdf",
"tasmax",
"thredds",
"CMIP5",
"rcp45",
"MPI-ESM-MR",
"MPI-M"],
"last_modified":"2019-12-04T22:25:22Z",
"frequency":"mon",
"content_type":"application/netcdf",
"variable":["tasmax"],
"dataset_id":"testdata.secure",
"datetime_max":"2006-12-16T12:00:00Z",
"subject":"Birdhouse Thredds Catalog",
"category":"thredds",
"opendap_url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
"title":"tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
"url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
"variable_long_name":["Daily Maximum Near-Surface Air Temperature"],
"project":"CMIP5",
"source":"http://lvupavicspublic.ouranos.ca:8083//twitcher/ows/proxy/thredds/catalog.xml",
"datetime_min":"2006-01-16T12:00:00Z",
"experiment":"rcp45",
"units":["K"],
"resourcename":"birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
"abstract":"birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
"model":"MPI-ESM-MR",
"latest":true,
"type":"File",
"institute":"MPI-M",
"fileserver_url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc",
"id":"d128240999685f4c",
"_version_":1652031501670285312},
{
"catalog_url":"http://lvupavicspublic.ouranos.ca:8083/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/secure/catalog.xml?dataset=birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc",
"replica":false,
"wms_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc",
"cf_standard_name":["air_temperature"],
"keywords":["air_temperature",
Which matches the 19 .nc files I have on my Thredds:
$ find -type f | grep -v ncml | wc -l
19
$ find -type f | grep -v ncml
./testdata/flyingpigeon/tmax.fut.nc
./testdata/flyingpigeon/cmip5/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200601-200612.nc
./testdata/flyingpigeon/cmip5/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc
./testdata/flyingpigeon/cmip5/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc
./testdata/flyingpigeon/spatial_analog/indicators_medium.nc
./testdata/flyingpigeon/spatial_analog/dissimilarity.nc
./testdata/flyingpigeon/spatial_analog/indicators_small.nc
./testdata/flyingpigeon/tmax.cur.nc
./testdata/flyingpigeon/cordex/tasmax_EUR-44_MPI-M-MPI-ESM-LR_rcp45_r1i1p1_MPI-CSC-REMO2009_v1_mon_200701-200712.nc
./testdata/flyingpigeon/cordex/tasmax_EUR-44_MPI-M-MPI-ESM-LR_rcp45_r1i1p1_MPI-CSC-REMO2009_v1_mon_200602-200612.nc
./testdata/flyingpigeon/cmip3/tas.sresa2.miub_echo_g.run1.atm.da.nc
./testdata/flyingpigeon/cmip3/pr.sresa2.miub_echo_g.run1.atm.da.nc
./testdata/flyingpigeon/cmip3/tasmax.sresa2.miub_echo_g.run1.atm.da.nc
./testdata/flyingpigeon/cmip3/tas.sresb1.giss_model_e_r.run1.atm.da.nc
./testdata/flyingpigeon/cmip3/tasmin.sresa2.miub_echo_g.run1.atm.da.nc
./testdata/flyingpigeon/tmax.obs.nc
./testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200601-200612.nc
./testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc
./testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r1i1p1_200701-200712.nc
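The numFound check can also be scripted instead of eyeballing the curl output. A small sketch, assuming the core name birdhouse from the query URL above; num_found only parses a response body, so the sample here is a stub rather than a live query:

```python
import json
from urllib.parse import urlencode

def solr_select_url(base, query="*:*", rows=100):
    """Build the same /solr/birdhouse/select URL as the curl command above."""
    return f"{base}/solr/birdhouse/select?" + urlencode(
        {"q": query, "rows": rows, "wt": "json", "indent": "true"})

def num_found(body):
    """Number of matching documents reported in a Solr JSON response."""
    return json.loads(body)["response"]["numFound"]

# Stub response body; a real one comes from fetching solr_select_url(...).
sample = '{"responseHeader":{"status":0},"response":{"numFound":19,"start":0,"docs":[]}}'
print(num_found(sample))  # 19
```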
There is almost no logging in the Catalog and pyPavics! Enabling debug logging does not help much.
Solr really does not persist the crawl result! If we ever destroy the Solr container (e.g. during a Solr image upgrade), we'll have to crawl again and wait a few days for the process to finish, and in the meantime there will be no catalog to query!!! Wow!
Edit:
Crawl a single .nc file:
curl --include "http://lvupavicspublic.ouranos.ca:8086/pywps?service=WPS&request=execute&version=1.0.0&identifier=pavicrawler&storeExecuteResponse=true&status=true&DataInputs=target_files=birdhouse/testdata/secure/tasmax_Amon_MPI-ESM-MR_rcp45_r2i1p1_200601-200612.nc"
Crawl a single .ncml file:
curl --include "http://lvupavicspublic.ouranos.ca:8086/pywps?service=WPS&request=execute&version=1.0.0&identifier=pavicrawler&storeExecuteResponse=true&status=true&DataInputs=target_files=birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml"
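The two Execute URLs above differ only in their DataInputs. A hypothetical helper that builds them (host, port, and the pavicrawler identifier come from this transcript; the function itself and treating target_files as a single-file input are assumptions on my part):

```python
from urllib.parse import urlencode

def crawler_execute_url(host, target_file=None):
    """Build the WPS Execute URL used by the curl commands above.

    An empty DataInputs crawls the whole THREDDS catalog; a
    target_files input restricts the crawl to one file."""
    params = {
        "service": "WPS",
        "request": "execute",
        "version": "1.0.0",
        "identifier": "pavicrawler",
        "storeExecuteResponse": "true",
        "status": "true",
        "DataInputs": f"target_files={target_file}" if target_file else "",
    }
    return f"{host}/pywps?{urlencode(params)}"

print(crawler_execute_url(
    "http://lvupavicspublic.ouranos.ca:8086",
    "birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml"))
```

Note that urlencode percent-encodes the "=" inside DataInputs, which is also valid; the curl commands above pass it raw.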
Solr query result (starting from an empty index) containing only the .ncml file:
$ curl --include "http://lvupavicspublic.ouranos.ca:8983/solr/birdhouse/select?q=*%3A*&rows=100&wt=json&indent=true" --silent
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"q":"*:*",
"indent":"true",
"rows":"100",
"wt":"json"}},
"response":{"numFound":1,"start":0,"docs":[
{
"catalog_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/thredds/catalog/birdhouse/nrcan/nrcan_canada_daily/catalog.xml?dataset=birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
"wms_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/ncWMS2/wms?SERVICE=WMS&REQUEST=GetCapabilities&VERSION=1.3.0&DATASET=outputs/nrcan/nrcan_canada_daily/pr-agg.ncml",
"cf_standard_name":["lwe_precipitation_rate"],
"keywords":["lwe_precipitation_rate",
"application/netcdf",
"pr",
"thredds"],
"last_modified":"2019-09-17T17:29:00Z",
"content_type":"application/netcdf",
"variable":["pr"],
"dataset_id":"nrcan.nrcan_canada_daily",
"datetime_max":"2013-12-31T00:00:00Z",
"subject":"Birdhouse Thredds Catalog",
"category":"thredds",
"opendap_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/thredds/dodsC/birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
"title":"pr-agg.ncml",
"url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/thredds/fileServer/birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
"variable_long_name":["lwe_precipitation_rate"],
"source":"https://lvupavicspublic-lvu.pagekite.me//twitcher/ows/proxy/thredds/catalog.xml",
"datetime_min":"1950-01-01T00:00:00Z",
"replica":false,
"units":["mm s-1"],
"resourcename":"birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
"abstract":"birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
"latest":true,
"type":"File",
"fileserver_url":"https://lvupavicspublic-lvu.pagekite.me/twitcher/ows/proxy/thredds/fileServer/birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml",
"id":"c4278b4707bfe05f",
"_version_":1652212033388544000}]
}}
@tlogan2000 See above my initial investigation with Solr.
Not sure it's useful for what you need (finding all the missing attributes in all our data).
Right now the catalog is using the OPeNDAP link to extract metadata. The crawler also does some attribute exclusion and some attribute remapping.
It is probably safer to parse the XML NcML view (https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/nrcan/nrcan_canada_daily/pr-agg.ncml?catalog=https%3A%2F%2Flvupavicspublic-lvu.pagekite.me%2Ftwitcher%2Fows%2Fproxy%2Fthredds%2Fcatalog%2Fbirdhouse%2Fnrcan%2Fnrcan_canada_daily%2Fcatalog.html&dataset=birdhouse%2Fnrcan%2Fnrcan_canada_daily%2Fpr-agg.ncml), which does not filter anything.
We do not have existing code to parse XML and insert into Solr. This is temporary anyway, just to find the missing attributes; we don't need to keep this result, and probably do not need Solr as storage. I'd suggest you insert into whatever DB you are most comfortable with, as long as at the end of the day you are able to produce the list of missing attributes.
That said, if you prefer to insert into Solr, I can spend some time finding out how to insert XML data into Solr, since long term we still need to understand how Solr works.
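If we do go the NcML route, pulling the dataset-level attributes out of the NcML XML only needs the standard library. A toy sketch (the fragment and the required-attribute list are made up for illustration; the namespace is the real NcML 2.2 one):

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

def global_attributes(ncml_text):
    """Collect the dataset-level <attribute> elements of an NcML document
    into a dict, so missing attributes can be diffed against a required set."""
    root = ET.fromstring(ncml_text)
    return {a.get("name"): a.get("value")
            for a in root.findall(f"{{{NCML_NS}}}attribute")}

# Toy fragment; the real one comes from the THREDDS NcML service URL above.
doc = """<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <attribute name="Conventions" value="CF-1.5"/>
  <attribute name="title" value="pr-agg"/>
</netcdf>"""

attrs = global_attributes(doc)
required = {"Conventions", "title", "institution", "source"}
print(sorted(required - attrs.keys()))  # attributes missing from this dataset
```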
I think the plan was to eventually replace the current DAP-based crawler by simply ingesting the NcML service output into Solr. This is not strictly necessary at the moment, but something to think about.
Not using SOLR anymore.
More or less linked to https://github.com/Ouranosinc/PAVICS-DataCatalog/issues/52. But looking specifically at importing into the SOLR database, an example NcML service:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/nrcan/nrcan_canada_daily/tasmin/nrcan_canada_daily_tasmin_2012.nc?catalog=https%3A%2F%2Fpavics.ouranos.ca%2Ftwitcher%2Fows%2Fproxy%2Fthredds%2Fcatalog%2Fbirdhouse%2Fnrcan%2Fnrcan_canada_daily%2Ftasmin%2Fcatalog.html&dataset=birdhouse%2Fnrcan%2Fnrcan_canada_daily%2Ftasmin%2Fnrcan_canada_daily_tasmin_2012.nc