ESGF / esg-publisher

ESGF Publisher
http://esg-publisher.readthedocs.org/

Several versions into one mapfile leads to a "silent failed" publication on indexnode #20

Open glevava opened 8 years ago

glevava commented 8 years ago

It appears that concatenating several mapfiles for different versions of the same dataset leads to a "silent failed" publication on the index node.

For instance, here are two CORDEX mapfiles:

> ls mapfiles/cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl.v201*
-rw-r--r--. 1 levavasg 2.0K Dec 17 11:36 mapfiles/cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl.v20140301
-rw-r--r--. 1 levavasg 2.9K Dec 17 11:36 mapfiles/cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl.v20150515

We concatenate both mapfiles to publish them in bulk. The sort ensures that the versions are published in chronological order, as enforced by the new publisher (ESGF 2.x):

> cat mapfiles/cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl.v201* | sort >> mapfile_test.txt
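
To check what a concatenated mapfile actually contains, its lines can be grouped by dataset version. The following is a minimal sketch, not part of the publisher. It assumes each mapfile line starts with a `dataset_id#version` field followed by `|`-separated fields; the `group_by_version` helper and the shortened sample lines are hypothetical.

```python
from collections import defaultdict


def group_by_version(lines):
    """Group mapfile lines by (dataset_id, version).

    Assumption: each line looks like "dataset_id#version | path | size | ...".
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        first_field = line.split("|", 1)[0].strip()
        dataset_id, _, version = first_field.partition("#")
        groups[(dataset_id, version)].append(line)
    # Sorting the keys puts date-stamped versions in chronological order.
    return dict(sorted(groups.items()))


# Hypothetical, shortened sample lines (not taken from the real mapfiles).
sample = [
    "cordex.example.evspsbl#20150515 | /data/v20150515/b.nc | 2048",
    "cordex.example.evspsbl#20140301 | /data/v20140301/a.nc | 1024",
]

for (dataset_id, version), entries in group_by_version(sample).items():
    print(dataset_id, version, len(entries))
```

If the grouping reports more than one version for the same dataset ID, the mapfile mixes versions in the way described in this issue.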

Publication on the data node succeeds for both versions without any warnings or errors:

> esgpublish -i /esg/config/esgcet/esg.ini --thredds --service fileservice --map mapfile_test.txt 
INFO       2016-02-09 17:41:00,513 Using project name = cordex
INFO       2016-02-09 17:41:00,682 Creating dataset: cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl
INFO       2016-02-09 17:41:00,683 Scanning [...]
INFO       2016-02-09 17:41:00,850 New dataset version = 20140301
INFO       2016-02-09 17:41:00,850 Adding file info to database
INFO       2016-02-09 17:41:01,007 Aggregating variables
INFO       2016-02-09 17:41:01,360 Scanning [...]
INFO       2016-02-09 17:41:01,674 New dataset version = 20150515
INFO       2016-02-09 17:41:01,954 Adding file info to database
INFO       2016-02-09 17:41:01,970 Aggregating variables
INFO       2016-02-09 17:41:02,343 Writing THREDDS catalog /esg/content/thredds/esgcet/1/cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl.v20140301.xml
INFO       2016-02-09 17:41:02,387 Writing THREDDS catalog /esg/content/thredds/esgcet/1/cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl.v20150515.xml
INFO       2016-02-09 17:41:02,422 Writing THREDDS ESG master catalog /esg/content/thredds/esgcet/catalog.xml
INFO       2016-02-09 17:41:02,425 Reinitializing THREDDS server

Both XML catalogs are generated and files and aggregations are accessible through all protocols (OpenDAP, HTTP).

Publication on the index node also completes without any warnings or errors:

> esgpublish -i /esg/config/esgcet/esg.ini --publish --service fileservice --map mapfile_test.txt 
INFO       2016-02-09 17:41:02,580 Publishing: cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl
INFO       2016-02-09 17:41:03,111   Result: SUCCESSFUL
INFO       2016-02-09 17:41:03,112 Publishing: cordex.output.EUR-11.IPSL-INERIS.IPSL-IPSL-CM5A-MR.historical.r1i1p1.IPSL-INERIS-WRF331F.v1.mon.evspsbl
INFO       2016-02-09 17:41:03,619   Result: SUCCESSFUL

Nevertheless, only the latest version (v20150515 in our case) is published on the index node. There are two publication instances for the same dataset, but it seems that the new publisher only takes the latest version into account for each publication instance on the index node.

Is this behavior expected from the new publisher (included in ESGF 2.x) or not?

sashakames commented 8 years ago

I'm not surprised by this behavior: the previous practice was not to publish multiple versions in bulk but to publish each dataset version as it is generated (so the next one would come a bit later). The obvious solution is to publish each version to PostgreSQL, THREDDS and Solr in succession before publishing the next. If this is too much of a hassle, we'd need to determine whether the Hessian service can take a specific THREDDS version as a parameter and publish that, or whether it will always stick with the most recent one. Then check whether the publisher supports that too.
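
The one-version-at-a-time approach could be scripted roughly as follows. This is only a sketch under assumptions: the mapfile layout (`dataset_id#version | path | ...`), the helper names `split_mapfile` and `publication_commands`, and the output file naming are hypothetical. The `esgpublish` flags are the ones used in the commands earlier in this issue. The sketch only builds the command lines and does not run them.

```python
import shlex
import tempfile
from collections import defaultdict
from pathlib import Path

INI = "/esg/config/esgcet/esg.ini"  # path taken from the commands in this issue


def split_mapfile(mapfile, out_dir):
    """Write one mapfile per dataset version; return the paths in version order."""
    groups = defaultdict(list)
    for line in Path(mapfile).read_text().splitlines():
        if line.strip():
            key = line.split("|", 1)[0].strip()  # assumed "dataset_id#version"
            groups[key].append(line)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for key in sorted(groups):  # date-stamped versions sort chronologically
        path = out_dir / (key.replace("#", ".v") + ".map")
        path.write_text("\n".join(groups[key]) + "\n")
        paths.append(path)
    return paths


def publication_commands(map_paths):
    """For each version, the THREDDS step followed by the index publication step."""
    commands = []
    for path in map_paths:
        base = ["esgpublish", "-i", INI, "--service", "fileservice", "--map", str(path)]
        commands.append(base + ["--thredds"])
        commands.append(base + ["--publish"])
    return commands


# Demo on a throwaway directory with hypothetical, shortened mapfile lines.
with tempfile.TemporaryDirectory() as tmp:
    bulk = Path(tmp) / "mapfile_test.txt"
    bulk.write_text(
        "cordex.example.evspsbl#20140301 | /data/v20140301/a.nc | 1024\n"
        "cordex.example.evspsbl#20150515 | /data/v20150515/b.nc | 2048\n"
    )
    commands = publication_commands(split_mapfile(bulk, Path(tmp) / "split"))
    for command in commands:
        print(shlex.join(command))
```

Running the printed commands in order would publish v20140301 to THREDDS and the index before v20150515 is touched, which is the succession suggested above.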

sashakames commented 4 years ago

I want to keep this in mind for the new publisher with respect to versioning. A requirement is that it should handle the publication of an old version of a dataset if both maps are required, but the new version must be detected so that the old one can be flagged latest=false.
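
The versioning requirement can be illustrated with a small sketch. It is not the publisher's implementation: the `latest_flags` helper is hypothetical, and it assumes versions are date-stamped strings such as 20140301 that can be compared as integers.

```python
def latest_flags(published_versions, incoming_version):
    """Return {version: latest?} after adding incoming_version.

    Only the newest version keeps latest=True, even when the incoming
    version is older than one that is already published.
    """
    versions = set(published_versions) | {incoming_version}
    newest = max(versions, key=int)  # assumes numeric, date-stamped versions
    return {v: v == newest for v in sorted(versions, key=int)}


# Publishing the old version after the new one must not steal the latest flag.
print(latest_flags(["20150515"], "20140301"))
```

With this rule the order in which the two mapfiles arrive does not matter: v20140301 always ends up with latest=false once v20150515 is known.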

sashakames commented 4 years ago

We haven't considered the use case of multiple datasets per mapfile for bulk publishing, but that is likely going to be a workflow issue rather than a component issue.