Ouranosinc / Magpie

AuthN/AuthZ services
https://pavics-magpie.readthedocs.io
Apache License 2.0
1 stars 5 forks source link

[BUG] Intermittent UnauthorizedException for WPS processes behind Magpie 3.8.0 and matching Twitcher #433

Closed tlvu closed 3 years ago

tlvu commented 3 years ago

Describe the bug Running Jenkins notebook test suite against Ouranos production PAVICS stack we randomly get

11:22:04  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/util.py in openURL(url_base, data, method, cookies, username, password, timeout, headers, verify, cert, auth)
11:22:04      210     if req.status_code in [400, 401]:
11:22:04  --> 211         raise ServiceException(req.text)
11:22:04      212 
11:22:04  
11:22:04  ServiceException: <?xml version="1.0" encoding="utf-8"?>
11:22:04  <ExceptionReport version="1.0.0"
11:22:04      xmlns="http://www.opengis.net/ows/1.1"
11:22:04      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
11:22:04      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
11:22:04      <Exception exceptionCode="NoApplicableCode" locator="AccessForbidden">
11:22:04          <ExceptionText>Not authorized to access this resource. User does not meet required permissions.</ExceptionText>
11:22:04      </Exception>
11:22:04  </ExceptionReport>
11:22:04  
11:22:04  During handling of the above exception, another exception occurred:
11:22:04  
11:22:04  UnauthorizedException                     Traceback (most recent call last)
11:22:04  <ipython-input-5-a503678973ad> in <module>
11:22:04       30     elevation=basin_elevation,
11:22:04       31     area=basin_area,
11:22:04  ---> 32     name="test_basin",
11:22:04       33 )
11:22:04       34 
11:22:04  
11:22:04  </opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py-613> in climatology_esp(self, forecast_date, ts, model_name, params, forecast_duration, name, area, latitude, longitude, elevation, nc_spec, rvc, rain_snow_fraction, output_formats)
11:22:04  
11:22:04  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in _execute(self, pid, **kwargs)
11:22:04      357             if "AccessForbidden" in str(e):
11:22:04      358                 raise UnauthorizedException(
11:22:04  --> 359                     "You are not authorized to do a request of type: Execute"
11:22:04      360                 )
11:22:04      361             raise
11:22:04  
11:22:04  UnauthorizedException: You are not authorized to do a request of type: Execute

and

11:22:04  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/util.py in openURL(url_base, data, method, cookies, username, password, timeout, headers, verify, cert, auth)
11:22:04      210     if req.status_code in [400, 401]:
11:22:04  --> 211         raise ServiceException(req.text)
11:22:04      212 
11:22:04  
11:22:04  ServiceException: <?xml version="1.0" encoding="utf-8"?>
11:22:04  <ExceptionReport version="1.0.0"
11:22:04      xmlns="http://www.opengis.net/ows/1.1"
11:22:04      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
11:22:04      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
11:22:04      <Exception exceptionCode="NoApplicableCode" locator="AccessForbidden">
11:22:04          <ExceptionText>Not authorized to access this resource. User does not meet required permissions.</ExceptionText>
11:22:04      </Exception>
11:22:04  </ExceptionReport>
11:22:04  
11:22:04  During handling of the above exception, another exception occurred:
11:22:04  
11:22:04  UnauthorizedException                     Traceback (most recent call last)
11:22:04  <ipython-input-2-326fb088ffcd> in <module>
11:22:04        4 raven_url = os.environ.get("WPS_URL", f"{pavics_url}/twitcher/ows/proxy/raven/wps")
11:22:04        5 
11:22:04  ----> 6 raven = birdy.WPSClient(raven_url)
11:22:04  
11:22:04  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in __init__(self, url, processes, converters, username, password, headers, auth, verify, cert, verbose, progress, version, caps_xml, desc_xml, language)
11:22:04      120             if "AccessForbidden" in str(e):
11:22:04      121                 raise UnauthorizedException(
11:22:04  --> 122                     "You are not authorized to do a request of type: GetCapabilities"
11:22:04      123                 )
11:22:04      124             raise
11:22:04  
11:22:04  UnauthorizedException: You are not authorized to do a request of type: GetCapabilities

Full run in http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1047/console

Run the same test suite again, and the error is gone.

Note on Ouranos production server, all WPS services have full public access on anonymous group.

No error seen in docker logs magpie and docker logs twitcher.

To Reproduce Steps to reproduce the behavior:

  1. Deploy birdhouse-deploy PAVICS docker-compose stack at tag https://github.com/bird-house/birdhouse-deploy/tree/1.13.0
  2. Request Jenkins build with only Raven notebooks and with --nbval-lax activated, see Build request on CRIM side https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/402/parameters/

Expected behavior Should not have those UnauthorizedException

tlvu commented 3 years ago

Reproduced (after a few tries since this is intermittent) with CRIM's Jenkins against Hirondelle from CRIM as well:

https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/407/console

09:05:48  _ raven-master/docs/source/notebooks/Subset_climate_data_over_watershed.ipynb::Cell 1 _
09:05:48  Notebook cell execution failed
09:05:48  Cell 1: Cell execution caused an exception
09:05:48  
09:05:48  Input:
09:05:48  # Set the links to the servers. 
09:05:48  # Note that if Finch is a remote server, Raven needs to be accessible on the next because some cells 
09:05:48  # below use the output from Raven processes to feed into Finch. 
09:05:48  
09:05:48  url_finch = os.environ.get(
09:05:48      "FINCH_WPS_URL", "https://hirondelle.crim.ca/twitcher/ows/proxy/finch/wps"
09:05:48  )
09:05:48  
09:05:48  url_raven = os.environ.get(
09:05:48      "WPS_URL", "https://hirondelle.crim.ca/twitcher/ows/proxy/raven/wps"
09:05:48  )
09:05:48  
09:05:48  # Establish client connexions to the remote servers
09:05:48  finch = birdy.WPSClient(url_finch)  
09:05:48  raven = birdy.WPSClient(url_raven)
09:05:48  
09:05:48  Traceback:
09:05:48  
09:05:48  ---------------------------------------------------------------------------
09:05:48  ServiceException                          Traceback (most recent call last)
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in __init__(self, url, processes, converters, username, password, headers, auth, verify, cert, verbose, progress, version, caps_xml, desc_xml, language)
09:05:48      117         try:
09:05:48  --> 118             self._wps.getcapabilities(xml=caps_xml)
09:05:48      119         except ServiceException as e:
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in getcapabilities(self, xml)
09:05:48      274             self._capabilities = reader.readFromUrl(
09:05:48  --> 275                 self.url, headers=self.headers)
09:05:48      276 
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in readFromUrl(self, url, username, password, headers, verify, cert)
09:05:48      554                                  username=username, password=password,
09:05:48  --> 555                                  headers=headers, verify=verify, cert=cert)
09:05:48      556 
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in _readFromUrl(self, url, data, timeout, method, username, password, headers, verify, cert)
09:05:48      508                         1], method='Get', username=self.auth.username, password=self.auth.password,
09:05:48  --> 509                         headers=headers, verify=self.auth.verify, cert=self.auth.cert, timeout=self.timeout)
09:05:48      510             return etree.fromstring(u.read())
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/util.py in openURL(url_base, data, method, cookies, username, password, timeout, headers, verify, cert, auth)
09:05:48      210     if req.status_code in [400, 401]:
09:05:48  --> 211         raise ServiceException(req.text)
09:05:48      212 
09:05:48  
09:05:48  ServiceException: <?xml version="1.0" encoding="utf-8"?>
09:05:48  <ExceptionReport version="1.0.0"
09:05:48      xmlns="http://www.opengis.net/ows/1.1"
09:05:48      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
09:05:48      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
09:05:48      <Exception exceptionCode="NoApplicableCode" locator="AccessForbidden">
09:05:48          <ExceptionText>Not authorized to access this resource. User does not meet required permissions.</ExceptionText>
09:05:48      </Exception>
09:05:48  </ExceptionReport>
09:05:48  
09:05:48  During handling of the above exception, another exception occurred:
09:05:48  
09:05:48  UnauthorizedException                     Traceback (most recent call last)
09:05:48  <ipython-input-2-6e0bd7322976> in <module>
09:05:48       13 # Establish client connexions to the remote servers
09:05:48       14 finch = birdy.WPSClient(url_finch)
09:05:48  ---> 15 raven = birdy.WPSClient(url_raven)
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in __init__(self, url, processes, converters, username, password, headers, auth, verify, cert, verbose, progress, version, caps_xml, desc_xml, language)
09:05:48      120             if "AccessForbidden" in str(e):
09:05:48      121                 raise UnauthorizedException(
09:05:48  --> 122                     "You are not authorized to do a request of type: GetCapabilities"
09:05:48      123                 )
09:05:48      124             raise
09:05:48  
09:05:48  UnauthorizedException: You are not authorized to do a request of type: GetCapabilities
fmigneault commented 3 years ago

Latest run seems to try accessing this file: "https://hirondelle.crim.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/raven/raven-gr4j-cemaneige/Salmon-River-Near-Prince-George_meteo_daily.nc" but I don't see that when I go through Thredds' browser

tlvu commented 3 years ago

Latest run seems to try accessing this file: "https://hirondelle.crim.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/raven/raven-gr4j-cemaneige/Salmon-River-Near-Prince-George_meteo_daily.nc" but I don't see that when I go through Thredds' browser

Oh right, that data and all Raven testdata can be deployed by activating https://github.com/bird-house/birdhouse-deploy/blob/d90765acabe248e65c4899929fbe37a9e8661643/birdhouse/env.local.example#L180-L184 on Hirondelle.

Just to be clear, missing that testdata file is un-related to the UnauthorizedException: You are not authorized to do a request of type: GetCapabilities intermittent error. We have this same intermittent error on our prod and our prod has all the required testdata.

fmigneault commented 3 years ago

For reference from https://github.com/Ouranosinc/PAVICS-e2e-workflow-tests/pull/74#issuecomment-849161468 Possible lead:

Below is docker logs twitcher on Medus matching the timestamp for the Flyingpigeon 500 failure in https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/stress-test/10/console. There was a hit on other services during the 100 hit on Flyingpigeon. I suspect the other hits are from the Canarie-api monitoring that hits all the WPS every minutes. Otherwise, this is a staging server, much less activity than the production server.

There was no errors in Magpie logs.

2021-05-26 21:22:02,150 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'                 
2021-05-26 21:22:02,174 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,397 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,403 INFO  [TWITCHER:132][waitress-2] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'                 
2021-05-26 21:22:02,430 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,652 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,659 INFO  [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'                 
2021-05-26 21:22:02,691 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,737 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,743 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/ncWMS2/wms'                       
2021-05-26 21:22:02,761 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,898 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,901 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,907 WARNI [TWITCHER:100][waitress-1] Using cached service                                                                         
2021-05-26 21:22:02,908 ERROR [TWITCHER:33][waitress-1] unknown error                                                                                 
Traceback (most recent call last):                                                                                                                    
  File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween                                                               
    security.check_request(request)                                                                                                                   
  File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request                                                        
    permission_requested = service_impl.permission_requested()                                                                                        
  File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested                                                                  
    raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))                                                    
NotImplementedError: Undefined 'Permission' from 'request' parameter: none                                                                            
2021-05-26 21:22:02,921 INFO  [TWITCHER:132][waitress-2] 'None' request 'browse' permission on '/ows/proxy/thredds/catalog.html'
2021-05-26 21:22:02,941 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,009 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,016 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:03,040 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,225 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,231 INFO  [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:03,256 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,284 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
tlvu commented 3 years ago

FYI, we also started to see intermittent Authorization failure for Thredds data as well, in our nightly prod run http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1135/console.

OSError: [Errno -78] NetCDF: Authorization failure: b'https://boreas.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml'

The error above is not supposed to be able to happen since everything is public on our Thredds for the moment.

Will try to find logs but since this is a production server, there are many activities, not sure I can pinpoint the good logs.

tlvu commented 3 years ago

Matching logs for https://github.com/Ouranosinc/Magpie/issues/433#issuecomment-879272044

2021-07-13 00:39:04,607 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/hummingbird'
2021-07-13 00:39:04,627 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,643 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,647 INFO  [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,668 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,687 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,692 INFO  [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,714 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,744 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,746 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,750 WARNI [TWITCHER:100][waitress-1] Using cached service
2021-07-13 00:39:04,750 ERROR [TWITCHER:33][waitress-1] unknown error
Traceback (most recent call last):
  File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween
    security.check_request(request)
  File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request
    permission_requested = service_impl.permission_requested()
  File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested
    raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))
NotImplementedError: Undefined 'Permission' from 'request' parameter: none
2021-07-13 00:39:04,751 INFO  [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,773 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,783 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,789 INFO  [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/wms/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc'
2021-07-13 00:39:04,794 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,807 INFO  [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
fmigneault commented 3 years ago

@tlvu Can you try adding these settings to both Magpie and Twitcher INI configs and let me know if the errors can be replicated again?

cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false
tlvu commented 3 years ago

@tlvu Can you try adding these settings to both Magpie and Twitcher INI configs and let me know if the errors can be replicated again?

cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false

@fmigneault you want to add these to prod even if the PR about cache config https://github.com/bird-house/birdhouse-deploy/pull/174 is not merged yet? You suspect portions of the caching mechanism has been enabled?

tlvu commented 3 years ago

@tlvu Can you try adding these settings to both Magpie and Twitcher INI configs and let me know if the errors can be replicated again?

cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false

@fmigneault done: https://github.com/Ouranosinc/birdhouse-deploy/commit/5c23e8b9edd115dd82e3665cb0f40dd36c408f44 (can you double check I've inserted into the proper location in those 2 config files). I've done ./pavics-compose.sh restart twitcher magpie for these changes to take effect. Let me know if restart is enough for I have to destroy and recreate those containers.

tlvu commented 3 years ago

@fmigneault FYI, I ran Jenkins on our prod after those cache config you proposed in https://github.com/Ouranosinc/Magpie/issues/433#issuecomment-879389809 are applied, the stress-test.ipynb still fail (https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/fix-stress-tests.ipynb-not-showing-error-in-jenkins/6/console) and I still have the following stack trace in Twitcher logs:

2021-07-13 22:16:18,506 INFO  [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/F
inal/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dds'                            
2021-07-13 22:16:18,543 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,623 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,632 INFO  [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/F
inal/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.das'                            
2021-07-13 22:16:18,659 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,769 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,772 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,778 WARNI [TWITCHER:100][waitress-3] Using cached service                                                                         
2021-07-13 22:16:18,779 ERROR [TWITCHER:33][waitress-3] unknown error                                                                                 
Traceback (most recent call last):                                                                                                                    
  File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween                                                               
    security.check_request(request)                                                                                                                   
  File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request                                                        
    permission_requested = service_impl.permission_requested()                                                                                        
  File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested                                                                  
    raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))                                                    
NotImplementedError: Undefined 'Permission' from 'request' parameter: none                                                                            
2021-07-13 22:16:18,780 INFO  [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/F
inal/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dds'                            
2021-07-13 22:16:18,813 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,853 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,859 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'                        
2021-07-13 22:16:18,881 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,896 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,901 INFO  [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/F
inal/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dods'    
fmigneault commented 3 years ago

@tlvu The settings look all right, but the line

2021-07-13 22:16:18,778 WARNI [TWITCHER:100][waitress-3] Using cached service

indicates that somehow caching is still enabled.

Error

 raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))     

is a side effect of https://github.com/Ouranosinc/Magpie/pull/439 This error can occur only when caching is active, otherwise the request is parsed from scratch each time.

I'm fine with having the cache settings disabled until bird-house/birdhouse-deploy#174 is working with all tests. I'm not sure why instances see them as enabled though... toggling those worked when I was creating the stress-test script. It is pointing to a server that applied them?

tlvu commented 3 years ago

@fmigneault I've destroyed and recreated twitcher and magpie container on our production to ensure the config that force disable the cache are active. Still have these errors in the twitcher logs when running stress-test.ipynb (https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/497/console)

2021-07-14 23:03:30,178 INFO  [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/station_obs/ECCC_AHCCD
_gen3_temperature.ncml.dods'                                                                                                                          
2021-07-14 23:03:30,201 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,342 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,348 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'                        
2021-07-14 23:03:30,370 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       2021-07-14 23:03:30,623 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,627 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,630 WARNI [TWITCHER:100][waitress-3] Using cached service                                                                         
2021-07-14 23:03:30,630 ERROR [TWITCHER:33][waitress-3] unknown error                                                                                 
Traceback (most recent call last):                                                                                                                    
  File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween                                                               
    security.check_request(request)                                                                                                                   
  File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request                                      
    permission_requested = service_impl.permission_requested()                                                                                        
  File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested                                                
    raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))                                                    
NotImplementedError: Undefined 'Permission' from 'request' parameter: none                                                                            
2021-07-14 23:03:30,635 INFO  [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/station_obs/ECCC_AHCC$_gen3_temperature.ncml.dods'                                                                                                        
2021-07-14 23:03:30,658 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,732 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,737 INFO  [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'                        
2021-07-14 23:03:30,759 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,993 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'       

So I do not think the following configs are able to disable the cache.

cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false
tlvu commented 3 years ago

@fmigneault After PR https://github.com/bird-house/birdhouse-deploy/pull/182 is merged, our Jenkins nightly in prod found 408 code in the stress-test.ipynb this morning. I have never seen this 408 before. Previously it was 500 and 401 only. Should I open a new different issue or keep the discussion here?

docker logs twitcher has a few security check failed but no stack-trace at all.

2021-08-03 14:01:49,296 INFO  [TWITCHER:158][waitress-2] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'                 
2021-08-03 14:01:49,321 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-08-03 14:01:49,407 WARNI [waitress.queue:117][MainThread] Task queue depth is 1                                                                  
2021-08-03 14:01:49,430 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-08-03 14:01:49,437 WARNI [TWITCHER:30][waitress-2] security check failed.  

Magpie logs has nothing suspicious.

tlvu commented 3 years ago

https://hirondelle.crim.ca/magpie/version still has Magpie 3.12.0, the PR bird-house/birdhouse-deploy#182 has not been autodeployed?

fmigneault commented 3 years ago

I think its ok to leave it in the same issue since the cause could be off same nature.

408 is generated by the stress test that defines timeout=int(os.getenv("TEST_TIMEOUT_ABORT", 5)) when submitting requests. I don't know why that WPS was slow to answer though, especially that caching is not involved. Maybe it was a one-off case that just so happened to reach the limit, but usually it is much lower than that (~0.1s).

For hirondelle, I'm not sure if it is auto-deployed or done manually by @matprov I cannot see any jenkins job related to that, Maybe you know @dbyrns ?

dbyrns commented 3 years ago

No sorry, I cannot answer that for @matprov. Outarde seems down right now. Maybe the update didn't work well?

fmigneault commented 3 years ago

Outarde is also on magpie:3.12 Not sure why it is not up. It is extremely slow. I get following error:

> pavics-compose ps
reading './components/monitoring/default.env'
COMPOSE_CONF_LIST=-f docker-compose.yml -f ./components/scheduler/docker-compose-extra.yml -f ./components/monitoring/docker-compose-extra.yml
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

I've never seen that one before.

matprov commented 3 years ago

@fmigneault Outarde is now up. Docker seems to have misbehaved and needed some love on Outarde.

For hirondelle, I'm not sure if it is auto-deployed or done manually

What exactly do you make reference to? New version of magpie?

fmigneault commented 3 years ago

@matprov Since you were in vacation, I tried looking into re-starting pavics-compose up -d to reboot the stack, but I wasn't sure if there was some auto-deploy/update of images involved for this server with a Jenkins task or if it was done manually by ssh'ing on the server.

matprov commented 3 years ago

@fmigneault Usually a simple pavics-compose up -d is enough. However, in this case, the Docker daemon was unreachable for an unknown reason. It had to be restarted.

Also, the docker-proxy was still proxying some non-existent containers, which led to an impossibility to restart the stack, since ports were already in use. This was the first time it happened, for an unknown reason. Maybe Docker forgot to clean itself during the restart of the service, which led to proxy issues.

Following these two issues and the fact that staging environment was running smoothly, it's safe to say that the issue isn't stack related, but Docker operationalization related.

tlvu commented 3 years ago

FYI both https://outarde.crim.ca/magpie/version and https://hirondelle.crim.ca/magpie/version still has Magpie 3.12.0 as of this writting. Probably missing the manual pull on your production fork https://github.com/crim-ca/birdhouse-deploy? I would suggest maybe leaving hirondelle on the main repo and only have outarde on your fork?

matprov commented 3 years ago

I would suggest maybe leaving hirondelle on the main repo and only have outarde on your fork?

This might be a good idea, in order to test the latest changes on hirondelle without having to update the fork.

@dbyrns What's your position on this one? Staging and prod are both pointing towards our fork right now, but we might benefit from pointing staging to the official repo.

fmigneault commented 3 years ago

@matprov What is the purpose of that fork? I don't see any diff between master of both repos. Were they on a fork only because of the transition phase from the old stacks to birdhouse-deploy restructure? As far as I'm aware, they could use bird-house repo directly now.

Personally, I would like to have hirondelle/staging auto-update to master for regularly testing recent changes, and leave outarde/prod with manual update.

matprov commented 3 years ago

Were they on a fork only because of the transition phase from the old stacks to birdhouse-deploy restructure?

@fmigneault Actually no. They are on the fork to prevent prod auto-update. Using the fork repo makes sure we are not updating everytime we merge on birdhouse/birdhouse-deploy. And yes, auto-updating staging (so pointing towards the real repo, not the fork) makes sense for QA.

I don't see any diff between master of both repos.

I synced the fork last week.

fmigneault commented 3 years ago

@matprov I'm not sure what is causing the auto-update you are mentioning, and why/how it avoids it. Given the branches have no diff, whatever causes auto-update is an external cron job or a custom config not staged in the main repo.

dbyrns commented 3 years ago

I agree. Our staging env (hirondelle) following birdhouse-deploy/master and our prod env (outarde) following our fork for deployment control makes perfect sense.

matprov commented 3 years ago

..whatever causes auto-update is an external cron job or a custom config not staged in the main repo.

@fmigneault It's the autodeploy component from birdhouse. Autodeploy via this component is enabled on both staging (every 10 minutes) and prod instance (at midnight). Now, to avoid autodeploy, it all depends which is the remote. Either the "real" repo or the fork.

fmigneault commented 3 years ago

@matprov @tlvu Wouldn't removing the ./components/scheduler from EXTRA_CONF_DIRS disable the auto-deploy?

matprov commented 3 years ago

@fmigneault We want to keep the auto-deploy, but having prod synced on its own fork (see https://birdhouse-deploy.readthedocs.io/en/latest/contributing.html) instead of the birdhouse/birdhouse-deploy repo. Changing staging's remote to birdhouse/birdhouse-deploy repo will do the trick.

tlvu commented 3 years ago

Our staging env (hirondelle) following birdhouse-deploy/master and our prod env (outarde) following our fork for deployment control makes perfect sense.

Another reason for the staging env to follow the real repo is to catch errors earlier and allow for "hot fixes" to avoid the previous magpie rollback. This is what I understood. You guys prefer hot fixes than rollback so the staging env will be able to test the hot fix. The fork is only needed to ensure production stability.

For the record, our staging env (medus) is also on the real repo. Only our production is on the fork.

matprov commented 3 years ago

@tlvu I totally agree, this is the way to go. Hirondelle now points to the real repo.