Reproduced (after a few tries, since this is intermittent) with CRIM's Jenkins against CRIM's Hirondelle as well:
https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/407/console
09:05:48 _ raven-master/docs/source/notebooks/Subset_climate_data_over_watershed.ipynb::Cell 1 _
09:05:48 Notebook cell execution failed
09:05:48 Cell 1: Cell execution caused an exception
09:05:48
09:05:48 Input:
09:05:48 # Set the links to the servers.
09:05:48 # Note that if Finch is a remote server, Raven needs to be accessible on the net because some cells
09:05:48 # below use the output from Raven processes to feed into Finch.
09:05:48
09:05:48 url_finch = os.environ.get(
09:05:48 "FINCH_WPS_URL", "https://hirondelle.crim.ca/twitcher/ows/proxy/finch/wps"
09:05:48 )
09:05:48
09:05:48 url_raven = os.environ.get(
09:05:48 "WPS_URL", "https://hirondelle.crim.ca/twitcher/ows/proxy/raven/wps"
09:05:48 )
09:05:48
09:05:48 # Establish client connexions to the remote servers
09:05:48 finch = birdy.WPSClient(url_finch)
09:05:48 raven = birdy.WPSClient(url_raven)
09:05:48
09:05:48 Traceback:
09:05:48
09:05:48 ---------------------------------------------------------------------------
09:05:48 ServiceException Traceback (most recent call last)
09:05:48 /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in __init__(self, url, processes, converters, username, password, headers, auth, verify, cert, verbose, progress, version, caps_xml, desc_xml, language)
09:05:48 117 try:
09:05:48 --> 118 self._wps.getcapabilities(xml=caps_xml)
09:05:48 119 except ServiceException as e:
09:05:48
09:05:48 /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in getcapabilities(self, xml)
09:05:48 274 self._capabilities = reader.readFromUrl(
09:05:48 --> 275 self.url, headers=self.headers)
09:05:48 276
09:05:48
09:05:48 /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in readFromUrl(self, url, username, password, headers, verify, cert)
09:05:48 554 username=username, password=password,
09:05:48 --> 555 headers=headers, verify=verify, cert=cert)
09:05:48 556
09:05:48
09:05:48 /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in _readFromUrl(self, url, data, timeout, method, username, password, headers, verify, cert)
09:05:48 508 1], method='Get', username=self.auth.username, password=self.auth.password,
09:05:48 --> 509 headers=headers, verify=self.auth.verify, cert=self.auth.cert, timeout=self.timeout)
09:05:48 510 return etree.fromstring(u.read())
09:05:48
09:05:48 /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/util.py in openURL(url_base, data, method, cookies, username, password, timeout, headers, verify, cert, auth)
09:05:48 210 if req.status_code in [400, 401]:
09:05:48 --> 211 raise ServiceException(req.text)
09:05:48 212
09:05:48
09:05:48 ServiceException: <?xml version="1.0" encoding="utf-8"?>
09:05:48 <ExceptionReport version="1.0.0"
09:05:48 xmlns="http://www.opengis.net/ows/1.1"
09:05:48 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
09:05:48 xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
09:05:48 <Exception exceptionCode="NoApplicableCode" locator="AccessForbidden">
09:05:48 <ExceptionText>Not authorized to access this resource. User does not meet required permissions.</ExceptionText>
09:05:48 </Exception>
09:05:48 </ExceptionReport>
09:05:48
09:05:48 During handling of the above exception, another exception occurred:
09:05:48
09:05:48 UnauthorizedException Traceback (most recent call last)
09:05:48 <ipython-input-2-6e0bd7322976> in <module>
09:05:48 13 # Establish client connexions to the remote servers
09:05:48 14 finch = birdy.WPSClient(url_finch)
09:05:48 ---> 15 raven = birdy.WPSClient(url_raven)
09:05:48
09:05:48 /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in __init__(self, url, processes, converters, username, password, headers, auth, verify, cert, verbose, progress, version, caps_xml, desc_xml, language)
09:05:48 120 if "AccessForbidden" in str(e):
09:05:48 121 raise UnauthorizedException(
09:05:48 --> 122 "You are not authorized to do a request of type: GetCapabilities"
09:05:48 123 )
09:05:48 124 raise
09:05:48
09:05:48 UnauthorizedException: You are not authorized to do a request of type: GetCapabilities
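Since the failure is intermittent and the same call normally succeeds on a later attempt, one way to quantify it (or work around it in the notebook) is to retry the client creation. A minimal sketch, assuming only birdy as in the failing cell; connect_with_retry is a hypothetical helper, not part of birdy:

import os
import time

import birdy

url_raven = os.environ.get(
    "WPS_URL", "https://hirondelle.crim.ca/twitcher/ows/proxy/raven/wps"
)

def connect_with_retry(url, attempts=5, delay=2):
    # Hypothetical helper: retry WPSClient creation on the intermittent 401.
    last_exc = None
    for _ in range(attempts):
        try:
            return birdy.WPSClient(url)
        except Exception as exc:  # birdy wraps the 401 in an UnauthorizedException
            last_exc = exc
            time.sleep(delay)
    raise last_exc

raven = connect_with_retry(url_raven)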
Latest run seems to try accessing this file:
"https://hirondelle.crim.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/raven/raven-gr4j-cemaneige/Salmon-River-Near-Prince-George_meteo_daily.nc"
but I don't see it when I go through the Thredds browser.
Oh right, that data and all Raven testdata can be deployed by activating https://github.com/bird-house/birdhouse-deploy/blob/d90765acabe248e65c4899929fbe37a9e8661643/birdhouse/env.local.example#L180-L184 on Hirondelle.
Just to be clear, the missing testdata file is unrelated to the intermittent UnauthorizedException: You are not authorized to do a request of type: GetCapabilities error. We have this same intermittent error on our prod, and our prod has all the required testdata.
For reference, from https://github.com/Ouranosinc/PAVICS-e2e-workflow-tests/pull/74#issuecomment-849161468, a possible lead:
Below is docker logs twitcher on Medus, matching the timestamp of the Flyingpigeon 500 failure in https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/stress-test/10/console. There were hits on other services during the 100 hits on Flyingpigeon; I suspect those other hits come from the Canarie-api monitoring, which hits all the WPS services every minute. Otherwise this is a staging server, with much less activity than the production server.
There were no errors in the Magpie logs.
2021-05-26 21:22:02,150 INFO [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:02,174 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,397 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,403 INFO [TWITCHER:132][waitress-2] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:02,430 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,652 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,659 INFO [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:02,691 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,737 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,743 INFO [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/ncWMS2/wms'
2021-05-26 21:22:02,761 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,898 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,901 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:02,907 WARNI [TWITCHER:100][waitress-1] Using cached service
2021-05-26 21:22:02,908 ERROR [TWITCHER:33][waitress-1] unknown error
Traceback (most recent call last):
File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween
security.check_request(request)
File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request
permission_requested = service_impl.permission_requested()
File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested
raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))
NotImplementedError: Undefined 'Permission' from 'request' parameter: none
2021-05-26 21:22:02,921 INFO [TWITCHER:132][waitress-2] 'None' request 'browse' permission on '/ows/proxy/thredds/catalog.html'
2021-05-26 21:22:02,941 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,009 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,016 INFO [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:03,040 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,225 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,231 INFO [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:03,256 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,284 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
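If concurrent hits are indeed the trigger, the lead above should be reproducible by firing parallel GetCapabilities requests ourselves. A rough sketch, assuming only the requests library; PAVICS_HOST is a hypothetical placeholder for the instance under test, not a variable used by the stack:

import os
from concurrent.futures import ThreadPoolExecutor

import requests

# PAVICS_HOST is a hypothetical placeholder for the target instance.
base = os.environ.get("PAVICS_HOST", "https://hirondelle.crim.ca")
url = base + "/twitcher/ows/proxy/flyingpigeon/wps"
params = {"service": "WPS", "request": "GetCapabilities"}

def hit(_):
    return requests.get(url, params=params).status_code

# 100 hits in parallel, mimicking the stress test overlapping the monitoring.
with ThreadPoolExecutor(max_workers=10) as pool:
    codes = list(pool.map(hit, range(100)))

print({code: codes.count(code) for code in set(codes)})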
FYI, we also started to see intermittent Authorization failures for Thredds data in our nightly prod run http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1135/console.
OSError: [Errno -78] NetCDF: Authorization failure: b'https://boreas.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml'
The error above should not even be possible, since everything on our Thredds is public for the moment.
I will try to find the matching logs, but since this is a production server there is a lot of activity, so I am not sure I can pinpoint the right entries.
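Until the root cause is found, the notebook side could at least retry the open, since the dataset is public and a later attempt normally succeeds. A minimal sketch, assuming (not confirmed) that the failing notebook reads the dodsC URL through xarray/netCDF4:

import time

import xarray as xr  # assumption: the notebook reads the URL via xarray/netCDF4

url = (
    "https://boreas.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/"
    "simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/"
    "day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml"
)

# Retry a few times: the "NetCDF: Authorization failure" appears to be transient.
for attempt in range(5):
    try:
        ds = xr.open_dataset(url)
        break
    except OSError:
        if attempt == 4:
            raise
        time.sleep(2)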
Matching logs for https://github.com/Ouranosinc/Magpie/issues/433#issuecomment-879272044
2021-07-13 00:39:04,607 INFO [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/hummingbird'
2021-07-13 00:39:04,627 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,643 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,647 INFO [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,668 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,687 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,692 INFO [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,714 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,744 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,746 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,750 WARNI [TWITCHER:100][waitress-1] Using cached service
2021-07-13 00:39:04,750 ERROR [TWITCHER:33][waitress-1] unknown error
Traceback (most recent call last):
File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween
security.check_request(request)
File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request
permission_requested = service_impl.permission_requested()
File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested
raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))
NotImplementedError: Undefined 'Permission' from 'request' parameter: none
2021-07-13 00:39:04,751 INFO [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,773 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,783 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,789 INFO [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/wms/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc'
2021-07-13 00:39:04,794 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,807 INFO [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
@tlvu Can you try adding these settings to both Magpie and Twitcher INI configs and let me know if the errors can be replicated again?
cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false
@fmigneault you want to add these to prod even if the PR about cache config (https://github.com/bird-house/birdhouse-deploy/pull/174) is not merged yet? You suspect portions of the caching mechanism have been enabled?
> Can you try adding these settings to both Magpie and Twitcher INI configs and let me know if the errors can be replicated again?
@fmigneault done: https://github.com/Ouranosinc/birdhouse-deploy/commit/5c23e8b9edd115dd82e3665cb0f40dd36c408f44 (can you double-check that I've inserted them in the proper location in those 2 config files?). I've run ./pavics-compose.sh restart twitcher magpie
for these changes to take effect. Let me know if a restart is enough or if I have to destroy and recreate those containers.
@fmigneault FYI, I ran Jenkins on our prod after applying the cache configs you proposed in https://github.com/Ouranosinc/Magpie/issues/433#issuecomment-879389809; stress-test.ipynb
still fails (https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/fix-stress-tests.ipynb-not-showing-error-in-jenkins/6/console) and I still have the following stack trace in the Twitcher logs:
2021-07-13 22:16:18,506 INFO [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/Final/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dds'
2021-07-13 22:16:18,543 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,623 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,632 INFO [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/Final/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.das'
2021-07-13 22:16:18,659 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,769 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,772 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,778 WARNI [TWITCHER:100][waitress-3] Using cached service
2021-07-13 22:16:18,779 ERROR [TWITCHER:33][waitress-3] unknown error
Traceback (most recent call last):
File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween
security.check_request(request)
File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request
permission_requested = service_impl.permission_requested()
File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested
raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))
NotImplementedError: Undefined 'Permission' from 'request' parameter: none
2021-07-13 22:16:18,780 INFO [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/Final/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dds'
2021-07-13 22:16:18,813 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,853 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,859 INFO [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'
2021-07-13 22:16:18,881 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,896 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 22:16:18,901 INFO [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/Final/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dods'
@tlvu The settings look all right, but the line
2021-07-13 22:16:18,778 WARNI [TWITCHER:100][waitress-3] Using cached service
indicates that somehow caching is still enabled.
The error
raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))
is a side effect of https://github.com/Ouranosinc/Magpie/pull/439. This error can only occur when caching is active; otherwise, the request is parsed from scratch each time.
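To illustrate why caching makes the difference, here is a toy sketch (explicitly not Magpie's actual code): if the cache hands the same service instance to concurrent workers and that instance carries per-request state, one worker can observe state another worker has already reset, and the permission parses as "none":

class CachedService:
    # Toy stand-in for a cached, shared service object (not Magpie's code).

    def __init__(self):
        self.request = None  # per-request state living on a shared instance

    def permission_requested(self):
        if self.request is None:
            raise NotImplementedError(
                "Undefined 'Permission' from 'request' parameter: none"
            )
        return self.request

shared = CachedService()  # "Using cached service": everyone gets the same instance

def handle(permission):
    shared.request = permission
    # ...a concurrent worker can reset the shared state right here...
    shared.request = None  # simulate that interleaving explicitly
    return shared.permission_requested()  # raises, as in the Twitcher logs

Without caching, each request builds its own service instance and parses the permission from scratch, so this interleaving cannot happen.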
I'm fine with keeping the cache settings disabled until bird-house/birdhouse-deploy#174 is working with all tests. I'm not sure why the instances see them as enabled though; toggling those settings worked when I was creating the stress-test script. Is the test pointing to a server that applied them?
@fmigneault I've destroyed and recreated the twitcher and magpie containers on our production to ensure the configs that force-disable the cache are active. I still have these errors in the Twitcher logs when running stress-test.ipynb
(https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/497/console):
2021-07-14 23:03:30,178 INFO [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/station_obs/ECCC_AHCCD_gen3_temperature.ncml.dods'
2021-07-14 23:03:30,201 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-14 23:03:30,342 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-14 23:03:30,348 INFO [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'
2021-07-14 23:03:30,370 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-14 23:03:30,623 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-14 23:03:30,627 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-14 23:03:30,630 WARNI [TWITCHER:100][waitress-3] Using cached service
2021-07-14 23:03:30,630 ERROR [TWITCHER:33][waitress-3] unknown error
Traceback (most recent call last):
File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween
security.check_request(request)
File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request
permission_requested = service_impl.permission_requested()
File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested
raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))
NotImplementedError: Undefined 'Permission' from 'request' parameter: none
2021-07-14 23:03:30,635 INFO [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/station_obs/ECCC_AHCCD_gen3_temperature.ncml.dods'
2021-07-14 23:03:30,658 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-14 23:03:30,732 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-14 23:03:30,737 INFO [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'
2021-07-14 23:03:30,759 INFO [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-14 23:03:30,993 INFO [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
So I do not think the following configs are able to disable the cache.
cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false
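One way to rule out editing the wrong files is to dump the cache.* settings that the containers actually load. A quick sketch to run with the Python interpreter inside each container; the INI paths are assumptions inferred from the traceback paths and may need adjusting:

from configparser import ConfigParser

# Assumed locations, inferred from the tracebacks; adjust to the real paths.
ini_files = [
    "/opt/birdhouse/src/twitcher/twitcher.ini",
    "/opt/local/src/magpie/config/magpie.ini",
]

for ini in ini_files:
    parser = ConfigParser(interpolation=None)  # Pyramid INIs use %(here)s
    parser.read(ini)
    for section in parser.sections():
        for key, value in parser.items(section):
            if key.startswith("cache."):
                print(ini, "[{}]".format(section), key, "=", value)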
@fmigneault After PR https://github.com/bird-house/birdhouse-deploy/pull/182 was merged, our Jenkins nightly on prod found a 408 code in stress-test.ipynb
this morning. I have never seen this 408 before; previously it was 500 and 401 only. Should I open a new issue or keep the discussion here?
docker logs twitcher
has a few security check failed
entries, but no stack trace at all.
2021-08-03 14:01:49,296 INFO [TWITCHER:158][waitress-2] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-08-03 14:01:49,321 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-08-03 14:01:49,407 WARNI [waitress.queue:117][MainThread] Task queue depth is 1
2021-08-03 14:01:49,430 INFO [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-08-03 14:01:49,437 WARNI [TWITCHER:30][waitress-2] security check failed.
The Magpie logs have nothing suspicious.
https://hirondelle.crim.ca/magpie/version still shows Magpie 3.12.0; has the PR bird-house/birdhouse-deploy#182 not been autodeployed?
I think it's ok to leave it in the same issue, since the cause could be of the same nature.
The 408 is generated by the stress test, which defines timeout=int(os.getenv("TEST_TIMEOUT_ABORT", 5))
when submitting requests.
I don't know why that WPS was slow to answer though, especially since caching is not involved.
Maybe it was a one-off case that just happened to reach the limit, but usually the response time is much lower than that (~0.1s).
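For context, a minimal sketch of how such a client-side timeout would surface as a 408 in the report; only the TEST_TIMEOUT_ABORT variable comes from the stress test, while the hit helper and the 408 bookkeeping are assumptions about how the script records results:

import os

import requests

timeout = int(os.getenv("TEST_TIMEOUT_ABORT", 5))

def hit(url):
    # Hypothetical helper: abort client-side after `timeout` seconds and
    # record the aborted request as HTTP 408 (Request Timeout).
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        return 408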
For hirondelle, I'm not sure if it is auto-deployed or done manually by @matprov. I cannot see any Jenkins job related to that. Maybe you know, @dbyrns?
No sorry, I cannot answer that for @matprov. Outarde seems down right now. Maybe the update didn't work well?
Outarde is also on magpie:3.12. Not sure why it is not up; it is extremely slow. I get the following error:
> pavics-compose ps
reading './components/monitoring/default.env'
COMPOSE_CONF_LIST=-f docker-compose.yml -f ./components/scheduler/docker-compose-extra.yml -f ./components/monitoring/docker-compose-extra.yml
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
I've never seen that one before.
@fmigneault Outarde is now up. Docker seems to have misbehaved on Outarde and needed some love.
> For hirondelle, I'm not sure if it is auto-deployed or done manually
What exactly are you referring to? A new version of Magpie?
@matprov
Since you were on vacation, I tried restarting the stack with pavics-compose up -d
, but I wasn't sure whether there was some auto-deploy/update of images involved for this server via a Jenkins task, or whether it was done manually by ssh'ing into the server.
@fmigneault Usually a simple pavics-compose up -d
is enough. However, in this case the Docker daemon was unreachable for an unknown reason and had to be restarted.
Also, the docker-proxy
was still proxying some non-existent containers, which made it impossible to restart the stack since the ports were already in use. This was the first time it happened, for an unknown reason. Maybe Docker failed to clean up during the restart of the service, which led to the proxy issues.
Given these two issues, and the fact that the staging environment was running smoothly, it's safe to say that the problem isn't stack related but Docker operations related.
FYI, both https://outarde.crim.ca/magpie/version and https://hirondelle.crim.ca/magpie/version still show Magpie 3.12.0 as of this writing. Probably missing the manual pull on your production fork https://github.com/crim-ca/birdhouse-deploy? I would suggest maybe leaving hirondelle on the main repo and only having outarde on your fork.
> I would suggest maybe leaving hirondelle on the main repo and only have outarde on your fork?
This might be a good idea, in order to test the latest changes on hirondelle without having to update the fork.
@dbyrns What's your position on this one? Staging and prod are both pointing towards our fork right now, but we might benefit from pointing staging to the official repo.
@matprov
What is the purpose of that fork? I don't see any diff between the master
branches of the two repos. Were they on a fork only because of the transition phase from the old stacks to the birdhouse-deploy restructure?
As far as I'm aware, they could use the bird-house repo directly now.
Personally, I would like hirondelle/staging to auto-update to master for regularly testing recent changes, and leave outarde/prod on manual updates.
> Were they on a fork only because of the transition phase from the old stacks to birdhouse-deploy restructure?
@fmigneault Actually no. They are on the fork to prevent prod auto-update: using the fork makes sure we are not updating every time we merge on birdhouse/birdhouse-deploy. And yes, auto-updating staging (so pointing it towards the real repo, not the fork) makes sense for QA.
> I don't see any diff between master of both repos.
I synced the fork last week.
@matprov I'm not sure what is causing the auto-update you are mentioning, nor why/how the fork avoids it. Given the branches have no diff, whatever causes the auto-update must be an external cron job or a custom config not staged in the main repo.
I agree. Our staging env (hirondelle) following birdhouse-deploy/master and our prod env (outarde) following our fork for deployment control makes perfect sense.
> ...whatever causes auto-update is an external cron job or a custom config not staged in the main repo.
@fmigneault It's the autodeploy component from birdhouse. Autodeploy via this component is enabled on both the staging instance (every 10 minutes) and the prod instance (at midnight). Now, to avoid autodeploying upstream changes, it all depends on which remote is used: either the "real" repo or the fork.
@matprov @tlvu Wouldn't removing ./components/scheduler from EXTRA_CONF_DIRS disable the auto-deploy?
@fmigneault We want to keep the auto-deploy, but have prod synced on its own fork (see https://birdhouse-deploy.readthedocs.io/en/latest/contributing.html) instead of the birdhouse/birdhouse-deploy repo. Changing staging's remote to the birdhouse/birdhouse-deploy repo will do the trick.
> Our staging env (hirondelle) following birdhouse-deploy/master and our prod env (outarde) following our fork for deployment control makes perfect sense.
Another reason for the staging env to follow the real repo is to catch errors earlier and allow for "hot fixes", avoiding the previous Magpie rollback situation. This is what I understood: you guys prefer hot fixes over rollbacks, so the staging env will be able to test the hot fix. The fork is only needed to ensure production stability.
For the record, our staging env (medus) is also on the real repo. Only our production is on the fork.
@tlvu I totally agree, this is the way to go. Hirondelle now points to the real repo.
Describe the bug
Running the Jenkins notebook test suite against the Ouranos production PAVICS stack, we randomly get the UnauthorizedException and the NetCDF Authorization failure errors shown above.
Full run in http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1047/console
Run the same test suite again, and the error is gone.
Note that on the Ouranos production server, all WPS services have full public access for the anonymous group.
No error is seen in docker logs magpie and docker logs twitcher.
To Reproduce
Steps to reproduce the behavior: run the test suite with --nbval-lax activated; see the build request on the CRIM side: https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/402/parameters/
Expected behavior
Should not have those UnauthorizedException errors.