bird-house / twitcher

Security Proxy for Web Processing Services (WPS)
http://twitcher.readthedocs.io/en/latest/
Apache License 2.0
16 stars 4 forks source link

Intermittant thredds read errors behind twitcher #123

Open tlogan2000 opened 1 year ago

tlogan2000 commented 1 year ago

Describe the bug

We have noticed intermittant netcdf read errors when accessing opendap links in the PAVICS jupyterhub. Approximate date when the problem began ~ March 01 2023

To Reproduce

The intermittant nature of this problem makes reproducing somewhat difficult but a public notebook on the PAVICS server is available here : https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Random_Thredds_read_errors.ipynb

The notebook will execute a relatively large workflow and uses multiple dask worker processes to accentuate the possibility of a read error

Multiple notebook runs (note only ~5-6 due to time needed) have shown that bypassing twitcher (i.e. thredds with nginx proxy only) always allows successful completion of the calculations whereas accessing opendap links behind the nginx/twitcher combinations typically results in a read error relatively quickly in the workflow.

Although not quantified there also seems to be a general performance hit when accessing data via nginx/twitcher (print outs of execution times in the workflow loop between 25-40 sec with nginx/twitcher versus 18-30 seconds with nginx only). Note also, that the notebook runs 'nginx-only' code first so I do not believe the performance difference is benefiting from caching of data or if it is should benefit the 'twitcher/nginx' run.

Expected behavior

Execution of code without read error

fmigneault commented 1 year ago

@tlogan2000 @tlvu

The last error in the notebook displays this link as the cause of error: https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip6/pcic/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml Which returns the following when I call it manually:

Error {
    code = 400;
    message = "Unrecognized request";
};

This looks like the error I usually get when doing an incorrectly formed request towards THREDDS.

Using the corresponding https://pavics.ouranos.ca/testthredds/catalog/testdatasets/CanDCS-U6/catalog.html?dataset=testdatasets/CanDCS-U6/day_BCCAQv2%2BANUSPLIN300_MPI-ESM1-2-LR_historical%2Bssp245_r1i1p1f1_gn_1950-2100.ncml produces the expected results.

I'm guessing there must be a parameter that is incorrectly parsed or forwarded somewhere, which causes the Twitcher redirect to be incorrect. I notice that the testthredds URL encodes the literal + as %2B (as it should, otherwise + = space) when I hover the links in THREDDS. The Twitcher-protected THREDDS using the Catalog links is also valid, and returns the right response, but not with the OpenDAP generated link:

So, I'm not sure where the notebook retrieves the badly formed link from, without the appropriate URL encoding, and why it doesn't do the same problem using testthredds. Maybe THREDDS internally handles the auto-redirect when it detects literal + (?), but Twitcher doesn't (and shouldn't) do it.

Is it possible to check if some Magpie/Twitcher/Nginx logs are produced on the Ouranos instance when calling the first link (maybe need to start them with DEBUG to getter more descriptive logs)?

tlogan2000 commented 1 year ago

Using the corresponding https://pavics.ouranos.ca/testthredds/catalog/testdatasets/CanDCS-U6/catalog.html?dataset=testdatasets/CanDCS-U6/day_BCCAQv2%2BANUSPLIN300_MPI-ESM1-2-LR_historical%2Bssp245_r1i1p1f1_gn_1950-2100.ncml produces the expected results.

Just to clarify this corresponding link is not the opendap data url as it's pointing to the thredds catalog for the dataset (note the missing the dodsC portion of the url) exposing the various data access and metadata services: https, opendap, netcdfsubset etc... The exact equivalent would be https://pavics.ouranos.ca/testthredds/dodsC/testdatasets/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml which gives the same unrecognized request error. This is the 'normal' response that i have always seen when manually navigating to an opendap link. However the xr.open_dataset(url) call will succeed in reading the data

import xarray as xr
url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip6/pcic/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml"
url1 = "https://pavics.ouranos.ca/testthredds/dodsC/testdatasets/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml"

display(xr.open_dataset(url))
display(xr.open_dataset(url1))

Both of the above examples work when run as single lines, but the Twitcher version will result in a read error a very small percentage of the time making long running workflow calculations problematic

tlvu commented 1 year ago

Old Twitcher/Magpie performance problem that might be related to the performance hit observed by Travis: https://github.com/bird-house/twitcher/issues/97#issuecomment-777808757. I think this "fixed" issue has not fixed completely the problem.

We have 2 problems: performance and intermittent access/read error.

The intermittent read error problem faced by Travis is with caching disabled on our production deployment, see https://github.com/bird-house/birdhouse-deploy/compare/7dcb2081cdcd876ac9df10cbd4869e2e6a0b3d3a...Ouranosinc:birdhouse-deploy:8bdf5271df66803730da6a0c34d46e2952399ab2

We've seen that when caching is enable, we have more intermittent read problem, and that's not just when xarray is hitting Thredds. A simple DescribeService GET request on any WPS could also result in error so that's why we disabled it on our production pretty much since caching was introduced. Caching was introduced to solve the performance problem.

Is it possible that the caching config is not honored anymore or caching is permanently enabled in recent Twitcher/Magpie?

Below is the summary of a discussion with @fmigneault on April 5.

The intermittent problem has been seen by our nightly Jenkins since 2023-02-18. Nothing before since we have more than 3 months of logs.

Magpie was bumped from 3.26.0 to 3.31.0 on 2023-02-10 in this PR https://github.com/bird-house/birdhouse-deploy/pull/283. It went live on Ouranos production on 2023-02-14 via this merge and overnight autodeploy https://github.com/Ouranosinc/birdhouse-deploy/commit/f72d36c2d51a49edd4e124206c2481a0f4a9310a.

Jenkins nightly logs investigation summary:

$ pwd
/var/lib/docker/volumes/jenkins_master_home_80/_data/jobs/PAVICS-e2e-workflow-tests/branches/master/builds

# we have a lot of history before build 1863 that starts to have the intermittent read problem
$ ls
1711  1720  1729  1738  1747  1756  1765  1774  1783  1792  1801  1810  1819  1828  1837  1846  1855  1864  1873  1882  1891  1900  1909
1712  1721  1730  1739  1748  1757  1766  1775  1784  1793  1802  1811  1820  1829  1838  1847  1856  1865  1874  1883  1892  1901  1910
1713  1722  1731  1740  1749  1758  1767  1776  1785  1794  1803  1812  1821  1830  1839  1848  1857  1866  1875  1884  1893  1902  lastFailedBuild
1714  1723  1732  1741  1750  1759  1768  1777  1786  1795  1804  1813  1822  1831  1840  1849  1858  1867  1876  1885  1894  1903  lastStableBuild
1715  1724  1733  1742  1751  1760  1769  1778  1787  1796  1805  1814  1823  1832  1841  1850  1859  1868  1877  1886  1895  1904  lastSuccessfulBuild
1716  1725  1734  1743  1752  1761  1770  1779  1788  1797  1806  1815  1824  1833  1842  1851  1860  1869  1878  1887  1896  1905  lastUnstableBuild
1717  1726  1735  1744  1753  1762  1771  1780  1789  1798  1807  1816  1825  1834  1843  1852  1861  1870  1879  1888  1897  1906  lastUnsuccessfulBuild
1718  1727  1736  1745  1754  1763  1772  1781  1790  1799  1808  1817  1826  1835  1844  1853  1862  1871  1880  1889  1898  1907  legacyIds
1719  1728  1737  1746  1755  1764  1773  1782  1791  1800  1809  1818  1827  1836  1845  1854  1863  1872  1881  1890  1899  1908  permalinks

$ grep 'Could not find service' */log
1863/log:[2023-02-18T00:53:47.669Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1865/log:[2023-02-20T00:55:05.763Z]         <ExceptionText>Could not find service: raven</ExceptionText>
1869/log:[2023-02-23T23:11:06.578Z]         <ExceptionText>Could not find service: hummingbird</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.580Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.580Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1870/log:[2023-02-24T01:22:38.627Z] context: <?xml^ version="1.0" encoding="utf-8"?><ExceptionReport version="1.0.0" xmlns="http://www.opengis.net/ows/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd"> <Exception exceptionCode="NoApplicableCode" locator="NotAcceptable"> <ExceptionText>Could not find service: thredds</ExceptionText> </Exception></ExceptionReport>
1871/log:[2023-02-25T01:15:56.334Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1883/log:[2023-03-09T01:14:03.264Z]         <ExceptionText>Could not find service: flyingpigeon</ExceptionText>
1884/log:[2023-03-10T01:08:27.659Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1885/log:[2023-03-11T01:04:03.039Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1891/log:[2023-03-17T00:50:30.267Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1893/log:[2023-03-19T00:52:47.966Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1895/log:[2023-03-21T01:03:05.097Z]         <ExceptionText>Could not find service: finch</ExceptionText>
1898/log:[2023-03-24T01:09:52.653Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1898/log:[2023-03-24T01:09:52.653Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1899/log:[2023-03-25T01:15:08.568Z]         <ExceptionText>Could not find service: finch</ExceptionText>
1899/log:[2023-03-25T01:15:08.569Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1899/log:[2023-03-25T01:15:08.569Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1902/log:[2023-03-28T01:09:04.500Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1906/log:[2023-04-01T01:19:54.171Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1906/log:[2023-04-01T01:19:54.171Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
1910/log:[2023-04-05T01:06:40.663Z]         <ExceptionText>Could not find service: thredds</ExceptionText>
tlvu commented 1 year ago

Just an update, the intermittent failure is detected very frequently with the stress-test.ipynb during our Jenkins nightly runs.

Screenshot from 2023-05-01 09-13-44

fmigneault commented 1 year ago

@tlvu
If you could extract the docker-compose logs with Magpie/Twitcher outputs corresponding to those failing stress-test runs, that would most probably help pinpoint where the problem occurs.

tlvu commented 1 year ago

@tlvu If you could extract the docker-compose logs with Magpie/Twitcher outputs corresponding to those failing stress-test runs, that would most probably help pinpoint where the problem occurs.

@fmigneault Waiting for a failure. The logs for 28-Apr already gone. And you want logs from docker logs twitcher or docker logs magpie ?

fmigneault commented 1 year ago

@tlvu Thanks. No rush. Logs from magpie, twitcher, thredds and proxy to have the full chain of URL redirects that led to the error.

tlvu commented 1 year ago

In docker logs twitcher: Could not get service with name thredds

2023-05-09 01:24:11,868 INFO  [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:11,914 INFO  [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:11,930 INFO  [TWITCHER|magpie.adapter.magpieowssecurity:245][ThreadPoolExecutor-0_3] User None resolved with allowed 'browse' access to [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:11,930 INFO  [TWITCHER:57][ThreadPoolExecutor-0_3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:11,959 INFO  [TWITCHER|magpie.adapter.magpieowssecurity:230][ThreadPoolExecutor-0_0] User None is requesting 'browse' permission on [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:11,988 INFO  [TWITCHER|magpie.adapter.magpieowssecurity:245][ThreadPoolExecutor-0_0] User None resolved with allowed 'browse' access to [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:11,988 INFO  [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:11,990 INFO  [TWITCHER|magpie.adapter.magpieowssecurity:245][ThreadPoolExecutor-0_2] User None resolved with allowed 'browse' access to [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:11,990 INFO  [TWITCHER:57][ThreadPoolExecutor-0_2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,034 INFO  [TWITCHER:57][ThreadPoolExecutor-0_1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,042 INFO  [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,056 INFO  [TWITCHER:57][ThreadPoolExecutor-0_1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,076 INFO  [TWITCHER|magpie.adapter.magpieowssecurity:230][ThreadPoolExecutor-0_1] User None is requesting 'browse' permission on [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:12,077 INFO  [TWITCHER:57][ThreadPoolExecutor-0_2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,078 INFO  [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'

2023-05-09 01:24:12,094 ERROR [TWITCHER:50][ThreadPoolExecutor-0_2] Could not get service with name thredds

2023-05-09 01:24:12,122 INFO  [TWITCHER|magpie.adapter.magpieowssecurity:230][ThreadPoolExecutor-0_0] User None is requesting 'browse' permission on [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:12,123 INFO  [TWITCHER|magpie.adapter.magpieowssecurity:245][ThreadPoolExecutor-0_0] User None resolved with allowed 'browse' access to [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:12,123 INFO  [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
tlvu commented 1 year ago

Matching logs from thredds container, /usr/local/tomcat/logs/localhost_access_log.2023-05-09.txt file:

172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541                                                                                                                                                                                                  
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541                                                                                                                                                                                                  
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541                                                                                                                                                                                                  
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541                                                                                                                                                                                                  
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541                                                                                                                                                                                                  
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/nrcan/50MC-1yrF-gz7/tasmax/catalog.html?dataset=birdhouse%2Ftestdata%2Fnrcan%2F50MC-1yrF-gz7%2Ftasmax
%2Fnrcan_canada_daily_tasmax_2009.nc HTTP/1.1" 200 4844
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.
1" 200 4541
tlvu commented 1 year ago

No errors in docker logs magpie and access_log from proxy container already gone.

So @fmigneault that's all I've got, sorry no stack-trace for you. If you need something else, please be specific with what command and where I should look it up.

tlvu commented 1 year ago

Below all the "could not get service name" type of error in docker logs twitcher. There are no stack-trace around those errors:

$ docker logs twitcher 2>&1 | grep ERROR
2023-05-04 16:53:12,723 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-04 17:29:58,124 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-05 22:20:01,798 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name malleefowl
2023-05-07 00:38:04,647 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-07 00:50:35,185 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name flyingpigeon
2023-05-08 00:55:09,584 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name thredds
2023-05-08 00:58:43,493 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name thredds
2023-05-08 01:17:03,611 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name finch
2023-05-08 09:19:19,819 ERROR [TWITCHER:50][ThreadPoolExecutor-0_2] Could not get service with name thredds
2023-05-09 01:24:12,094 ERROR [TWITCHER:50][ThreadPoolExecutor-0_2] Could not get service with name thredds
fmigneault commented 1 year ago

@tlvu Do you have Twitcher logger running in DEBUG level? Default is INFO: https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/config/twitcher/default.env#L4

It seems the critical error information would be obtained from this. https://github.com/bird-house/twitcher/blob/master/twitcher/owsregistry.py#L47-L50

There's probably something raising in Magpie that propagates all the way to this step, but hard to investigate further without it.

tlvu commented 1 year ago

Do you have Twitcher logger running in DEBUG level? Default is INFO: https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/config/twitcher/default.env#L4

@fmigneault

Tried setting export TWITCHER_LOG_LEVEL=DEBUG in env.local on my dev VM, run a Jenkins on it and docker logs twitcher 2>&1 | grep DEBUG shows nothing, is that normal?

fmigneault commented 1 year ago

There should be some URLs logged at least because of this line that would be called on any allowed request: https://github.com/bird-house/twitcher/blob/e4c34b0e379f4a31e4da3747c4fd7d45d9bf3ab4/twitcher/owsproxy.py#LL77C19-L77C27

fmigneault commented 1 year ago

Found the problem. This line: https://github.com/bird-house/birdhouse-deploy/blob/0d0a8b934f764ffeedbe2b8092f12ee6f26089c9/birdhouse/config/twitcher/twitcher.ini.template#L115 should be qualname = TWITCHER because of this: https://github.com/bird-house/twitcher/blob/e4c34b0e379f4a31e4da3747c4fd7d45d9bf3ab4/twitcher/owsregistry.py#LL5C39-L5C39

tlvu commented 1 year ago

DEBUG logging enabled in prod, see commit https://github.com/Ouranosinc/birdhouse-deploy/commit/9d1e37a1a091346a749f84f13a539fabbe7a7fad

tlvu commented 1 year ago
2023-05-17 02:21:20,610 DEBUG [TWITCHER:49][ThreadPoolExecutor-0_3] Could not get service with name thredds
Traceback (most recent call last):
  File "/opt/birdhouse/src/twitcher/twitcher/owsregistry.py", line 46, in get_service_by_name
    service = self.store.fetch_by_name(name=name)
  File "/opt/local/src/magpie/magpie/adapter/magpieservice.py", line 173, in fetch_by_name
    service = self._fetch_by_name_cached(name)
  File "/usr/local/lib/python3.7/site-packages/beaker/cache.py", line 565, in cached
    return func(*args, **kwargs)
  File "/opt/local/src/magpie/magpie/adapter/magpieservice.py", line 147, in _fetch_by_name_cached
    service = MagpieService.by_service_name(service_name, db_session=session)
  File "/opt/local/src/magpie/magpie/models.py", line 684, in by_service_name
    service = session.query(Service).filter(Resource.resource_name == service_name).first()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2824, in first
    return self.limit(1)._iter().first()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2919, in _iter
    execution_options={"_sa_orm_load_options": self.load_options},
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1693, in execute
    for idx, fn in enumerate(events_todo):
RuntimeError: deque mutated during iteration
2023-05-17 02:21:20,622 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
fmigneault commented 1 year ago

Worth a try with Magpie 3.33.0 and Twitcher 0.9.0 using Python 3.11 to see if anything improves.

Otherwise, looks like an internal SQLAlchemy problem.