tlogan2000 opened this issue 1 year ago
@tlogan2000 @tlvu
The last error in the notebook displays this link as the cause of error:
https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip6/pcic/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml
Which returns the following when I call it manually:
Error {
code = 400;
message = "Unrecognized request";
};
This looks like the error I usually get when doing an incorrectly formed request towards THREDDS.
Using the corresponding
https://pavics.ouranos.ca/testthredds/catalog/testdatasets/CanDCS-U6/catalog.html?dataset=testdatasets/CanDCS-U6/day_BCCAQv2%2BANUSPLIN300_MPI-ESM1-2-LR_historical%2Bssp245_r1i1p1f1_gn_1950-2100.ncml
produces the expected results.
I'm guessing there must be a parameter that is incorrectly parsed or forwarded somewhere, which causes the Twitcher redirect to be incorrect. When I hover the links in THREDDS, I notice that the testthredds URL encodes the literal + as %2B (as it should, since an unencoded + decodes to a space). The Twitcher-protected THREDDS is also valid and returns the right response when using the catalog links, but not with the OpenDAP-generated link:
OK: https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/datasets/simulations/bias_adjusted/cmip6/pcic/CanDCS-U6/catalog.html?dataset=datasets/simulations/bias_adjusted/cmip6/pcic/CanDCS-U6/day_BCCAQv2%2BANUSPLIN300_MPI-ESM1-2-LR_historical%2Bssp245_r1i1p1f1_gn_1950-2100.ncml
OK: https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip6/pcic/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml.html
BAD: https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip6/pcic/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml ("Data URL" under the OpenDAP page of the previous link)
So, I'm not sure where the notebook retrieves the badly formed link, without the appropriate URL encoding, nor why the same problem does not occur with testthredds. Maybe THREDDS internally handles an auto-redirect when it detects a literal + (?), but Twitcher doesn't (and shouldn't) do that.
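The `+`/`%2B` distinction can be checked with Python's standard urllib (an illustrative sketch, not part of the notebook; the dataset name below is taken from the failing link):

```python
from urllib.parse import quote, unquote, unquote_plus

name = "day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml"

# quote() percent-encodes the literal '+' so it survives URL parsing
encoded = quote(name)
print(encoded)  # day_BCCAQv2%2BANUSPLIN300_..._historical%2Bssp245_...

# unquote() leaves '+' alone, but unquote_plus() (query-string rules)
# decodes an unencoded '+' into a space, which would mangle the path
print(unquote("a+b"))       # a+b
print(unquote_plus("a+b"))  # a b
```

This is consistent with the observation above: a link containing a raw `+` is valid only as long as no component along the proxy chain applies query-string decoding rules to it.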
Is it possible to check whether any Magpie/Twitcher/Nginx logs are produced on the Ouranos instance when calling the first link (they may need to be started with DEBUG to get more descriptive logs)?
> Using the corresponding https://pavics.ouranos.ca/testthredds/catalog/testdatasets/CanDCS-U6/catalog.html?dataset=testdatasets/CanDCS-U6/day_BCCAQv2%2BANUSPLIN300_MPI-ESM1-2-LR_historical%2Bssp245_r1i1p1f1_gn_1950-2100.ncml produces the expected results.
Just to clarify: this corresponding link is not the OPeNDAP data URL, as it points to the THREDDS catalog page for the dataset (note the missing dodsC portion of the URL), which exposes the various data access and metadata services: HTTPS, OPeNDAP, NetcdfSubset, etc. The exact equivalent would be https://pavics.ouranos.ca/testthredds/dodsC/testdatasets/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml which gives the same "Unrecognized request" error. This is the 'normal' response that I have always seen when manually navigating to an OPeNDAP link. However, the xr.open_dataset(url) call will succeed in reading the data:
```python
import xarray as xr

url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip6/pcic/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml"
url1 = "https://pavics.ouranos.ca/testthredds/dodsC/testdatasets/CanDCS-U6/day_BCCAQv2+ANUSPLIN300_MPI-ESM1-2-LR_historical+ssp245_r1i1p1f1_gn_1950-2100.ncml"

display(xr.open_dataset(url))   # via Twitcher proxy
display(xr.open_dataset(url1))  # via nginx only (testthredds)
```
Both of the above examples work when run as single lines, but the Twitcher version results in a read error a very small percentage of the time, which makes long-running workflow calculations problematic.
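Until the root cause is found, one workaround for long-running workflows is to retry the open call. A minimal sketch (the `open_with_retry` helper and its parameters are hypothetical, not part of PAVICS or xarray):

```python
import time

def open_with_retry(open_func, url, retries=5, backoff=2.0):
    """Call open_func(url), retrying on failure with exponential backoff.

    open_func would typically be xarray.open_dataset; any exception is
    treated as a (possibly intermittent) read error.
    """
    for attempt in range(retries):
        try:
            return open_func(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(backoff * 2 ** attempt)

# usage (in the notebook):
# import xarray as xr
# ds = open_with_retry(xr.open_dataset, url)
```

This only papers over the intermittent failure; it does not address the performance problem discussed below.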
An old Twitcher/Magpie performance problem might be related to the performance hit observed by Travis: https://github.com/bird-house/twitcher/issues/97#issuecomment-777808757. I think that "fixed" issue did not completely fix the problem.
We have two problems: performance, and an intermittent access/read error.
The intermittent read error problem faced by Travis occurs with caching disabled on our production deployment, see https://github.com/bird-house/birdhouse-deploy/compare/7dcb2081cdcd876ac9df10cbd4869e2e6a0b3d3a...Ouranosinc:birdhouse-deploy:8bdf5271df66803730da6a0c34d46e2952399ab2
We've seen that when caching is enabled we get even more intermittent read problems, and not just when xarray is hitting THREDDS: a simple DescribeService GET request on any WPS could also result in an error. That's why we have kept caching disabled on our production pretty much since it was introduced. Caching was introduced to solve the performance problem.
Is it possible that the caching config is no longer honored, or that caching is permanently enabled in recent Twitcher/Magpie?
Below is the summary of a discussion with @fmigneault on April 5.
The intermittent problem has been seen by our nightly Jenkins since 2023-02-18, and nothing before that; we have more than 3 months of logs.
Magpie was bumped from 3.26.0 to 3.31.0 on 2023-02-10 in this PR https://github.com/bird-house/birdhouse-deploy/pull/283. It went live on Ouranos production on 2023-02-14 via this merge and overnight autodeploy https://github.com/Ouranosinc/birdhouse-deploy/commit/f72d36c2d51a49edd4e124206c2481a0f4a9310a.
Jenkins nightly logs investigation summary:
$ pwd
/var/lib/docker/volumes/jenkins_master_home_80/_data/jobs/PAVICS-e2e-workflow-tests/branches/master/builds
# we have a lot of history before build 1863, which is the first build showing the intermittent read problem
$ ls
1711 1720 1729 1738 1747 1756 1765 1774 1783 1792 1801 1810 1819 1828 1837 1846 1855 1864 1873 1882 1891 1900 1909
1712 1721 1730 1739 1748 1757 1766 1775 1784 1793 1802 1811 1820 1829 1838 1847 1856 1865 1874 1883 1892 1901 1910
1713 1722 1731 1740 1749 1758 1767 1776 1785 1794 1803 1812 1821 1830 1839 1848 1857 1866 1875 1884 1893 1902 lastFailedBuild
1714 1723 1732 1741 1750 1759 1768 1777 1786 1795 1804 1813 1822 1831 1840 1849 1858 1867 1876 1885 1894 1903 lastStableBuild
1715 1724 1733 1742 1751 1760 1769 1778 1787 1796 1805 1814 1823 1832 1841 1850 1859 1868 1877 1886 1895 1904 lastSuccessfulBuild
1716 1725 1734 1743 1752 1761 1770 1779 1788 1797 1806 1815 1824 1833 1842 1851 1860 1869 1878 1887 1896 1905 lastUnstableBuild
1717 1726 1735 1744 1753 1762 1771 1780 1789 1798 1807 1816 1825 1834 1843 1852 1861 1870 1879 1888 1897 1906 lastUnsuccessfulBuild
1718 1727 1736 1745 1754 1763 1772 1781 1790 1799 1808 1817 1826 1835 1844 1853 1862 1871 1880 1889 1898 1907 legacyIds
1719 1728 1737 1746 1755 1764 1773 1782 1791 1800 1809 1818 1827 1836 1845 1854 1863 1872 1881 1890 1899 1908 permalinks
$ grep 'Could not find service' */log
1863/log:[2023-02-18T00:53:47.669Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1865/log:[2023-02-20T00:55:05.763Z] <ExceptionText>Could not find service: raven</ExceptionText>
1869/log:[2023-02-23T23:11:06.578Z] <ExceptionText>Could not find service: hummingbird</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.579Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.580Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1869/log:[2023-02-23T23:11:06.580Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1870/log:[2023-02-24T01:22:38.627Z] context: <?xml^ version="1.0" encoding="utf-8"?><ExceptionReport version="1.0.0" xmlns="http://www.opengis.net/ows/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd"> <Exception exceptionCode="NoApplicableCode" locator="NotAcceptable"> <ExceptionText>Could not find service: thredds</ExceptionText> </Exception></ExceptionReport>
1871/log:[2023-02-25T01:15:56.334Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1883/log:[2023-03-09T01:14:03.264Z] <ExceptionText>Could not find service: flyingpigeon</ExceptionText>
1884/log:[2023-03-10T01:08:27.659Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1885/log:[2023-03-11T01:04:03.039Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1891/log:[2023-03-17T00:50:30.267Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1893/log:[2023-03-19T00:52:47.966Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1895/log:[2023-03-21T01:03:05.097Z] <ExceptionText>Could not find service: finch</ExceptionText>
1898/log:[2023-03-24T01:09:52.653Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1898/log:[2023-03-24T01:09:52.653Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1899/log:[2023-03-25T01:15:08.568Z] <ExceptionText>Could not find service: finch</ExceptionText>
1899/log:[2023-03-25T01:15:08.569Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1899/log:[2023-03-25T01:15:08.569Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1902/log:[2023-03-28T01:09:04.500Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1906/log:[2023-04-01T01:19:54.171Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1906/log:[2023-04-01T01:19:54.171Z] <ExceptionText>Could not find service: thredds</ExceptionText>
1910/log:[2023-04-05T01:06:40.663Z] <ExceptionText>Could not find service: thredds</ExceptionText>
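The per-service failure counts in the grep output above can be tallied with a short script (a convenience sketch; `tally_failures` is a hypothetical helper, and the glob pattern matches the `*/log` layout shown above):

```python
import re
from collections import Counter

def tally_failures(lines):
    """Count 'Could not find service: X' occurrences per service name."""
    pattern = re.compile(r"Could not find service: (\w+)")
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# usage against the Jenkins build logs:
# import glob
# lines = (l for path in glob.glob("*/log") for l in open(path, errors="ignore"))
# print(tally_failures(lines).most_common())
```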
Just an update: the intermittent failure is detected very frequently by the stress-test.ipynb notebook during our Jenkins nightly runs.
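A minimal version of such a stress test can be sketched with concurrent requests (illustrative only; the real stress-test.ipynb lives on the PAVICS Jenkins, and the `stress` helper below is hypothetical, with the fetch function injected so the harness itself stays network-free):

```python
from concurrent.futures import ThreadPoolExecutor

def stress(fetch, url, requests=100, workers=8):
    """Fire `requests` calls to fetch(url) across `workers` threads.

    Returns the list of exceptions seen, so an intermittent failure
    shows up as a small, nonzero error count.
    """
    errors = []

    def one(_):
        try:
            fetch(url)
        except Exception as exc:
            errors.append(exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(one, range(requests)))
    return errors

# usage against the proxied endpoint:
# import urllib.request
# errs = stress(urllib.request.urlopen, "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog.html")
# print(len(errs), "failures")
```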
@tlvu If you could extract the docker-compose logs with Magpie/Twitcher outputs corresponding to those failing stress-test runs, that would most probably help pinpoint where the problem occurs.
> @tlvu If you could extract the docker-compose logs with Magpie/Twitcher outputs corresponding to those failing stress-test runs, that would most probably help pinpoint where the problem occurs.
@fmigneault
Waiting for a failure. The logs for 28-Apr are already gone. And do you want logs from docker logs twitcher or docker logs magpie?
@tlvu
Thanks. No rush. Logs from magpie, twitcher, thredds and proxy, to get the full chain of URL redirects that led to the error.
In docker logs twitcher: Could not get service with name thredds
2023-05-09 01:24:11,868 INFO [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:11,914 INFO [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:11,930 INFO [TWITCHER|magpie.adapter.magpieowssecurity:245][ThreadPoolExecutor-0_3] User None resolved with allowed 'browse' access to [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:11,930 INFO [TWITCHER:57][ThreadPoolExecutor-0_3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:11,959 INFO [TWITCHER|magpie.adapter.magpieowssecurity:230][ThreadPoolExecutor-0_0] User None is requesting 'browse' permission on [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:11,988 INFO [TWITCHER|magpie.adapter.magpieowssecurity:245][ThreadPoolExecutor-0_0] User None resolved with allowed 'browse' access to [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:11,988 INFO [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:11,990 INFO [TWITCHER|magpie.adapter.magpieowssecurity:245][ThreadPoolExecutor-0_2] User None resolved with allowed 'browse' access to [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:11,990 INFO [TWITCHER:57][ThreadPoolExecutor-0_2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,034 INFO [TWITCHER:57][ThreadPoolExecutor-0_1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,042 INFO [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,056 INFO [TWITCHER:57][ThreadPoolExecutor-0_1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,076 INFO [TWITCHER|magpie.adapter.magpieowssecurity:230][ThreadPoolExecutor-0_1] User None is requesting 'browse' permission on [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:12,077 INFO [TWITCHER:57][ThreadPoolExecutor-0_2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,078 INFO [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2023-05-09 01:24:12,094 ERROR [TWITCHER:50][ThreadPoolExecutor-0_2] Could not get service with name thredds
2023-05-09 01:24:12,122 INFO [TWITCHER|magpie.adapter.magpieowssecurity:230][ThreadPoolExecutor-0_0] User None is requesting 'browse' permission on [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:12,123 INFO [TWITCHER|magpie.adapter.magpieowssecurity:245][ThreadPoolExecutor-0_0] User None resolved with allowed 'browse' access to [/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html]
2023-05-09 01:24:12,123 INFO [TWITCHER:57][ThreadPoolExecutor-0_0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
Matching logs from the thredds container, file /usr/local/tomcat/logs/localhost_access_log.2023-05-09.txt:
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:11 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/nrcan/50MC-1yrF-gz7/tasmax/catalog.html?dataset=birdhouse%2Ftestdata%2Fnrcan%2F50MC-1yrF-gz7%2Ftasmax%2Fnrcan_canada_daily_tasmax_2009.nc HTTP/1.1" 200 4844
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
172.19.0.1 - - [09/May/2023:01:24:12 +0000] "GET /twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc HTTP/1.1" 200 4541
No errors in docker logs magpie, and the access_log from the proxy container is already gone.
So @fmigneault, that's all I've got; sorry, no stack-trace for you. If you need something else, please be specific about which command to run and where I should look.
Below are all the "Could not get service with name" type errors in docker logs twitcher. There are no stack-traces around those errors:
$ docker logs twitcher 2>&1 | grep ERROR
2023-05-04 16:53:12,723 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-04 17:29:58,124 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-05 22:20:01,798 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name malleefowl
2023-05-07 00:38:04,647 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
2023-05-07 00:50:35,185 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name flyingpigeon
2023-05-08 00:55:09,584 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name thredds
2023-05-08 00:58:43,493 ERROR [TWITCHER:50][ThreadPoolExecutor-0_1] Could not get service with name thredds
2023-05-08 01:17:03,611 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name finch
2023-05-08 09:19:19,819 ERROR [TWITCHER:50][ThreadPoolExecutor-0_2] Could not get service with name thredds
2023-05-09 01:24:12,094 ERROR [TWITCHER:50][ThreadPoolExecutor-0_2] Could not get service with name thredds
@tlvu Do you have the Twitcher logger running at DEBUG level? The default is INFO: https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/config/twitcher/default.env#L4
It seems the critical error information would be obtained from this: https://github.com/bird-house/twitcher/blob/master/twitcher/owsregistry.py#L47-L50
There's probably something raised in Magpie that propagates all the way to this step, but it is hard to investigate further without that information.
> Do you have the Twitcher logger running at DEBUG level? The default is INFO: https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/config/twitcher/default.env#L4
@fmigneault
Tried setting export TWITCHER_LOG_LEVEL=DEBUG in env.local on my dev VM, ran a Jenkins build on it, and docker logs twitcher 2>&1 | grep DEBUG shows nothing; is that normal?
There should be some URLs logged at least, because of this line that would be called on any allowed request: https://github.com/bird-house/twitcher/blob/e4c34b0e379f4a31e4da3747c4fd7d45d9bf3ab4/twitcher/owsproxy.py#LL77C19-L77C27
Found the problem. This line: https://github.com/bird-house/birdhouse-deploy/blob/0d0a8b934f764ffeedbe2b8092f12ee6f26089c9/birdhouse/config/twitcher/twitcher.ini.template#L115 should be qualname = TWITCHER, because of this: https://github.com/bird-house/twitcher/blob/e4c34b0e379f4a31e4da3747c4fd7d45d9bf3ab4/twitcher/owsregistry.py#LL5C39-L5C39
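The qualname mismatch matters because Python's logging configuration applies a level only to the logger whose name matches qualname exactly, and logger names are case-sensitive, while owsregistry.py creates its logger with logging.getLogger("TWITCHER"). A standalone sketch (assuming, for illustration, that the old template value was a lowercase twitcher):

```python
import logging

# Configure the level on the lowercase name (as the broken ini would)...
logging.getLogger("twitcher").setLevel(logging.DEBUG)

# ...but the application logs through the uppercase name, which still
# inherits the root default (WARNING), so DEBUG records are dropped.
app_logger = logging.getLogger("TWITCHER")
print(app_logger.isEnabledFor(logging.DEBUG))  # False

# Matching the qualname to the actual logger name fixes it
logging.getLogger("TWITCHER").setLevel(logging.DEBUG)
print(app_logger.isEnabledFor(logging.DEBUG))  # True
```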
DEBUG logging enabled in prod, see commit https://github.com/Ouranosinc/birdhouse-deploy/commit/9d1e37a1a091346a749f84f13a539fabbe7a7fad
2023-05-17 02:21:20,610 DEBUG [TWITCHER:49][ThreadPoolExecutor-0_3] Could not get service with name thredds
Traceback (most recent call last):
File "/opt/birdhouse/src/twitcher/twitcher/owsregistry.py", line 46, in get_service_by_name
service = self.store.fetch_by_name(name=name)
File "/opt/local/src/magpie/magpie/adapter/magpieservice.py", line 173, in fetch_by_name
service = self._fetch_by_name_cached(name)
File "/usr/local/lib/python3.7/site-packages/beaker/cache.py", line 565, in cached
return func(*args, **kwargs)
File "/opt/local/src/magpie/magpie/adapter/magpieservice.py", line 147, in _fetch_by_name_cached
service = MagpieService.by_service_name(service_name, db_session=session)
File "/opt/local/src/magpie/magpie/models.py", line 684, in by_service_name
service = session.query(Service).filter(Resource.resource_name == service_name).first()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2824, in first
return self.limit(1)._iter().first()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2919, in _iter
execution_options={"_sa_orm_load_options": self.load_options},
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1693, in execute
for idx, fn in enumerate(events_todo):
RuntimeError: deque mutated during iteration
2023-05-17 02:21:20,622 ERROR [TWITCHER:50][ThreadPoolExecutor-0_3] Could not get service with name thredds
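The RuntimeError in that traceback comes from CPython itself: a collections.deque iterator refuses to continue once the deque has been mutated, which suggests another thread is modifying a shared structure (here, SQLAlchemy's session event list) mid-iteration. A minimal single-threaded reproduction of the same error:

```python
from collections import deque

events = deque(["handler_a", "handler_b"])

try:
    for handler in events:
        # another thread appending here triggers the same failure;
        # we simulate it in-line for determinism
        events.append("handler_c")
except RuntimeError as exc:
    print(exc)  # deque mutated during iteration
```

This would explain the intermittency: the error only appears when two Twitcher worker threads hit the cached Magpie session concurrently, exactly the situation the stress test creates.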
Describe the bug
We have noticed intermittent NetCDF read errors when accessing OPeNDAP links in the PAVICS JupyterHub. Approximate date when the problem began: ~March 1, 2023.
To Reproduce
The intermittent nature of this problem makes reproducing it somewhat difficult, but a public notebook on the PAVICS server is available here: https://pavics.ouranos.ca/jupyter/hub/user-redirect/lab/tree/public/logan-public/Tests/THREDDS_Issues_March2023/Random_Thredds_read_errors.ipynb
The notebook executes a relatively large workflow and uses multiple dask worker processes to increase the likelihood of a read error.
Multiple notebook runs (only ~5-6, due to the time needed) have shown that bypassing Twitcher (i.e. THREDDS behind the nginx proxy only) always allows successful completion of the calculations, whereas accessing OPeNDAP links behind the nginx/Twitcher combination typically results in a read error relatively early in the workflow.
Although not quantified, there also seems to be a general performance hit when accessing data via nginx/Twitcher (printed execution times in the workflow loop are 25-40 seconds with nginx/Twitcher versus 18-30 seconds with nginx only). Note also that the notebook runs the 'nginx-only' code first, so I do not believe the performance difference comes from data caching; if anything, caching should benefit the later 'twitcher/nginx' run.
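The reported slowdown could be quantified with a small timing harness (an illustrative sketch; `time_fetches` is a hypothetical helper, with the fetch function injected so the harness itself stays network-free):

```python
import time

def time_fetches(fetch, url, repeats=5):
    """Return the elapsed seconds of each of `repeats` calls to fetch(url)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch(url)
        timings.append(time.perf_counter() - start)
    return timings

# usage: compare the proxied and direct endpoints, e.g.
# import urllib.request
# proxied = time_fetches(urllib.request.urlopen,
#                        "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/...")
# direct = time_fetches(urllib.request.urlopen,
#                       "https://pavics.ouranos.ca/testthredds/...")
```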
Expected behavior
Execution of the code without read errors.