Unidata / siphon

Siphon - A collection of Python utilities for retrieving atmospheric and oceanic data from remote sources, focusing on being able to retrieve data from Unidata data technologies, such as the THREDDS data server.
https://unidata.github.io/siphon
BSD 3-Clause "New" or "Revised" License

TDSCatalog does not include base path in access_url #724

Open jm-cook opened 1 year ago

jm-cook commented 1 year ago

TDSCatalog is not constructing the access_url correctly when there is a base path.

from siphon.catalog import TDSCatalog
cat_url = "https://opendap1-test.nodc.no/opendap/hyrax/DSG/Physics/project/SmartOcean/Austevoll-Nord/catalog.xml"
cat = TDSCatalog(cat_url)
for cds in cat.datasets:
    url = cat.datasets[cds].access_urls['dap']
    print(f'url = {url}')

The result is

url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202205.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202206.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202207.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202209.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202210.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202211.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202212.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202301.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202302.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202303.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202304.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202305.nc
url = https://opendap1-test.nodc.no/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202306.nc

but 'opendap/hyrax' is not in the path.

The correct path should be:

https://opendap1-test.nodc.no/opendap/hyrax/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202205.nc
... etc

In catalog.xml, this is given as base="/opendap/hyrax" for the dap service:

<thredds:catalog xmlns:thredds="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:bes="http://xml.opendap.org/ns/bes/1.0#">
  <thredds:service name="dap" serviceType="OPeNDAP" base="/opendap/hyrax"/>
  <thredds:service name="file" serviceType="HTTPServer" base="/opendap/hyrax"/>
  <thredds:service name="WCS-coads" serviceType="WCS" base="/opendap/wcs"/>
  <thredds:dataset name="/DSG/Physics/project/SmartOcean/Austevoll-Nord" ID="/opendap/hyrax/DSG/Physics/project/SmartOcean/Austevoll-Nord/">
    <thredds:dataset name="NMDC_AR_MO_Austevoll-Nord_202205.nc" ID="/opendap/hyrax/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202205.nc">
      <thredds:dataSize units="bytes">1145638</thredds:dataSize>
      <thredds:date type="modified">2023-06-23T09:28:13</thredds:date>
      <thredds:access serviceName="dap" urlPath="/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202205.nc"/>
      <thredds:access serviceName="file" urlPath="/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202205.nc"/>
    </thredds:dataset>
...
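
For anyone wanting to check which pieces a client has to work with, here is a minimal sketch that reads the service base and the urlPath from a trimmed, well-formed copy of the catalog above using only the standard library. The trimmed XML (with the ID attributes shortened to "...") and the variable names are assumptions for illustration; this is not how Siphon parses catalogs.

```python
import xml.etree.ElementTree as ET

# Namespace used by the catalog elements in the excerpt above.
NS = {"thredds": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Trimmed copy of the catalog, closed off so that it parses (illustration only).
CATALOG_XML = """\
<thredds:catalog xmlns:thredds="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <thredds:service name="dap" serviceType="OPeNDAP" base="/opendap/hyrax"/>
  <thredds:dataset name="/DSG/Physics/project/SmartOcean/Austevoll-Nord" ID="...">
    <thredds:dataset name="NMDC_AR_MO_Austevoll-Nord_202205.nc" ID="...">
      <thredds:access serviceName="dap" urlPath="/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202205.nc"/>
    </thredds:dataset>
  </thredds:dataset>
</thredds:catalog>
"""

root = ET.fromstring(CATALOG_XML)

# Map each service name to its base path.
service_bases = {s.get("name"): s.get("base") for s in root.findall("thredds:service", NS)}

# Pair every access element's service base with its urlPath.
pairs = [
    (service_bases[a.get("serviceName")], a.get("urlPath"))
    for a in root.iterfind(".//thredds:access", NS)
]

for base, url_path in pairs:
    print(f"base = {base}  urlPath = {url_path}")
```

Both the base and the urlPath start with a slash here, which matters for how they are combined later.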

Python version: 3.8.8
Siphon version: 0.9

jm-cook commented 1 year ago

This might be related to https://github.com/Unidata/siphon/issues/114, but wasn't that fixed?

jm-cook commented 1 year ago

I looked into this further, since I was curious why the fix for #114 does not solve my issue. I ran my small test against the oceandata OPeNDAP catalog linked in #114 (now updated to https://oceandata.sci.gsfc.nasa.gov/opendap/SeaWiFS/L3SMI/2000/0101/catalog.xml) and see the same issue, i.e. opendap/hyrax is not inserted into the URL.

What is happening is this: when access_urls is constructed in make_access_urls(), server_base is correctly constructed as 'https://oceandata.sci.gsfc.nasa.gov/opendap/hyrax' and url_path is obtained as the absolute path /SeaWiFS/L3SMI/2000/0101/SEASTAR_SEAWIFS_GAC.20000101.L3m.DAY.CHL.chlor_a.9km.nc, but then on the next line

    access_urls[subservice.service_type] = urljoin(server_base, self.url_path)

urljoin treats the leading-slash url_path as an absolute path and replaces the whole path of server_base with it, so the opendap/hyrax part of server_base is lost.

This seems to be how Hyrax presents the paths (i.e. with a leading slash).
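
The urljoin behaviour can be reproduced in isolation with just the standard library. This is a small sketch using the server_base and url_path values quoted above; the variable names joined and concatenated are mine, not Siphon's.

```python
from urllib.parse import urljoin

server_base = "https://oceandata.sci.gsfc.nasa.gov/opendap/hyrax"
url_path = "/SeaWiFS/L3SMI/2000/0101/SEASTAR_SEAWIFS_GAC.20000101.L3m.DAY.CHL.chlor_a.9km.nc"

# A reference that starts with "/" replaces the entire path of the base URL,
# so "/opendap/hyrax" is dropped here.
joined = urljoin(server_base, url_path)
print(joined)

# Plain string concatenation keeps the service base in the result.
concatenated = server_base + url_path
print(concatenated)
```

The first print shows the URL without opendap/hyrax, matching what access_urls currently returns; the second keeps it.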

According to the documentation here: https://docs.unidata.ucar.edu/tds/5.2/userguide/basic_client_catalog.html, the service base, urlPath, and dataset should be concatenated together.
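
As a rough sketch of what concatenation could look like, the helper below resolves the service base against the catalog URL and then appends the urlPath as a string instead of passing it to urljoin. The function name build_access_url and the slash normalisation are my own assumptions for illustration; this is not Siphon's code or a proposed patch.

```python
from urllib.parse import urljoin


def build_access_url(catalog_url, service_base, url_path):
    """Hypothetical helper: combine service base and urlPath by concatenation."""
    # Resolving the base against the catalog URL handles both absolute
    # bases ("https://host/...") and server-relative ones ("/opendap/hyrax").
    resolved_base = urljoin(catalog_url, service_base)
    # Join with exactly one slash so that a leading "/" on url_path
    # cannot discard the base path.
    return resolved_base.rstrip("/") + "/" + url_path.lstrip("/")


cat_url = "https://opendap1-test.nodc.no/opendap/hyrax/DSG/Physics/project/SmartOcean/Austevoll-Nord/catalog.xml"
url = build_access_url(
    cat_url,
    "/opendap/hyrax",
    "/DSG/Physics/project/SmartOcean/Austevoll-Nord/NMDC_AR_MO_Austevoll-Nord_202205.nc",
)
print(f"url = {url}")
```

With the values from my first catalog this gives the URL I listed as the correct path. The same helper also works for a base with a trailing slash and a urlPath without a leading one, since the slashes are normalised before joining.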

The code below shows the incorrectly constructed access_urls (the urls created cannot be accessed):

from siphon.catalog import TDSCatalog
cat_url = "https://oceandata.sci.gsfc.nasa.gov/opendap/SeaWiFS/L3SMI/2000/0101/catalog.xml"
cat = TDSCatalog(cat_url)
for cds in cat.datasets:
    url = cat.datasets[cds].access_urls['dap']
    print(f'access url = {url}')