earthobservations / wetterdienst

Open weather data for humans.
https://wetterdienst.readthedocs.io/
MIT License

`MetaFileNotFound` because of missing root certificates #827

Closed scherbinek closed 1 year ago

scherbinek commented 1 year ago

Hey!

I got the same error as described in https://github.com/earthobservations/wetterdienst/issues/678

Describe the bug

Traceback (most recent call last):
  File "/opt/airflow/dags/dwd_kl_daily.py", line 67, in <module>
    r1 = DwdObservationRequest(
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/core/scalar/request.py", line 624, in all
    df = self._all().copy().reset_index(drop=True)
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/provider/dwd/observation/api.py", line 561, in _all
    df = create_meta_index_for_climate_observations(dataset, self.resolution, period)
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/provider/dwd/observation/metaindex.py", line 84, in create_meta_index_for_climate_observations
    meta_index = _create_meta_index_for_climate_observations(dataset, resolution, period)
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/provider/dwd/observation/metaindex.py", line 142, in _create_meta_index_for_climate_observations
    meta_file = _find_meta_file(files_server, url, ["beschreibung", "txt"])
  File "/home/airflow/.local/lib/python3.9/site-packages/wetterdienst/provider/dwd/observation/metaindex.py", line 170, in _find_meta_file
    raise MetaFileNotFound(f"No meta file was found amongst the files at {url}.")
wetterdienst.exceptions.MetaFileNotFound: No meta file was found amongst the files at https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/.

To Reproduce

Nothing special. Just a simple request, which works locally on my computer.

from os import environ
environ['WD_CACHE_DISABLE'] = 'True'
from wetterdienst import Settings
from wetterdienst.provider.dwd.observation import DwdObservationRequest

Settings.cache_disable = True
r1 = DwdObservationRequest(
                  parameter=['climate_summary'],
                  resolution='daily',
                  period='recent'
).all()

Additional context

The script works perfectly fine on my local computer, but crashes with the above-mentioned error on a server instance, within a Docker container running Apache Airflow. I already switched off the cache to avoid any issues. However, wetterdienst.info() refers to a location at /home/airflow/.cache/wetterdienst, which doesn't exist as a folder (and creating the wetterdienst folder didn't solve it). The airflow log refers to wetterdienst.util.fsspec_monkeypatch - INFO - Dircache located at /root/.cache/wetterdienst which doesn't exist as folder (and wasn't solved by creating the wetterdienst folder).

It seems that fsspec tries to resolve a cache directory for parsing the metadata file from the URL, but receives an empty list of files, which leads to the error, and doesn't even try to request the content of the URL. The dircache at /root/.cache/ seems to be misleading, as the process shouldn't be started as root. So my best guess is some authorization issue in a Linux-based context, related to the fsspec_monkeypatch cache.
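A minimal sketch of why an empty file listing would surface as MetaFileNotFound, assuming the lookup simply filters the listing by name fragments. The function and the exception class below are hypothetical stand-ins written for illustration, not the actual wetterdienst implementation; only the exception name, the search strings, and the message text are taken from the traceback above, and the file name is just an example.

```python
URL = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/"


class MetaFileNotFound(FileNotFoundError):
    """Stand-in for wetterdienst.exceptions.MetaFileNotFound (assumption)."""


def find_meta_file(files, url, strings):
    """Return the first file whose name contains all given fragments (hypothetical helper)."""
    for file in files:
        name = file.split("/")[-1].lower()
        if all(fragment in name for fragment in strings):
            return file
    # An empty listing ends up here, whatever the reason for it being empty.
    raise MetaFileNotFound(f"No meta file was found amongst the files at {url}.")


# Example listing with one matching file name (illustrative).
listing = [URL + "KL_Tageswerte_Beschreibung_Stationen.txt"]
print(find_meta_file(listing, URL, ["beschreibung", "txt"]))

# An empty listing produces the same message as in the traceback.
try:
    find_meta_file([], URL, ["beschreibung", "txt"])
except MetaFileNotFound as ex:
    print(ex)
```

Under this assumption, the error message says nothing about why the listing is empty, so a failed HTTPS request and a genuinely empty directory would look the same.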

I'll give it another try tomorrow. I will try to debug the issue and share my results, but I am thankful for any hints. Initially, I tried to search for an environment variable to override the fsspec cache.

Regards, Marcel

amotl commented 1 year ago

Dear @scherbinek,

thank you for the excellent report. @larsrinn recently reported a similar thing at #704: the cache control environment variables WD_CACHE_DISABLE and WD_CACHE_DIR, which were introduced with version 0.18.0, would not be honored correctly.

However, when looking for them in the current state of the code base, I can not find either of them. It looks like 9c7cee5940 got lost somehow? Do you have any clue about it, @gutzbenj?

With kind regards, Andreas.

amotl commented 1 year ago

Oh, the code is there, but because the prefix WD_ is handled in a separate line of code, I have not been able to spot it.

https://github.com/earthobservations/wetterdienst/blob/f6d82088891e00e57dcfa5d7c71e98993690713d/wetterdienst/settings.py#L21-L29

amotl commented 1 year ago

Oh, and I also spotted this one. Not sure whether use_listings_cache=True is "always on" here, even when running with cache disabled?

https://github.com/earthobservations/wetterdienst/blob/f6d82088891e00e57dcfa5d7c71e98993690713d/wetterdienst/util/network.py#L39-L46

Edit: I've addressed this with GH-828, but I think this is only a cosmetic issue, and not responsible for any functional flaw.

amotl commented 1 year ago

I've exercised your scenario using the following program, using Wetterdienst 0.50.0, on both macOS and within a Docker container.

#
# Synopsis:
#
#   docker run --rm -it python:3.10-bullseye bash
#   pip install wetterdienst
#   python example-827.py
#
import logging

from wetterdienst import Settings
from wetterdienst.provider.dwd.observation import DwdObservationRequest

logger = logging.getLogger(__name__)

def process():
    Settings.cache_disable = True
    r1 = DwdObservationRequest(
                      parameter=['climate_summary'],
                      resolution='daily',
                      period='recent'
    ).all()
    print(r1)

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    process()

Using Settings.cache_disable = True, to turn off caching, works perfectly well for me [^1], I am able to confirm that no directory has been created at either /Users/amo/Library/Caches/wetterdienst (macOS) or /root/.cache/wetterdienst (Linux/Docker), after running that program.

Maybe you can share more details about your Docker environment, as driven by Airflow? Maybe some special parameters or options are being used?

Which versions of Wetterdienst and Docker are you running?

[^1]: so does environ['WD_CACHE_DISABLE'] = 'True'.

amotl commented 1 year ago

Maybe it was really just an upstream error / fluke?

wetterdienst.exceptions.MetaFileNotFound: No meta file was found amongst the files at https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/.

amotl commented 1 year ago

The airflow log refers to wetterdienst.util.fsspec_monkeypatch - INFO - Dircache located at /root/.cache/wetterdienst which doesn't exist as folder (and wasn't solved by creating the wetterdienst folder).

That log message was misleading, it will be fixed with GH-828. Thank you.

scherbinek commented 1 year ago

Hi @amotl

Thank you for your detailed analysis and description. I also tested your Docker setup, including example-827.py, and can confirm that it works. It even works with my server setup when run locally... but it throws the mentioned error on my server. Testing it step by step, both locally and on my server, led me to the actual error.

The error seemed so simple that I curled the website on my server, and everything was fine. But I hadn't tried to curl the requested website https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/ from within my Docker setup on the server.

airflow@653d5258586b:/opt/airflow/dags$ curl https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/recent/
curl: (77) error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: /etc/ssl/certs

To provide a secure connection, I use SSL certificates on my server, and I mounted them into the Docker container as well. Locally, I had commented that out, because it runs on localhost.

volumes:

But as I only mount the server's /usr/share/pki/trust/anchors, without any CA certificates for verifying the SSL certificates of other websites, I can't request the website with the default SSL verification. Thus, I only have to mount my server's certificates to a folder other than /etc/ssl/certs, because mounting there overwrites all the CA certificates of the Docker container.

- /usr/share/pki/trust/anchors:/usr/share/pki/trust/anchors

And. It works. It was a tricky one, as I never thought that the URL couldn't be verified, and that this would surface as the initially mentioned error. Additionally, it is not the first URL I request: I also use the Genesis API of the Statistisches Bundesamt, without problems.
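The curl failure above can also be checked from Python. The following is a minimal sketch, assuming only the standard library: it reports where the ssl module looks for CA certificates by default and whether those locations exist. The helper name describe_ca_locations is made up for this illustration.

```python
import os
import ssl


def describe_ca_locations() -> dict:
    """Report the default CA file and directory of the ssl module, and whether they exist."""
    paths = ssl.get_default_verify_paths()
    # `cafile` and `capath` are None when the location does not exist,
    # so fall back to the paths compiled into OpenSSL for reporting.
    cafile = paths.cafile or paths.openssl_cafile
    capath = paths.capath or paths.openssl_capath
    return {
        "cafile": cafile,
        "cafile_exists": bool(cafile) and os.path.isfile(cafile),
        "capath": capath,
        "capath_exists": bool(capath) and os.path.isdir(capath),
    }


for key, value in describe_ca_locations().items():
    print(f"{key}: {value}")
```

If neither location exists inside the container, or the directory only holds the server's own certificates, HTTP clients that rely on the system store will likely fail to verify public websites.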

Hopefully this issue can serve as a hint for setups similar to mine. Sorry for any inconvenience, as this feels self-inflicted. It makes me like dealing with SSL and certificates even less. The issue can be closed, unless you have any follow-up questions.

Regards, Marcel

amotl commented 1 year ago

Hi @scherbinek,

thank you for your response, I am happy it works for you now. However, I will reopen this issue, because I would like to investigate if we should include the certifi package as a dependency, and if this would have improved the situation in your case.

With kind regards, Andreas.

gutzbenj commented 1 year ago

I think we can close this.

certifi is already indirectly among our dependencies (probably through fsspec/requests), and the issue can't be resolved by installing certifi, but rather by linking it to the system-installed certificates.
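To illustrate the difference between having certifi installed and actually using it, here is a minimal sketch. It assumes nothing about how wetterdienst or fsspec configure TLS; the helper make_verifying_context is hypothetical. It prefers an explicitly given CA file, then the certifi bundle if that package is importable, and otherwise falls back to the system store.

```python
import ssl
from typing import Optional


def make_verifying_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying SSL context (hypothetical helper, for illustration only)."""
    if cafile is None:
        try:
            # certifi ships Mozilla's root certificate bundle as a file.
            import certifi

            cafile = certifi.where()
        except ImportError:
            # Without certifi, cafile stays None and the system store is used.
            cafile = None
    return ssl.create_default_context(cafile=cafile)


context = make_verifying_context()
print(context.verify_mode, context.check_hostname)
```

The bundle is only used when a context like this one is handed to the HTTP client explicitly. Having the package installed does not, by itself, change where the ssl module looks.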

amotl commented 1 year ago

The issue can't be resolved by installing certifi but rather by linking it to system installed certificates.

I was about to agree, but wasn't fully convinced [^1], so I just looked up the topic in the corresponding urllib3 and aiohttp documentation.

urllib3

It looks like there is an option to make urllib3 use the certificates from the certifi package, and it is well documented.

Unless otherwise specified urllib3 will try to load the default system certificate stores. The most reliable cross-platform method is to use the certifi package which provides Mozilla’s root certificate bundle.

Once you have certificates, you can create a PoolManager that verifies certificates when making requests:

>>> import certifi
>>> import urllib3
>>> http = urllib3.PoolManager(
...     cert_reqs='CERT_REQUIRED',
...     ca_certs=certifi.where()
... )

-- https://urllib3.readthedocs.io/en/stable/user-guide.html#certificate-verification

[^1]: I mean, what would be the point of providing the certificates per Python package then, if you can't make Python actually use it?

aiohttp

It looks like aiohttp does not document how to use certificates from certifi. Evaluate "aiohttp" with "certifi" has a corresponding example program, its gist is:

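import asyncio
import ssl

import aiohttp
import certifi

async def main():
    # Build an SSL context that trusts the CA bundle shipped with certifi.
    sslcontext = ssl.create_default_context(cafile=certifi.where())
    async with aiohttp.ClientSession() as session:
        async with session.get("https://www.hrw.org/", ssl=sslcontext) as response:
            print(response.status)

asyncio.run(main())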

Do you think we should carry that information forward to both the aiohttp and the fsspec projects, to improve their documentation and their internals?

gutzbenj commented 1 year ago

Sure! But my honest opinion is: I've only seen this error once, on a managed machine at work, and if you get this error, probably nothing else works either.

Usually, if you install Python (and maybe requests afterwards), everything should work out of the box. If it doesn't, we wouldn't be able to provide any help, and neither would aiohttp. Rather, the user would have to make sure that the certificates on the machine are correctly installed.

gutzbenj commented 1 year ago

Closing this, as it is not related to anything on our end.