ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

Unacceptable character in recipe_eady_growth_rate.yml #2652

Closed bouweandela closed 2 years ago

bouweandela commented 2 years ago

With ESMValTool v2.5 and this conda environment on the Levante JupyterHub, I get the following error message when I run:

import esmvalcore.experimental as esmvaltool
all_recipes = esmvaltool.get_all_recipes()
---------------------------------------------------------------------------
ReaderError                               Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 all_recipes = esmvaltool.get_all_recipes()
      2 all_recipes

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/esmvalcore/experimental/utils.py:62, in get_all_recipes(subdir)
     60 rootdir = DIAGNOSTICS.recipes
     61 files = rootdir.glob(f'{subdir}/*.yml')
---> 62 return RecipeList(Recipe(file) for file in files)

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/esmvalcore/experimental/utils.py:62, in <genexpr>(.0)
     60 rootdir = DIAGNOSTICS.recipes
     61 files = rootdir.glob(f'{subdir}/*.yml')
---> 62 return RecipeList(Recipe(file) for file in files)

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/esmvalcore/experimental/recipe.py:42, in Recipe.__init__(self, path)
     40 self._data: Optional[Dict] = None
     41 self.last_session: Optional[Session] = None
---> 42 self.info = RecipeInfo(self.data, filename=self.path.name)

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/esmvalcore/experimental/recipe.py:74, in Recipe.data(self)
     72 """Return dictionary representation of the recipe."""
     73 if self._data is None:
---> 74     self._data = yaml.safe_load(open(self.path, 'r'))
     75 return self._data

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/yaml/__init__.py:125, in safe_load(stream)
    117 def safe_load(stream):
    118     """
    119     Parse the first YAML document in a stream
    120     and produce the corresponding Python object.
   (...)
    123     to be safe for untrusted input.
    124     """
--> 125     return load(stream, SafeLoader)

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/yaml/__init__.py:79, in load(stream, Loader)
     74 def load(stream, Loader):
     75     """
     76     Parse the first YAML document in a stream
     77     and produce the corresponding Python object.
     78     """
---> 79     loader = Loader(stream)
     80     try:
     81         return loader.get_single_data()

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/yaml/loader.py:34, in SafeLoader.__init__(self, stream)
     33 def __init__(self, stream):
---> 34     Reader.__init__(self, stream)
     35     Scanner.__init__(self)
     36     Parser.__init__(self)

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/yaml/reader.py:85, in Reader.__init__(self, stream)
     83 self.eof = False
     84 self.raw_buffer = None
---> 85 self.determine_encoding()

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/yaml/reader.py:135, in Reader.determine_encoding(self)
    133         self.raw_decode = codecs.utf_8_decode
    134         self.encoding = 'utf-8'
--> 135 self.update(1)

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/yaml/reader.py:169, in Reader.update(self, length)
    167     data = self.raw_buffer
    168     converted = len(data)
--> 169 self.check_printable(data)
    170 self.buffer += data
    171 self.raw_buffer = self.raw_buffer[converted:]

File ~/.conda/envs/esmvaltool/lib/python3.10/site-packages/yaml/reader.py:143, in Reader.check_printable(self, data)
    141 character = match.group()
    142 position = self.index+(len(self.buffer)-self.pointer)+match.start()
--> 143 raise ReaderError(self.name, position, ord(character),
    144         'unicode', "special characters are not allowed")

ReaderError: unacceptable character #x0080: special characters are not allowed
  in "/home/k/k206100/.conda/envs/esmvaltool/lib/python3.10/site-packages/esmvaltool/recipes/recipe_eady_growth_rate.yml", position 353

bouweandela commented 2 years ago

I think it's the dash on this line: https://github.com/ESMValGroup/ESMValTool/blob/d75b479d42154676ce39fa8e06b97081aeb40c1f/esmvaltool/recipes/recipe_eady_growth_rate.yml#L10

It's a bit puzzling that I do not get this error in other environments or on other machines.

valeriupredoi commented 2 years ago

this-will-not-stand-man

valeriupredoi commented 2 years ago

But seriously now, it's how YAML reads encoded text: only the printable characters from UTF-8 are allowed, see this Stack Overflow post. It is interesting that this isn't picked up anywhere else. Are you using an older pyyaml?
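
For illustration, here is a minimal sketch of the kind of check that produces this error. It uses only the standard library and copies the printable character ranges from the YAML specification; the names NON_PRINTABLE and first_unacceptable are made up for this example and this is not claimed to be PyYAML's actual code.

```python
import re

# Anything outside YAML's printable set (the c-printable production in the
# YAML spec). NON_PRINTABLE and first_unacceptable are hypothetical names
# used only in this sketch.
NON_PRINTABLE = re.compile(
    "[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD"
    "\U00010000-\U0010FFFF]"
)


def first_unacceptable(text):
    """Return (position, code point) of the first non-printable character, or None."""
    match = NON_PRINTABLE.search(text)
    if match is None:
        return None
    return match.start(), f"#x{ord(match.group()):04x}"


# The en dash itself is printable, so correctly decoded text passes.
print(first_unacceptable("1854\u20131864"))
# The same three bytes decoded as ISO-8859-1 contain U+0080, which is rejected.
print(first_unacceptable("1854\xe2\x80\x931864"))
```

The second call reports #x0080, the same code point as in the ReaderError above, which suggests the file is being decoded with the wrong encoding rather than containing a bad character.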

bouweandela commented 2 years ago

No, pyyaml 6.0, and the same version works fine on my own computer. I suspect it again has something to do with character encoding and how that is configured through environment variables.

valeriupredoi commented 2 years ago

gah! We should really make a complete move to ruamel :+1:

zklaus commented 2 years ago

It's not the dash per se, but rather the invisible PAD character. Here's a hex dump of that line:

0000000                   J   o   u   r   n   a   l       o   f       t
           2020    2020    6f4a    7275    616e    206c    666f    7420
0000020   h   e       a   t   m   o   s   p   h   e   r   i   c       s
           6568    6120    6d74    736f    6870    7265    6369    7320
0000040   c   i   e   n   c   e   s   ,       4   7   (   1   5   )   :
           6963    6e65    6563    2c73    3420    2837    3531    3a29
0000060   1   8   5   4 342 200 223   1   8   6   4   ,       1   9   9
           3831    3435    80e2    3193    3638    2c34    3120    3939
0000100   0   .   )   .  \n
           2e30    2e29    000a
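
To make the dump concrete: the octal values 342 200 223 are the bytes 0xE2 0x80 0x93. The sketch below, which only assumes those three bytes from the dump, decodes them once as UTF-8 and once as ISO-8859-1.

```python
import unicodedata

# The three bytes between "1854" and "1864" in the hex dump above.
raw = bytes([0xE2, 0x80, 0x93])

# Decoded as UTF-8, the bytes form a single character.
as_utf8 = raw.decode("utf-8")
# Decoded as ISO-8859-1, every byte becomes its own character.
as_latin1 = raw.decode("iso-8859-1")

print([hex(ord(c)) for c in as_utf8], unicodedata.name(as_utf8))
print([hex(ord(c)) for c in as_latin1])
```

Read as UTF-8 this is one EN DASH (U+2013); read as ISO-8859-1 it becomes three characters, the middle one being U+0080.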

valeriupredoi commented 2 years ago

such high values of 2.e+30 will not stand, man! :laughing: What's a PAD character, Klaus? Nevermind, it really is a padding character :man_facepalming:

bouweandela commented 2 years ago

It is something to do with our code assuming that the files are encoded in 'utf-8' instead of saying so everywhere explicitly. When I run

import locale
locale.getpreferredencoding(False)

I get 'ISO-8859-1' on the Levante notebook server instead of the usual 'UTF-8'. As a result, the bytes in the file are decoded incorrectly. This code

from pathlib import Path

file = Path("/home/k/k206100/.conda/envs/esmvaltool/lib/python3.10/site-packages/esmvaltool/recipes/recipe_eady_growth_rate.yml")
txt = file.read_text()
for i, char in enumerate(txt[350:355]):
    print(350+i, char, hex(ord(char)))

produces

350 5 0x35
351 4 0x34
352 â 0xe2
353 € 0x80
354 “ 0x93

while after running

import locale
locale.setlocale(locale.LC_CTYPE, 'en_US.UTF-8')

this results in

350 5 0x35
351 4 0x34
352 – 0x2013
353 1 0x31
354 8 0x38

bouweandela commented 2 years ago

@valeriupredoi Remember #973?

Maybe it is time we fix this problem. Apparently open takes an encoding='utf-8' argument, so if we specify that everywhere we open a text file in ESMValCore, the problem should go away.
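
A rough sketch of what that could look like, using a temporary file in place of a real recipe. The helper name read_recipe_text and the file name recipe_example.yml are hypothetical; this is not the actual ESMValCore code.

```python
import tempfile
from pathlib import Path


def read_recipe_text(path):
    """Read a text file as UTF-8, regardless of the locale's preferred encoding."""
    # Hypothetical helper: the explicit encoding is the only point of interest.
    with open(path, "r", encoding="utf-8") as file:
        return file.read()


with tempfile.TemporaryDirectory() as tmpdir:
    # Stand-in for a recipe that contains an en dash, stored as UTF-8 bytes.
    recipe = Path(tmpdir) / "recipe_example.yml"
    recipe.write_bytes("references: 1854\u20131864\n".encode("utf-8"))

    # Explicit UTF-8: the en dash survives on any machine.
    text = read_recipe_text(recipe)
    # What a server with an ISO-8859-1 locale would effectively do by default.
    legacy = recipe.read_text(encoding="iso-8859-1")

print("\u2013" in text, "\x80" in text)
print("\u2013" in legacy, "\x80" in legacy)
```

With the explicit argument the result no longer depends on what locale.getpreferredencoding(False) returns on the machine running the code.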

valeriupredoi commented 2 years ago

yes, that thing again! But I cordially protest we should be using en_GB.UTF-8, please, mate :gb: :beer:

bouweandela commented 2 years ago

Thanks for the help gents!

valeriupredoi commented 2 years ago

I helped by posting a Big Lebowski meme :laughing: