geomagpy / magpy

MagPy (or GeomagPy) is a Python package for analysing and displaying geomagnetic data.
BSD 3-Clause "New" or "Revised" License
46 stars 27 forks source link

Locale problem can cause strange behaviour #117

Closed stephanbracke closed 2 years ago

stephanbracke commented 2 years ago

In the stream.py we try to change locale to english to be able to make the pattern matching %b work. However switching locale is a source of errors and in the try except errors are often hidden here resulting in very difficult to debug code. First of all if we switch to 'en_US' this locale needs to be present on the client machine otherwise this will cause an hidden error. Also if we only change the LC_TIME locale it is better to get it in the old_locale because the get_locale() can be different from the get_locale(locale.LC_TIME). so https://github.com/geomagpy/magpy/blob/ae91462f1644ebf89fbe570ce1cf67d1d89eb926/magpy/stream.py#L12892-L12896 old_locale should be changed into

    old_locale = locale.getlocale(locale.LC_TIME)

Further on on windows I got as old_locale = ('fr_BE', 'ISO8859-1') using this one in the set_locale afterwards resulted always in locale.Error: unsupported locale setting As you assume that locale should be for the LC_TIME in en_US UTF-8 I added the statement locale.setlocale(locale.LC_TIME, ('en_US', 'UTF-8')) in the init of the application ensuring me that magpy is always using en_US for the LC_TIME
I just added the statement here https://github.com/geomagpy/magpy/blob/ae91462f1644ebf89fbe570ce1cf67d1d89eb926/magpy/gui/magpy_gui.py#L7237-L7243

I added it as first line. A better solution is to use a regular expression to parse the needed string so it will be locale indepent but this is a bit more work. So the prerequisite of magpy should be stated to have the locale en_US,UTF-8 installed on your machine. An extra global remark is to avoid hiding errors by always logging the error and ripple errors to the highest level so users will see them appear. This will facilitate the debugging and will not hide wrong behaviour

leonro commented 2 years ago

My preference would be to remove switching locale at all. strptime should use en_US by default anyway, no matter what locale is used on the system.

stephanbracke commented 2 years ago

strptime is dependent on the locale so if we want to remove it, in this partically code we should not use strptime. In this partical code we try to match the IMFV format https://github.com/geomagpy/magpy/blob/ae91462f1644ebf89fbe570ce1cf67d1d89eb926/magpy/stream.py#L12901

I quickly tried on my machine by setting locale to de_BE

    locale.setlocale(locale.LC_TIME, ('de_BE', 'UTF-8'))
    date = dt.datetime.strptime("OCT3121", '%b%d%y')
    print(date)
   results in
   ValueError: time data 'OCT3121' does not match format '%b%d%y
leonro commented 2 years ago

Yes you can force strptime to use another locale. But if you don't do that it seems to use english as default:

          >>> locale.getlocale()
          ('de_DE', 'UTF-8')
          >>> xdate = datetime(2022,10,11)
          >>> datetime.strftime(xdate,'%b %d %Y')
          'Oct 11 2022'

For de_DE it should be "Okt", which it isnt. Within the code I am forcing it to english although this does not seem to be necessary. (Python 3.8.10)

stephanbracke commented 2 years ago

Ok, I will test it next week on windows by removing the locale stuff and keep you informed

leonro commented 2 years ago

See 0c3913c8ccfc9d0f0c88696e19ce4174b32de74b . Still needs to be checked however

stephanbracke commented 2 years ago

The assumption of LC_TIME will always take en_US is not a valid one. When you tested locale.getlocale() in a console window it gives you the general setting of the locale but when you do locale.getlocale(locale.LC_TIME) it is not necessary the same. In a console this results often ( but no garantee) in (None, None). In this case you can go from the fact he will fallback into en_US. But on windows with the xwindows module I printed out the locale.getlocale(locale.LC_TIME) and it gave ('fr_BE', 'ISO8859-1') and using the code from the branch origin/gui-fixes I still got an error

Traceback (most recent call last):
  File "C:\Users\stbracke\Desktop\MagPy\python-3.9.8.amd64\lib\site-packages\magpy\stream.py", line 12920, in extractDateFromString
    date = datetime.strptime(daystring[-7:], '%b%d%y')
  File "C:\Users\stbracke\Desktop\MagPy\python-3.9.8.amd64\lib\_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "C:\Users\stbracke\Desktop\MagPy\python-3.9.8.amd64\lib\_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data 'JAN0121' does not match format '%b%d%y'

The easiest way to fix it is to add the locale.setlocale(locale.LC_TIME, ('en_US', 'UTF-8')) into the on_init method and let it fail with a clean error if the locale is not installed I will further on explain how we could do it more cleanly but it will be more work and more testing

stephanbracke commented 2 years ago

If we want to make everything independent of the locale we need to replace it on every place where we use the %b pattern this is in total 19 places in different files. usage_%b

We can do that by replacing the 3 letter month with its corresponding number ( with leading zeroes 01 etc) and replace the pattern %b with %m which is independent of locale. Here two ways to do this with enum or with dictionary

def test_with_dict():
    #locale.setlocale(locale.LC_TIME, ('de_BE', 'UTF-8'))
    months={ 'JAN':'01', 'FEB' : '02','MAR' : '03','APR' :'04','MAY' : '05',
             'JUN' : '06', 'JUL' : '07','AUG' : '08','SEP' : '09','OCT' : '10','NOV' : '11',
             'DEC' : '12' }
    test = 'DEC0421'

    new_date = months[test[0:3].upper()] + test[3:]

    date = dt.datetime.strptime(new_date, "%m%d%y")
    print(date)

    #d = dt.datetime.strptime(datestr, "%b %d, %Y")
def test_with_enum():

    @enum.unique
    class MONTH(enum.Enum):
        JAN = '01'
        FEB = '02'
        MAR = '03'
        APR = '04'
        MAY = '05'
        JUN = '06'
        JUL = '07'
        AUG = '08'
        SEP = '09'
        OCT = '10'
        NOV = '11'
        DEC = '12'

    test = 'DEC0421'

    new_date = MONTH[test[0:3].upper()].value + test[3:]

    date = dt.datetime.strptime(new_date, "%m%d%y")
    #use with label to get value
    print(MONTH['DEC'].value)
    #use with value to get label
    print(MONTH('10').name)

The advantage of the enum is that you can easily find the value with the name and if needed the inverse as illustrated in the code. The enum was introduced in 3.4 with additinoal pip install enum is now ( I think from 3.7) default in python. I tested it on windows by adding it into the code in stream.py line12901 but afterwards it failed on https://github.com/geomagpy/magpy/blob/ae91462f1644ebf89fbe570ce1cf67d1d89eb926/magpy/lib/format_imf.py#L1745-L1750 also because of the %b . If you prefer that everything gets cleaned up I propose that I checkout the origin/gui-fixes branch and eliminate/test all usages of the %b. Although the prerequisite and forcing it into en_US will solve the problems as well.

leonro commented 2 years ago

An alternative solution... the basic problem are related to these lines: https://github.com/geomagpy/magpy/blob/ae91462f1644ebf89fbe570ce1cf67d1d89eb926/magpy/stream.py#L12891-L12904

What about changing that completely by using dateutil.parser:

Here is a quick example how it works

>>> import locale
>>> import datetime as dt
>>> import dateutil.parser as dparser
>>> mydate = "OCT3121"
>>> locale.setlocale(locale.LC_TIME, ('de_DE', 'UTF-8'))
'de_DE.UTF-8'
>>> date = dt.datetime.strptime(mydate, '%b%d%y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'OCT3121' does not match format '%b%d%y'
>>> date = dparser.parse(mydate[:5]+'-'+mydate[5:], dayfirst=True)
>>> date
datetime.datetime(2021, 10, 31, 0, 0)
>>> 

So we could change the specific code as follows and thus would avoid switching locale at all, which should be preferred. In the following we should not have problems with %b interpretations within the format libraries.

    # remove the getlocale/setlocale to en_US try/except clause
    try:
        tmpdaystring = daystring[-7:]
        date = dparser.parse(tmpdaystring[:5]+' '+tmpdaystring[5:], dayfirst=True)
        dateform = '%b%d%y'
    except:
        ...

Lines containing something like

    datetime.strptime(time,"%b%d%y_%H:%M")

could be replaced by

    dparser.parse(time[:5]+' '+time[5:].replace('_',' '), dayfirst=True)
stephanbracke commented 2 years ago

I tested it in windows and it seems to work. For the moment this method is independent of the locale and I hope that it will stay like that however we should check it every change of python version. You should replace all occurences of old_locale it is in stream.py but also in magpy_gui on different places https://github.com/geomagpy/magpy/blob/6a560b1cf7b9a0a76604250cbdb3ccf3aa2595a9/magpy/gui/magpy_gui.py#L3351-L3361 and also in format_imf.py ( on different places) https://github.com/geomagpy/magpy/blob/6a560b1cf7b9a0a76604250cbdb3ccf3aa2595a9/magpy/lib/format_imf.py#L3157-L3160 it was already known but put as comment https://github.com/geomagpy/magpy/blob/6a560b1cf7b9a0a76604250cbdb3ccf3aa2595a9/magpy/lib/format_imf.py#L3184-L3189 It is however better to not put dayfirst=True because the day isn't first in this pattern. I read the doc but they never promised that it will rest independent of the locale

leonro commented 2 years ago

I have changed all occurrences of locale accordingly (1d95cc3457240a304d070b4e620476960c7f98de) and also added a test environment into travis CI.