cedadev / nappy

NASA Ames Processing in PYthon (NAPPy) - a Python library for reading, writing and converting NASA Ames files.
BSD 3-Clause "New" or "Revised" License
9 stars 13 forks source link

Some data providers encode var names and units in a format that Nappy doesn't parse #45

Open LukeJones123 opened 2 years ago

LukeJones123 commented 2 years ago

Description

Some of the files in the NDACC database do not have variable name+units strings that conform to the expectations of Nappy. For example, some providers separate the two with a semicolon, e.g.

N2O vertical column amount; molecules/cm**2
HNO3 vertical column amount; molecules/cm**2
ClONO2 vertical column amount; molecules/cm**2
HCl vertical column amount; molecules/cm**2
HF vertical column amount; molecules/cm**2

This issue could be addressed by a more complicate regular expression, but others are more convoluted and regular expressions aren't enough. Thus would be good if a custom user callback function could be allowed to do the parsing.

What I Did

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.
FObersteiner commented 2 years ago

I have the same problem...

For example, my VNAMEs look like "Ozone (O3); volume mixing ratio; [ppb]" - so the pattern would be

name; unit description; unit, or r'(?P<name>.*)\;\ (?P<description>.*)\;\ (?P<unit>.*)' as a regex in Python.

However the default pattern is

myfile.var_and_units_pattern
Out[32]: re.compile(r'^\s*(.*)\((.+?)\)(.*)\s*$', re.UNICODE)

function _attemptVarAndUnitsMatch seems to do the pattern matching (name, unit, description) - but how to set the pattern correctly? Could we extend the NACore class by a setter method for the var_and_units_pattern attribute?

LukeJones123 commented 2 years ago

@FObersteiner, I have a branch that allows you to supply your own Python function to separate the names from the units: https://github.com/LukeJones123/nappy/tree/feature-var_units_parsing. Hopefully Ag will release a new version with this merged in soon.

FObersteiner commented 2 years ago

@LukeJones123 out of curiosity, why did you implement it as a callback? Wouldn't it be more straight-forward to add a setter method for the class attribute, so the user can provide a custom regex before actually reading the file? Also, your callback func resolves to two variables instead of the default triplet v1, unit, v2. What if I want to resolve vname to three (or more) components?

LukeJones123 commented 2 years ago

A custom regex was the obvious first choice but unfortunately I find the strings are so heterogeneously formatted that I need to implement some logic to catch all cases, hence a callback function. A custom regex would be a good additional option to have, for those with simpler needs. Of course, the best option would be to simply stop using Nasa Ames - I hate this format for exactly these reasons!

FObersteiner commented 2 years ago

@LukeJones123 let me propose a way that does not require to modify the API but satisfies your needs nevertheless!

1- initialize var_and_units_pattern as an underscored class attribute:

def __init__(self):
    # ...
    self._var_and_units_pattern = re.compile(r"^\s*(.*)\((.+?)\)(.*)\s*$")

2- make it a property with a setter:

@property
def var_and_units_pattern(self):
    """
    a regex or function to parse vnames to (name & description, unit)
    """
    return self._var_and_units_pattern

@var_and_units_pattern.setter
def var_and_units_pattern(self, new_pattern):
    if not isinstance(new_pattern, (re.Pattern, types.FunctionType)):
        raise TypeError(f"new pattern must be re.Pattern or function, "
                        f"not {type(new_pattern)}")
    self._var_and_units_pattern = new_pattern

3- include the callback function option in _attemptVarAndUnitsMatch:

def _attemptVarAndUnitsMatch(self, item):
    if isinstance(self._var_and_units_pattern, types.FunctionType):
        return self._var_and_units_pattern(item)

    match = self._var_and_units_pattern.match(item)
    if match:
    # ...

---> now you can easily set your custom parser or regex after loading the file:

f.var_and_units_pattern = custom_parser

--> A working implementation in my fork

LukeJones123 commented 2 years ago

Hi Florian,

A problem with your suggestion is that a callback function is not the same thing as a regular expression. Having an attribute called var_and_units_pattern that is sometimes set to a regular expression and sometimes set to a function is a bit confusing. Since they are such different things I think a callback function should have a separate attribute.

Your suggestion would be good as a way to set the pattern to a non-default value, and being able to provide a non-default pattern would also be a good thing, but that should be a separate development I think.

Luke.

FObersteiner commented 2 years ago

@LukeJones123 you're right, the attribute name would be misleading if it can be a function.

LukeJones123 commented 2 years ago

Am I correct that what you're suggesting is, instead of passing the custom parser function as an input to the NAFile init() method, you first create the NAFile object and then use a setter to set the attribute? The trouble with that is it means the init() method will never be able to make any attempt to read the file, because it must allow for the caller to change attributes relevant to file parsing even after object initialisation. I think, at the moment, the file header is actually read by init(). Even if it wasn't, under the current API it could be changed to do so in future. Allowing the setting of attributes required for file parsing after initialisation will prevent that forever (unless you implement a backwards incompatible API change). In that respect, this attribute is similar to the ignore_header_lines argument, which is also passed to init.

I know the way I've done it looks a bit clunky but I think it's the best way.

FObersteiner commented 2 years ago

@LukeJones123 why shouldn't you be able to set an attribute (with a setter) upon initialization? Simple example:

class myClass:
    def __init__(self, **kwargs):
        self.a_string = kwargs['the_string'] if "the_string" in kwargs else None

    @property
    def a_string(self):
        return self._a_string

    @a_string.setter
    def a_string(self, value):
        self._a_string = value

c = myClass()
print(c.a_string)
# None

c_with_string = myClass(the_string="hello")
print(c_with_string.a_string)
# hello

I think the **kwargs approach matches well with what you need, no?

The advantage would be that we could also add deleter; meaning that you can change or remove the custom parser func after initialization of the na file class instance.

cheers, Florian

LukeJones123 commented 2 years ago

In your example the setter serves no purpose if modifying self._a_string after initialisation is too late to have the desired effect, which is the case for this callback function, so this does not match with the use case in question. There is also no sensible use case for deleting the callback function at a later time. Can we please just agree on the current solution?

FObersteiner commented 2 years ago

Again, I don't think this is true. Anyways, what about Ag merges your PR and I'll have a look if I can make it look a bit less chunky afterwards - without breaking the functionality you need of course.

FObersteiner commented 2 years ago

@LukeJones123 @agstephens I think this can be closed, no?

LukeJones123 commented 2 years ago

For me the issue will be closed when the changes are released in the next version - 2.0.2.