cioos-siooc / ocean-data-parser

ocean-data-parser parses manufacturer or organizations proprietary file format into an xarray datasets.
GNU General Public License v3.0
4 stars 1 forks source link

Move away from eval in vocabulary mapping #94

Open JessyBarrette opened 2 months ago

JessyBarrette commented 2 months ago

Issue

For some variable vocabularies, an apply_func parameter is defined to derive a new variable from the existing one. The defined apply_func value is then evaluated via eval() to generate a new variable. This can create some security risks which should ideally be mitigated.

This has for objectif to help standardizing the different variables variance to a same standard. Here's a quick list of the different functions applied:

The related vocabularies are the different DFO offices related ones:

Fix

A few possibilities regarding how to fix this issue:

  1. Create a dedicated function that essentially identify those problematic variables and apply a correction. The function can be either specific to each parser or more global. Likely the first may be easier to maintain in the future.
  2. Attempt to still use eval or an equivalent while fixing the security issues.

@sjbruce feel free to add anything in there.

sjbruce commented 2 months ago

There are a few options that might be useful as a replacement for using eval() directly, one of the more promising modules is asteval: https://lmfit.github.io/asteval/index.html

It's primarily built to allow for the evaluation of mathematical expressions from a string, which would address the majority of issues with minimal modification.

It also has the ability to build custom symbol tables that appear to be able to accept custom functions as well, which may allow for importing of gsw functions. https://lmfit.github.io/asteval/api.html#utility-functions

This last point needs to be tested as it may be limited to numpy functions. If that doesn't work out the apply_func could be checked for "gsw." and then more explicitly execute the appropriate functions without using any form of eval().

JessyBarrette commented 2 months ago

I think this could potentially be useful if we had to do a lot of different equations. However, since we're only so far handling ~10 equations, maybe we can just create a simple function with the function as str, the dataset and related variable:

def apply_func(apply_func:str,ds:xarray, var:str):

    if apply_func == "lamba x: x*10":
        return ds[variable] * 10
    elif ...
       ...
    raise ValueError(....)

We could potentially reassess if needed. Thoughts?

sjbruce commented 2 months ago

True, it is a small list. We'll need tweak the CSV files to account for some small variations in the text here & there but that's fairly straight-forward.

Alright, let's do that first.