geopython / stetl

Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
https://www.stetl.org
GNU General Public License v3.0

IndexError: list index out of range exception when creating StringSubstitutionFilter #104

Open fsteggink opened 4 years ago

fsteggink commented 4 years ago

I'm getting an IndexError: list index out of range exception when creating a StringSubstitutionFilter. Stack trace:

  File "/usr/src/app/etl/tasks.py", line 66, in stetl_task
    etl.run()
  File "/usr/local/lib/python3.7/dist-packages/Stetl-2.1.dev0-py3.7.egg/stetl/etl.py", line 154, in run
    chain.assemble()
  File "/usr/local/lib/python3.7/dist-packages/Stetl-2.1.dev0-py3.7.egg/stetl/chain.py", line 87, in assemble
    etl_comp = factory.create_obj(self.config_dict, etl_section_name.strip())
  File "/usr/local/lib/python3.7/dist-packages/Stetl-2.1.dev0-py3.7.egg/stetl/factory.py", line 28, in create_obj
    raise e
  File "/usr/local/lib/python3.7/dist-packages/Stetl-2.1.dev0-py3.7.egg/stetl/factory.py", line 25, in create_obj
    class_obj_inst = self.new_instance(class_obj, configdict, section)
  File "/usr/local/lib/python3.7/dist-packages/Stetl-2.1.dev0-py3.7.egg/stetl/factory.py", line 62, in new_instance
    return class_obj(configdict, section)
  File "/usr/local/lib/python3.7/dist-packages/Stetl-2.1.dev0-py3.7.egg/stetl/filters/stringfilter.py", line 62, in __init__
    self.format_args_dict = Util.string_to_dict(self.format_args, self.separator)
  File "/usr/local/lib/python3.7/dist-packages/Stetl-2.1.dev0-py3.7.egg/stetl/util.py", line 112, in string_to_dict
    x[1] = x[1].replace(space, ' ')
IndexError: list index out of range

This happens when the config file contains placeholders whose values are passed on the command line, and a value contains spaces, which are represented by tildes. Example: stetl -c blah.cfg -a myvalue=contains~space

Previously, as a workaround, I passed those values as environment variables, like export STETL_myvalue=contains~space. Then this error doesn't occur.

As you can see, this occurs in Stetl version 2.1-dev, but this also happened before the 2.0 versions.

During debugging, I found out that after I create the ETL object (etl = ETL(vars(args_parsed), args_parsed.config_args)) and show the config_dict, the relevant section is shown like this:

[my_filter]
class = stetl.filters.stringfilter.StringSubstitutionFilter
format_args = myvalue:contains space

It is clear that the tilde is replaced by a space earlier in the process, at the creation of the ETL object.

So, when this is passed to string_to_dict, dict_arr looks like this: [['myvalue', 'contains'], ['space']]. That causes the IndexError, since the second list contains only one element.
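To make the split concrete, here is a minimal sketch that applies the same two-step splitting (whitespace first, then the separator) to the format_args value shown above. The ':' separator is an assumption inferred from the format_args example; the filter's actual configured separator is not shown in this report.

```python
# Value as it appears in config_dict, with the tilde already replaced by a space.
format_args = "myvalue:contains space"
separator = ":"  # assumption: inferred from the format_args example above

# Same splitting order as string_to_dict: whitespace first, then the separator.
dict_arr = [token.split(separator) for token in format_args.split()]
print(dict_arr)

error = None
try:
    for pair in dict_arr:
        # The second token has no separator, so it has no index 1.
        pair[1] = pair[1].replace("~", " ")
except IndexError as exc:
    error = exc
    print("IndexError:", exc)
```

The second token, 'space', never contained a separator, so indexing its second element fails.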

I haven't yet looked into where exactly this goes wrong. It must happen after extra arguments are passed through -a, but not when arguments are passed as environment variables with the 'STETL_' prefix.

fsteggink commented 4 years ago

More info: the tildes have already been replaced in the options_dict and args_dict which are passed to the ETL constructor. They're coming from the parse_args function in stetl.main.

I suspect this piece of code is responsible for this error:

    if args.config_args:
        args_total = dict()
        for arg in args.config_args:
            if os.path.isfile(arg):
                log.info('Found args file at: %s' % arg)
                args_total = Util.merge_two_dicts(args_total, Util.propsfile_to_dict(arg))
            else:
                # Convert string to dict: http://stackoverflow.com/a/1248990
                args_total = Util.merge_two_dicts(args_total, Util.string_to_dict(arg))

        args.config_args = args_total 

Note that in the else case, Util.string_to_dict is called, which is also called in the constructor of the StringSubstitutionFilter (see stack trace above). The solution is to make sure that Util.string_to_dict is only called once. Preferably this should be done when the StringSubstitutionFilter is constructed, since it needs to create a dictionary from the format_args string. On the other hand, I'm also using -a to inject arguments into placeholders in other places in my config, but maybe spaces instead of tildes can be safely used there.

For now, I'll use another character as a workaround in my StringSubstitutionFilter, since it is not really a problem in my case.
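The double conversion can be sketched as follows. The function body is the string_to_dict logic from stetl/util.py as it appears in this thread. The config template line and its {myvalue} placeholder are hypothetical, since the real config is not shown here.

```python
def string_to_dict(s, separator='=', space='~'):
    # Same logic as Util.string_to_dict in stetl/util.py, as shown in this thread.
    dict_arr = [x.split(separator) for x in s.split()]
    for x in dict_arr:
        x[1] = x[1].replace(space, ' ')
    return dict(dict_arr)

# Pass 1: what parse_args does for "-a myvalue=contains~space".
cli_args = string_to_dict("myvalue=contains~space")

# Hypothetical format_args template from the config file (an assumption).
template = "myvalue:{myvalue}"
format_args = template.format(**cli_args)

# Pass 2: what the StringSubstitutionFilter constructor does with format_args.
second_pass_error = None
try:
    string_to_dict(format_args, ':')
except IndexError as exc:
    second_pass_error = exc

print(cli_args, repr(format_args), repr(second_pass_error))
```

After the first pass the tilde is gone, so the second pass sees a literal space and treats it as a token boundary.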

justb4 commented 4 years ago

At least the tilde and separator are assigned/used as defaults in util.py string_to_dict():

# Convert a string to a dict

@staticmethod
def string_to_dict(s, separator='=', space='~'):
    # Convert string to dict: http://stackoverflow.com/a/1248990
    dict_arr = [x.split(separator) for x in s.split()]
    for x in dict_arr:
        x[1] = x[1].replace(space, ' ')

    return dict(dict_arr)

Think this was introduced in one of the first versions to support long strings for ogr2ogr exec.

In hindsight, the .ini file format for Stetl config is not ideal. These days JSON, YAML, TOML and the like are more standard, and are more lenient with strings and even whole texts (especially YAML).

For now: maybe there is a way to change the defaults = and ~ for string_to_dict. An environment variable? But that is not so transparent...
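Independent of how the defaults are chosen, one possible mitigation is a stricter parser that reports the offending token instead of raising a bare IndexError. The sketch below is only an illustration: string_to_dict_strict is a hypothetical name and is not part of Stetl.

```python
def string_to_dict_strict(s, separator='=', space='~'):
    """Hypothetical variant of string_to_dict; not part of Stetl."""
    result = {}
    for token in s.split():
        # partition splits on the first separator only, so values may contain it.
        key, sep, value = token.partition(separator)
        if not sep:
            raise ValueError(
                "Expected key%svalue but got %r; spaces in values must be written as %r"
                % (separator, token, space))
        result[key] = value.replace(space, ' ')
    return result


print(string_to_dict_strict("myvalue=contains~space other=a=b"))

try:
    string_to_dict_strict("myvalue:contains space", ':')
except ValueError as exc:
    print("ValueError:", exc)
```

This does not remove the double conversion itself, but it turns the failure into a message that names the token without a separator.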

sebastic commented 2 years ago

This also happens when trying to import bagv2:

$ ./bagv2/etl/etl.sh -v -a ./bagv2/etl/options/hostname.args
INFO: 21-11-18 10:44:52 - Using options_file=options/hostname.args and user_args=-c conf/etl-imbag-2.1.0.cfg -v -a ./bagv2/etl/options/hostname.args
2021-11-18 10:44:52,461 util INFO Found lxml.etree, native XML parsing, fabulous!
2021-11-18 10:44:52,566 util INFO Found GDAL/OGR Python bindings, super!!
2021-11-18 10:44:52,571 main INFO Stetl version = 2.1.dev0
2021-11-18 10:44:52,573 main INFO Found args file at: /home/bas/software/nlextract/git/bagv2/etl/options/common.args
2021-11-18 10:44:52,573 main INFO Found args file at: options/hostname.args
Traceback (most recent call last):
  File "/home/bas/software/nlextract/git/externals/stetl/bin/stetl", line 43, in <module>
    main()
  File "/home/bas/software/nlextract/git/externals/stetl/bin/stetl", line 27, in main
    args = parse_args(sys.argv[1:])
  File "/home/bas/software/nlextract/git/externals/stetl/stetl/main.py", line 55, in parse_args
    args_total = Util.merge_two_dicts(args_total, Util.string_to_dict(arg))
  File "/home/bas/software/nlextract/git/externals/stetl/stetl/util.py", line 112, in string_to_dict
    x[1] = x[1].replace(space, ' ')
IndexError: list index out of range