Issues when using `run_ttp`

abhi1693 commented 2 years ago

The proxy minion is throwing a lot of errors only when I have used run_ttp in my workflow

2022-05-03 07:14:53,327 [salt.loaded.ext.proxy.nornir_proxy_module                           :1001][ERROR   ][646543] Nornir-proxy MAIN PID 646543 job failed: {'task_fun': 'nornir_salt.plugins.tasks.netmiko_send_commands', 'kwargs': {'FH': 'x.x.x.x', 'add_details': True, 'enable': True, 'run_ttp': 'salt://ttp_interface_secondary_ip.txt', '__pub_fun': 'nr.cli', '__pub_arg': ['show run interface TenGigabitEthernet0/0/0.1830', {'FH': 'x.x.x.x', 'add_details': True, 'enable': True, 'run_ttp': 'salt://ttp_interface_secondary_ip.txt'}], '__pub_tgt': 'nornir_fremont', '__pub_jid': '20220503141450775369', '__pub_ret': '', '__pub_tgt_type': 'glob', '__pub_user': 'root', 'render': ['filename', 'commands'], 'commands': ('show run interface TenGigabitEthernet0/0/0.1830',), 'connection_name': 'netmiko'}, 'identity': {'jid': '20220503141450775369', 'uuid4': 'e738e0e2-a21c-42cb-8f37-31edc162683b', 'user': 'root', 'function': 'exec.nr.cli'}, 'name': 'nornir_salt.plugins.tasks.netmiko_send_commands'}, error:
'Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 984, in _worker
    output = run(
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 1571, in run
    _download_files(download, kwargs, loader=loader)
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 662, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 1153, in _download_files
    content = __salt__["cp.get_url"](kwargs[key], dest=None, saltenv=saltenv)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/modules/cp.py", line 424, in get_url
    result = _client().get_url(
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 517, in get_url
    result = self.get_file(url, dest, makedirs, saltenv, cachedir=cachedir)
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 1173, in get_file
    hash_server, stat_server = self.hash_and_stat_file(path, saltenv)
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 1420, in hash_and_stat_file
    hash_result = self.hash_file(path, saltenv)
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 1413, in hash_file
    return self.__hash_and_stat_file(path, saltenv)
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 1405, in __hash_and_stat_file
    return self.channel.send(load)
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 454, in run_sync
    self.start()
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 757, in start
    raise RuntimeError("IOLoop is already running")
RuntimeError: IOLoop is already running
'

2022-05-03 07:14:53,325 [salt.loaded.ext.proxy.nornir_proxy_module                           :1001][ERROR   ][646543] Nornir-proxy MAIN PID 646543 job failed: {'task_fun': 'nornir_salt.plugins.tasks.netmiko_send_commands', 'kwargs': {'FH': 'x.x.x.x', 'add_details': True, 'enable': True, 'run_ttp': 'salt://ttp_interface_secondary_ip.txt', '__pub_fun': 'nr.cli', '__pub_arg': ['show run interface TenGigabitEthernet0/0/0.1829', {'FH': 'x.x.x.x', 'add_details': True, 'enable': True, 'run_ttp': 'salt://ttp_interface_secondary_ip.txt'}], '__pub_tgt': 'nornir_fremont', '__pub_jid': '20220503141450433928', '__pub_ret': '', '__pub_tgt_type': 'glob', '__pub_user': 'root', 'render': ['filename', 'commands'], 'commands': ('show run interface TenGigabitEthernet0/0/0.1829',), 'connection_name': 'netmiko'}, 'identity': {'jid': '20220503141450433928', 'uuid4': 'eb456242-50ff-4d54-8678-24d255cda8ca', 'user': 'root', 'function': 'exec.nr.cli'}, 'name': 'nornir_salt.plugins.tasks.netmiko_send_commands'}, error:
'Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 984, in _worker
    output = run(
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 1571, in run
    _download_files(download, kwargs, loader=loader)
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 662, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 1153, in _download_files
    content = __salt__["cp.get_url"](kwargs[key], dest=None, saltenv=saltenv)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/modules/cp.py", line 424, in get_url
    result = _client().get_url(
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 517, in get_url
    result = self.get_file(url, dest, makedirs, saltenv, cachedir=cachedir)
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 1173, in get_file
    hash_server, stat_server = self.hash_and_stat_file(path, saltenv)
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 1420, in hash_and_stat_file
    hash_result = self.hash_file(path, saltenv)
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 1413, in hash_file
    return self.__hash_and_stat_file(path, saltenv)
  File "/usr/lib/python3/dist-packages/salt/fileclient.py", line 1405, in __hash_and_stat_file
    return self.channel.send(load)
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 458, in run_sync
    raise TimeoutError('Operation timed out after %s seconds' % timeout)
salt.ext.tornado.ioloop.TimeoutError: Operation timed out after None seconds
'

Note: This is an intermittent issue. It works 10% of the time without errors

dmulyalin commented 2 years ago

Hi, thank you for raising this issue, will try to reproduce.

How many devices are you calling run_ttp for? Are you simultaneously running multiple jobs that can call run_ttp e.g. on schedule? Would this error be gone if you set nornir_workers : 1 in proxy minion setting? e.g.

# proxy_minion.sls file
proxy:
  proxytype: nornir
  nornir_workers: 1

What version are you running, could you share nr.nornir version output or part of it?

abhi1693 commented 2 years ago

I've ui from where a user clicks on a button and the system generates around 300-600 background jobs depending on how many interfaces the device has as we need to parse couple of things for each interface seperately. You may have noticed I'm the same user who raised the issue for adding support for Cisco ios nat in ttp templates. The log I have shared is for parsing the secondary IP of single interface but I have the same issue while parsing the NAT entries using the files I shared in the other issue.

Now, that you have asked about a single nornir worker, I have not tried it yet but will do, however whenever I had 1 rq worker which ran all the jobs 1 by 1, I saw almost no such error being called. But when we put the code in production to take it for a spin, he have minimum 10 rq workers which tried to run 10 jobs in parallel and we saw more of this issue. I've also noticed that the minion downloads the file into its cache but still calls the master with cp.get_url everytime irrespective of the cached file already present on the local filesystem. Maybe, if the code looked into the cache first to simply read the file if it already exist, wouldn't it fix this issue?

Regarding the information that you have asked of me, I'll get you get in a while as I'm currently away from my PC.

abhi1693 commented 2 years ago

Here is a result for the version you asked for

abhi1693 commented 2 years ago

Setting nornir_workers: 1 does not resolve the issue either.

abhi1693 commented 2 years ago

@dmulyalin Do you want any other information?

dmulyalin commented 2 years ago

Is it the same tracabck you seeing with 1 worker as before with multiple workers, mainly interested to know if you still seeing this line of code hit in the traceback:

  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 1153, in _download_files
    content = __salt__["cp.get_url"](kwargs[key], dest=None, saltenv=saltenv)

abhi1693 commented 2 years ago

Yes, it's the same

dmulyalin commented 2 years ago

Thank you for the details provided.

FYI, was able to reproduce the issue in lab environment, implemented a fix, going to test it further and include in next release. Also update the code to check and use cached files first before re-downloading them from master wherever possible.

abhi1693 commented 2 years ago

Is it possible that I can also test the changes to provide feedback?

dmulyalin commented 2 years ago

@abhi1693 pushed latest code to nornir-salt and salt-nornir master branches, can try installing from them if want to test:

python3 -m pip install git+https://github.com/dmulyalin/nornir-salt
python3 -m pip install git+https://github.com/dmulyalin/salt-nornir

I've introduced other few changes in that latest code, but all the tests are passing fine, so, should not have compatibility issues, but let me know if you encounter anything strange.

abhi1693 commented 2 years ago

This seems to be running good so far. I'm running 20 rq workers on my end and proxy is running with defaults workers and have not seen the errors yet

dmulyalin commented 2 years ago

Ok, that's great, when you saying 20rq workers, you mean 20 salt nornir proxy minions, or rq workers are minions like peaces of code that tap into saltstack event bus?

abhi1693 commented 2 years ago

rq workers: these run a single job that sends the request to the master using rest API. As of writing, I'm running 30 such workers

abhi1693 commented 2 years ago

I'm closing this as this has been working as expected for quite a while now

dmulyalin commented 2 years ago

thank you for confirming, release 0.11.0 release that has this fix integrated into it.

dmulyalin / salt-nornir

Issues when using `run_ttp` #9