Closed abhi1693 closed 2 years ago
Hi, thank you for raising this issue, will try to reproduce.
How many devices are you calling run_ttp for?
Are you simultaneously running multiple jobs that can call run_ttp e.g. on schedule?
Would this error go away if you set nornir_workers: 1 in the proxy minion settings? E.g.:
# proxy_minion.sls file
proxy:
  proxytype: nornir
  nornir_workers: 1
What version are you running? Could you share the nr.nornir version output, or part of it?
I have a UI where a user clicks a button and the system generates around 300-600 background jobs, depending on how many interfaces the device has, because we need to parse a couple of things for each interface separately. You may have noticed I'm the same user who raised the issue about adding support for Cisco IOS NAT in TTP templates. The log I shared is for parsing the secondary IP of a single interface, but I see the same issue when parsing the NAT entries using the files I shared in the other issue.
Now that you have asked about a single nornir worker: I have not tried it yet, but I will. However, when I had 1 rq worker running all the jobs one by one, I saw almost no such errors. When we put the code in production to take it for a spin, we had a minimum of 10 rq workers trying to run 10 jobs in parallel, and we saw much more of this issue. I've also noticed that the minion downloads the file into its cache but still calls the master with cp.get_url every time, even though the cached file is already present on the local filesystem. If the code looked in the cache first and simply read the file when it already exists, wouldn't that fix this issue?
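The cache-first lookup suggested above could look roughly like the following sketch. It is not salt-nornir's code: `cached_fetch`, `cache_dir`, the hash-based file naming, and the `fetch` callable (standing in for a call such as `cp.get_url`) are all hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path, fetch: Callable[[str], str]) -> str:
    """Return the content for url, reading the local cache before calling fetch."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumed cache layout: one file per URL, named by the URL's SHA-256 digest.
    cache_file = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if cache_file.exists():
        # Cache hit: no round trip to the master is needed.
        return cache_file.read_text(encoding="utf-8")
    # Cache miss: fetch once (stand-in for the call to the master) and store it.
    content = fetch(url)
    cache_file.write_text(content, encoding="utf-8")
    return content
```

One trade-off of this simple form is that a cached file is never refreshed, so a template changed on the master would not be picked up until the cache entry is removed.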
Regarding the information you asked for, I'll get it to you in a while, as I'm currently away from my PC.
Here is a result for the version you asked for
Setting nornir_workers: 1 does not resolve the issue either.
@dmulyalin Do you want any other information?
Is it the same traceback with 1 worker as before with multiple workers? I'm mainly interested to know whether you still see this line of code hit in the traceback:
  File "/usr/local/lib/python3.8/dist-packages/salt_nornir/proxy/nornir_proxy_module.py", line 1153, in _download_files
    content = __salt__["cp.get_url"](kwargs[key], dest=None, saltenv=saltenv)
Yes, it's the same
Thank you for the details provided.
FYI, I was able to reproduce the issue in a lab environment and implemented a fix. I'm going to test it further and include it in the next release. I also updated the code to check for and use cached files first, wherever possible, before re-downloading them from the master.
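The thread does not describe the fix itself. Purely as a generic illustration of how several parallel callers can share one download instead of each hitting the master, here is a minimal sketch that assumes thread-based workers and an in-memory cache. The names `fetch_once`, `run_parallel`, and the `fetch` callable are hypothetical and are not part of salt-nornir.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

_cache: Dict[str, str] = {}
_cache_lock = threading.Lock()


def fetch_once(url: str, fetch: Callable[[str], str]) -> str:
    """Fetch url at most once per process, even when called from many threads."""
    # Holding the lock across the check and the fetch keeps callers from racing:
    # the first caller downloads, the others wait and then read the cached value.
    with _cache_lock:
        if url not in _cache:
            _cache[url] = fetch(url)
        return _cache[url]


def run_parallel(url: str, fetch: Callable[[str], str], workers: int = 10) -> List[str]:
    """Call fetch_once from `workers` threads at the same time and collect results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: fetch_once(url, fetch), range(workers)))
```

A single global lock serializes downloads of unrelated files as well. A lock per URL would avoid that at the cost of a little more bookkeeping.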
Is it possible that I can also test the changes to provide feedback?
@abhi1693 I pushed the latest code to the nornir-salt and salt-nornir master branches. You can install from them if you want to test:
python3 -m pip install git+https://github.com/dmulyalin/nornir-salt
python3 -m pip install git+https://github.com/dmulyalin/salt-nornir
I've introduced a few other changes in that latest code, but all the tests are passing, so there should not be compatibility issues. Let me know if you encounter anything strange.
This seems to be running well so far. I'm running 20 rq workers on my end, the proxy is running with the default number of workers, and I have not seen the errors yet.
OK, that's great. When you say 20 rq workers, do you mean 20 salt-nornir proxy minions, or are rq workers minion-like pieces of code that tap into the SaltStack event bus?
rq workers: each one runs a single job that sends a request to the master through the REST API. At the time of writing, I'm running 30 such workers.
I'm closing this as this has been working as expected for quite a while now
Thank you for confirming. Release 0.11.0 has this fix integrated.
The proxy minion throws a lot of errors, but only when I use run_ttp in my workflow.
Note: This is an intermittent issue. It works 10% of the time without errors.