Closed: anthonyxpalermo closed this issue 4 years ago
@anthonyxpalermo hey, thanks for reporting this!
A couple of questions:
1) Have you tried other distributions? Is it difficult for you to try this against a Linux distribution? (because I can reproduce it using our CI container, which is Linux-based)
2) If you run another query, e.g. "SELECT 1" or anything else, does the task hang?
3) When the task is hanging, could you please connect to your database in another console, run "SHOW PROCESSLIST;", and provide the output?
4) Try to remove ignore_unreachable and ignore_errors and run your initial playbook. If it fails, please provide the output with -vvv.
Waiting for your feedback! Thanks!
A couple of questions:
Have you tried other distributions? Is it difficult for you to try this against a Linux distribution? (because I can reproduce it using our CI container, which is Linux-based)
If you run another query, e.g. "SELECT 1" or anything else, does the task hang? The task does not always hang; out of 4000 entries it only hangs 3 times. The issue is that when it does hang it never moves forward unless I either A) quit the playbook with Ctrl-C, or B) use Ansible 2.10 with the task timeout option.
When the task is hanging, could you please connect to your database in another console, run "SHOW PROCESSLIST;", and provide the output?
Try to remove ignore_unreachable and ignore_errors and run your initial playbook. If it fails, please provide the output with -vvv.
With debug=True and -vvvv, it always shows the following:
2020-10-01 14:47:26,626 p=58280 u=root n=ansible |
2020-10-01 14:47:26,627 p=58280 u=root n=ansible | 58280 1601578046.62702: opening command with Popen()
2020-10-01 14:47:26,631 p=58280 u=root n=ansible | 58280 1601578046.63149: done running command with Popen()
2020-10-01 14:47:26,632 p=58280 u=root n=ansible | 58280 1601578046.63273: getting output with communicate()
After that, nothing.
Could it be OOM on the invoked subprocess? If that's the case, the call to communicate() would wait forever for a response that won't arrive. It's not the only possible cause, but it's one that won't show up in the Ansible logs.
@anthonyxpalermo Another question to help pinpoint this issue. Can you confirm that you're actually using this collection and not the Ansible bundled module? In your sample code the task module appears as mysql_query rather than community.mysql.mysql_query.
Thanks.
@anthonyxpalermo thanks for the fast reply!
In addition to @Jorge-Rodriguez's question, could you also please:
1) provide the rows of ps aux related to Ansible processes on the target machine when a task is hanging
2) run grep -i "fatal\|error\|warning\|panic\|something else" ... on the MySQL and system logs after the hang (or share the logs if that's OK with you, I could parse them myself)
@anthonyxpalermo thanks for the fast reply! In addition to @Jorge-Rodriguez's question, could you also please:
provide the rows of ps aux related to Ansible processes on the target machine when a task is hanging
68830 palea009 22 0 9537M 104M ? 0.0 0.3 0:02.17 │ │ │ └─ /Applications/Visual Studio Code.app/Contents/Frameworks/Code Helper (Renderer).app/Contents/MacOS/Code Helper (Renderer) /Users/palea009/.vscode/extensions/vscoss.vscode-ansible-0.5.2/node_modules/vscode-languageclient/lib/utils/el
69173 palea009 24 0 5046M 55544 ? 13.0 0.2 0:12.84 │ │ └─ /Users/palea009/.pyenv/versions/3.8.0/bin/python3.8 /Users/palea009/.pyenv/versions/3.8.0/bin/ansible-playbook PlayBooks/AWS/roles/aws_acm_last_mile_scanner.yml -i inventory/certdiscoverytools --vault-id vault-ids/.prod.vaultpass -e
69304 palea009 16 0 5048M 43140 ? 0.0 0.1 0:00.27 │ │ ├─ /Users/palea009/.pyenv/versions/3.8.0/bin/python3.8 /Users/palea009/.pyenv/versions/3.8.0/bin/ansible-playbook PlayBooks/AWS/roles/aws_acm_last_mile_scanner.yml -i inventory/certdiscoverytools --vault-id vault-ids/.prod.vaultpass
69321 palea009 32 0 4296M 1192 ? 0.0 0.0 0:00.00 │ │ │ └─ /bin/sh -c /usr/bin/python /Users/palea009/.ansible/tmp/ansible-tmp-1601906737.595197-69304-248465093681409/AnsiballZ_aws_acm_info.py && sleep 0
69322 palea009 24 0 5098M 61340 ? 2.9 0.2 0:03.61 │ │ │ └─ /usr/bin/python /Users/palea009/.ansible/tmp/ansible-tmp-1601906737.595197-69304-248465093681409/AnsiballZ_aws_acm_info.py
run grep -i "fatal\|error\|warning\|panic\|something else" ... on the MySQL and system logs after the hang (or share the logs if that's OK with you, I could parse them myself)
MySQL is in AWS RDS; CloudWatch shows no issues on the DB side. It's also true that other Ansible playbooks write to this DB. Either way, we need some way to break out of the hanging state. I agree it could be OOM on the invoked subprocess; break out and retry.
@anthonyxpalermo Another question to help pinpoint this issue. Can you confirm that you're actually using this collection and not the Ansible bundled module? In your sample code the task module appears as mysql_query rather than community.mysql.mysql_query. Thanks.
Made the change from mysql_query to community.mysql.mysql_query; will post once the playbook ends.
The issue remains even with community.mysql.mysql_query.
TASK [aws-acm : Insert {{hostname_item}} and {{ports_item}} Records to our DataBase] ***
fatal: [localhost]: FAILED! => {"changed": false, "msg": "The community.mysql.mysql_query action failed to execute in the expected time frame (10) and was terminated"}
The records for these failures are not added to the DB. I did check that this does not happen to a specific record; on a previous run the same record did not fail, so it's not related to the record info being added to the DB.
@anthonyxpalermo could you also try to run the same query using the shell module?
Just curious whether it hangs again; if yes, this will clarify the picture.
@anthonyxpalermo could you also try to run the same query using the shell module?
How do I do this?
@anthonyxpalermo could you also try to run the same query using the shell module?
How do I do this?
- shell: mysql -e "YOUR QUERY HERE"
Connection parameters notwithstanding. Essentially the command you'd run by hand in bash, as a parameter to the shell module.
In fact, it would be interesting to know if the query succeeds when run by hand too.
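For illustration, a fuller shell-based variant with the connection parameters filled in might look like the sketch below. It reuses the login_* variables from the mysql_query task posted later in this thread, and the query placeholder is left as-is, so this is a sketch rather than a drop-in replacement:

- name: Run the same query through the mysql CLI (sketch)
  # reuses the db_* variables from the mysql_query task in this thread
  shell: >
    mysql --host={{ db_host_prod }} --port={{ db_port_prod }}
    --user={{ db_user_prod }} --password='{{ db_password_prod }}'
    {{ db_prod }} -e "YOUR QUERY HERE"
  register: cli_query_result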
Also, on your ps aux output, I can see processes running the aws_acm_info module but none running the mysql_query module.
You may also want to use the quote filter instead of regex_replace. If this is an OOM, that might help.
Also, on your ps aux output, I can see processes running the aws_acm_info module but none running the mysql_query module.
Not sure why on MacOS you don't see that. The mysql_query module runs, and after it completes the playbook cycles again.
You may also want to use the quote filter instead of regex_replace. If this is an OOM, that might help.
Do you mean here? query:
Could you post how? I am not familiar with how to use that option.
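For what it's worth, a different way to avoid hand-rolled quoting altogether (an alternative to the quote filter Jorge mentioned, not what he asked for verbatim) is the module's positional_args option, which hands the values to the driver and lets it do the escaping. A minimal, untested sketch, assuming acm_to_db is a flat dict of column names to values as in the original task:

- name: Insert a row using parameterized values (sketch, untested)
  community.mysql.mysql_query:
    login_db: "{{ db_prod }}"
    login_host: "{{ db_host_prod }}"
    login_port: "{{ db_port_prod }}"
    login_user: "{{ db_user_prod }}"
    login_password: "{{ db_password_prod }}"
    # one %s placeholder per column; the driver quotes/escapes each value
    query: >-
      INSERT INTO etl_in_aws_acm_certs_last_mile
      ( {{ acm_to_db.keys() | join(', ') }} )
      VALUES ( {{ (['%s'] * (acm_to_db | length)) | join(', ') }} )
    positional_args: "{{ acm_to_db.values() | list }}"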
@anthonyxpalermo could you also try to run the same query using the shell module?
How do I do this?
- shell: mysql -e "YOUR QUERY HERE"
Connection parameters notwithstanding. Essentially the command you'd run by hand in bash, as a parameter to the shell module. In fact, it would be interesting to know if the query succeeds when run by hand too.
It sounds like I would need to refactor the whole query instead of just dropping it in, because I need to account for connection parameters.
I can say that we use the same query on smaller playbooks, with a lot less data to pass, and we never have this issue. There are also other playbooks, as mentioned before, that connect and write to this DB; none have issues.
In that case, I would not think the issue lies within the module itself. Maybe you can tweak the playbook concurrency, or maybe the issue lies with the concurrent connections in RDS. I've had issues with connections not being closed properly (which I could verify we're handling properly).
Ultimately, if the dataset is too big, it could be split into smaller batches, but I don't know if that would make too big of a difference; maybe it would if a small delay is introduced between batches.
@Andersson007 @anthonyxpalermo OK, I took a quick cursory look at the code and it seems that we're not explicitly closing the connections. And while the GC should take care of those objects upon module completion, it's not the first time I've seen the server hanging on to stale connections in these kind of scenarios. In the case of a large dataset like @anthonyxpalermo is describing, it stands to reason that the server could run out of its socket allocation and so the task stalls waiting for one of the stale sockets to time out and become available.
I can start a PR to explicitly close the connections on all modules in this collection, as it seems none of them does.
@anthonyxpalermo You might have to wait a few days for this, and I cannot guarantee that it will fix your issue.
@Jorge-Rodriguez ok, will wait for the PR then (I think no changelog fragment is needed while we don't know if it helps or not, or maybe we can put an unrelated fragment, i.e. about refactoring).
In that case, I would not think the issue lies within the module itself. Maybe you can tweak the playbook concurrency, or maybe the issue lies with the concurrent connections in RDS. I've had issues with connections not being closed properly (which I could verify we're handling properly).
Ultimately, if the dataset is too big, it could be split into smaller batches, but I don't know if that would make too big of a difference; maybe it would if a small delay is introduced between batches.
Jorge, I appreciate the input, but I believe this is an issue with the module's code and how it handles connections. Scaling to a bigger number of requests just exposes the issue, rendering it visible.
@Andersson007 @anthonyxpalermo OK, I took a quick cursory look at the code and it seems that we're not explicitly closing the connections. And while the GC should take care of those objects upon module completion, it's not the first time I've seen the server hanging on to stale connections in these kind of scenarios. In the case of a large dataset like @anthonyxpalermo is describing, it stands to reason that the server could run out of its socket allocation and so the task stalls waiting for one of the stale sockets to time out and become available.
I can start a PR to explicitly close the connections on all modules in this collection, as it seems none of them does.
@anthonyxpalermo You might have to wait a few days for this, and I cannot guarantee that it will fix your issue.
Thank you! I will stand by so that we can review once the PR is completed. I would add that I tend to agree with what you are stating, especially because when using the module's connect_timeout it still hangs and does not move forward.
@anthonyxpalermo Check out #44. If you can apply those changes locally and tell us whether that helps with your issue it would be greatly appreciated.
@anthonyxpalermo also, I just read in the docs: "For example, a task may take longer to complete than the SSH session allows for, causing a timeout."
Could you please check the SSH-related settings on your target machine?
https://www.tecmint.com/increase-ssh-connection-timeout/#:~:text=On%20the%20server%2C%20head%20over,%2Fssh%2Fsshd_config%20configuration%20file.&text=The%20ClientAliveInterval%20parameter%20specifies%20the,to%20keep%20the%20connection%20alive.
I'm not an expert in such things, but https://docs.ansible.com/ansible/devel/user_guide/playbooks_async.html#run-tasks-concurrently-poll-0 might also be helpful.
If I read @anthonyxpalermo's logs correctly, his target is localhost; not sure if he's using the local connection too.
If I read @anthonyxpalermo's logs correctly, his target is localhost; not sure if he's using the local connection too.
I have tried with connection: local on and off, with the same results.
@anthonyxpalermo Check out #44. If you can apply those changes locally and tell us whether that helps with your issue it would be greatly appreciated.
Running the playbook with community.mysql #44.
If I read @anthonyxpalermo's logs correctly, his target is localhost; not sure if he's using the local connection too.
I have tried with connection: local on and off, with the same results.
This really looks like a connection listener exhaustion issue. Awaiting news from the run with #44.
If I read @anthonyxpalermo's logs correctly, his target is localhost; not sure if he's using the local connection too.
I have tried with connection: local on and off, with the same results.
This really looks like a connection listener exhaustion issue. Awaiting news from the run with #44.
Sorry to say, but even with #44 the issue persists.
TASK [aws-acm : Insert {{hostname_item}} and {{ports_item}} Records to our DataBase] **
It just waits here; I will rerun in debug mode to show where it stops.
Let me check in the meanwhile whether the code is doing what it is meant to be doing.
@anthonyxpalermo I can't see anything obviously wrong with the code. Can you connect to the database while it's running and post the output of
show status like "%conn%";
In my local tests I cannot see more than a single concurrent connection at any given time, with or without #44.
@anthonyxpalermo one more question. You mentioned your database is in RDS; what are you using exactly: MySQL, MariaDB, or Aurora with MySQL compatibility?
@anthonyxpalermo one more question. You mentioned your database is in RDS; what are you using exactly: MySQL, MariaDB, or Aurora with MySQL compatibility?
MariaDB
@anthonyxpalermo one more thing that I don't understand. The task that you posted inserts one single row into the database. How exactly do you run through the 4000 entries that you mentioned?
@anthonyxpalermo one more thing that I don't understand. The task that you posted inserts one single row into the database. How exactly do you run through the 4000 entries that you mentioned?
One at a time. I get the info I need via set_fact, and that ends up in the query call I posted at the beginning of the post.
@anthonyxpalermo could you post the playbook, redacted as needed?
On an additional note, you said you're using Ansible 2.10.1, which is not released yet. Any reason you're using a development version?
@anthonyxpalermo I'd be curious as to how exactly you're running this. I tried to replicate your setup and run the task in a 4000-iteration loop with async, and at no point did I have more than 10 active threads in RDS.
@anthonyxpalermo could you post the playbook, redacted as needed?
Hi Jorge, I have not forgotten this, it's just been a busy deadline day!
@anthonyxpalermo I'd be curious as to how exactly you're running this. I tried to replicate your setup and run the task in a 4000-iteration loop with async, and at no point did I have more than 10 active threads in RDS.
Although there are 4000+ items, each index of the loop runs synchronously, so for each item looped we first write to the DB and then move to the next. We don't collect all the data and write asynchronously.
So you're running a sequential loop on the 4000+ items. That's essentially what I did myself and I can't reproduce. I'd suggest you check your active threads in your Mariadb instance, the process list and the slow query log. I would suspect a side interaction if there are other queries going on in the database.
With the current info I cannot pinpoint anything wrong with the module's code
So you're running a sequential loop on the 4000+ items. That's essentially what I did myself and I can't reproduce. I'd suggest you check your active threads in your Mariadb instance, the process list and the slow query log. I would suspect a side interaction if there are other queries going on in the database.
With the current info I cannot pinpoint anything wrong with the module's code
Jorge, a couple of things. We don't believe this has anything to do with MariaDB, and even if it did, the module still needs to be able to cut out and reply or move on. That is not to say the issue couldn't be MariaDB, but in any case the module has to be able to cut out when there are no replies from the DB. Agree?
@anthonyxpalermo I mostly agree with you, but not necessarily in all cases. If the module does send the query and is stalled waiting for a response that never comes, it's not quite up to the module to arbitrarily give up, although there could be a task timeout option.
You now use the connect_timeout option, but that's not quite the same. I think we're looking at a scenario where the connection works but we're not getting a response from the server, the reason for which is still unknown. In this case it would not be desirable for the module to give up arbitrarily, since we can't make assumptions about how long a given query should take.
That being said, you can replicate a task timeout (wait for a result for up to n time units) by using the async and poll options.
Ultimately, I can't find anything substantially wrong with the module that could be causing your issue, and the fact that your error rate is <0.1% seems to suggest that the module works generally well. I feel inclined to believe that you're experiencing a situational issue with your database rather than a bug in the module. I would suggest you try the 'async' and 'poll' workaround.
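For reference, a rough sketch of the async/poll workaround applied to a task like the one in this issue. The 30/5 values are illustrative; async is the overall time budget in seconds and poll is how often Ansible checks for completion:

- name: "Insert registered results to our MariaDB (async/poll sketch)"
  community.mysql.mysql_query:
    login_db: "{{ db_prod }}"
    login_port: "{{ db_port_prod }}"
    login_host: "{{ db_host_prod }}"
    login_user: "{{ db_user_prod }}"
    login_password: "{{ db_password_prod }}"
    query:
      - INSERT INTO etl_in_aws_acm_certs_last_mile ( {{ acm_to_db.keys() | join(', ') }} ) VALUES ( {{ acm_to_db.values() | map('regex_replace', '^(.*)$', "'\1'") | join(', ') }} )
    single_transaction: yes
  # fail the task if the insert has not finished within ~30 seconds
  async: 30
  poll: 5
  register: insert_result
  ignore_errors: true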
@Andersson007 can you reopen this? I would like to hear from @anthonyxpalermo first.
Oh, sorry, it was closed automatically by the merge of @Jorge-Rodriguez's PR.
@Jorge-Rodriguez In Ansible 2.10.x there is now the ability to add a timeout value to the task. It can be applied elsewhere, but for us using it in the task has helped the overall playbook move on without waiting forever for that close response. In my opinion it's a workaround for something the module should handle on its own, and it also forces other community members to move to Ansible 2.10. So far it's the ONLY solution to this issue. I will be deploying this to RHEL 7.x and possibly 8 to see if there is a platform dependency, but I don't have a timeline yet. Of course the async and poll options need to be tested too, but FYI we are not looking to move our current code base to those options.
- name: "Insert {{hostname_item}} and {{ports_item}} Records to our DataBase"
- name: "Insert registered results to our MariaDB"
timeout: 10
community.mysql.mysql_query:
login_db: "{{db_prod}}"
login_port: "{{db_port_prod }}"
login_host: "{{db_host_prod }}"
login_user: "{{db_user_prod }}"
login_password: "{{ db_password_prod }}"
# connect_timeout: 5
query:
- INSERT INTO etl_in_aws_acm_certs_last_mile ( {{ acm_to_db.keys() | join(', ') }} ) VALUES ( {{ acm_to_db.values() | map('regex_replace', '^(.*)$', "'\1'") | join(', ') }} )
single_transaction: yes
ignore_unreachable: true
ignore_errors: true
@anthonyxpalermo the async and poll options have the same effect as the timeout option; this has been the way to deal with task timeouts up until Ansible 2.10, so this issue is not forcing anybody to upgrade.
As I said before, we can't make the module arbitrarily give up because, while that would solve your problem, it would cause the converse problem for people using the module for long-running queries, only to have the module give up halfway through. Adding a timeout option to the module makes little sense since Ansible itself has mechanisms for that.
Ultimately, needing a task to time out because of a slow query (since we've ruled out connection issues) is a DBA problem, not an automation one.
Honestly, I don't see what improvement you'd expect to see on this module.
@Andersson007 since Ansible provides functionality for timeouts, I don't think it's this collection's responsibility to replicate that. In my opinion this is not a bug and the issue can be closed.
As a best effort, we could implement read_timeout and write_timeout options.
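If those options were added, usage might look like the sketch below. Both read_timeout and write_timeout are hypothetical module options here (proposed in this thread, not implemented at the time); the names mirror parameters the underlying PyMySQL driver already accepts.

- name: Query with driver-level timeouts (hypothetical options, sketch only)
  community.mysql.mysql_query:
    login_db: "{{ db_prod }}"
    login_host: "{{ db_host_prod }}"
    login_user: "{{ db_user_prod }}"
    login_password: "{{ db_password_prod }}"
    # hypothetical: abort if the server stops responding mid-result
    read_timeout: 30
    # hypothetical: abort if sending the statement stalls
    write_timeout: 30
    query:
      - SELECT 1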
SUMMARY
ISSUE TYPE
COMPONENT NAME
mysql_query
ANSIBLE VERSION
CONFIGURATION
OS / ENVIRONMENT
MacOS, local connection
STEPS TO REPRODUCE
EXPECTED RESULTS
I expect that the task completes and moves to the next task
ACTUAL RESULTS
Task hangs and never moves on, even if the mysql_query option connect_timeout: 5 is used.