Closed levice14 closed 9 years ago
@Bonczidai did you see "session is down" in the logs?
no, not this time
Another case where "session is down" appears; see log: http://codeviewer.org/view/code:4fd0
tasks:
- loop_docker_images_maintenance
- - - check_diskspace
- - - - - validate_linux_machine_ssh_access
- - - - - check_disk_space
- - - - - check_availability
- - - clear_unused_docker_images
- - - - - clear_docker_images
- - - - - - - validate_linux_machine_ssh_access
- - - - - - - get_all_images
- - - - - - - get_used_images
- - - - - - - - - validate_linux_machine_ssh_access_op
- - - - - - - - - get_used_images
- - - - - - - subtract_used_images
- - - - - - - verify_all_images_list_not_empty
- - - - - - - verify_used_images_list_not_empty
- - - - - - - get_parent_images
- - - - - - - - - validate_linux_machine_ssh_access
- - - - - - - - - inspect_image
- - - - - - - - - get_parent
- - - - - - - - - get_parent_name
- - - - - - - substract_parent_images
- - - - - - - delete_images
Command failed java.lang.RuntimeException: Slang Error : Error running: 'clear_unused_docker_images': Error binding output: 'total_amount',
Error is: Error in running script expression or variable reference, for expression: 'amount_of_images_deleted + amount_of_dangling_images_deleted',
Script exception is: javax.script.ScriptException: TypeError: unsupported operand type(s) for +: 'int' and 'NoneType' in <script> at line number 1
CALL_ARGUMENTS={privateKeyFile=/root/cloudslang-coreos, port=22, closeSession=false, username=core, arguments=, host=188.166.102.7, characterSet=UTF-8, pty=false, command= , password=, timeout=6000000}, PATH=0/2/0/2/0/1/0/3/0/1/0/0, TIMESTAMP=Wed Apr 22 07:54:31 EDT 2015, TYPE=EVENT_ACTION_START} 2015-04-22 07:54:32:562 628858 [WorkerExecutionThread-1_101600008] INFO io.cloudslang.lang.cli.SlangCLI - Event received: EVENT_ACTION_END Data is: {EXECUTIONID=101600008, DESCRIPTION=Action performed, PATH=0/2/0/2/0/1/0/3/0/1/0/0, RETURN_VALUES={returnCode=-1, returnResult=, STDERR=, STDOUT=}, TIMESTAMP=Wed Apr 22 07:54:32 EDT 2015, TYPE=EVENT_ACTION_END} {EXECUTIONID=101600008, DESCRIPTION=Action performed, PATH=0/2/0/2/0/1/0/3/0/1/0/0, RETURN_VALUES={returnCode=-1, returnResult=, STDERR=, STDOUT=}
@Bonczidai I told you the timeout needs to be higher. Otherwise the SSHActionCommand fails with no notice. When the step fails, the flow tries to bind whatever outputs it has, and the ones it needs do not exist because of the failure. I tried with timeout 90000000 and it worked for me. I'm not 100% sure whether it's the timeout, but it's definitely a content issue.
@Bonczidai for the one you posted here: http://codeviewer.org/view/code:4fd0
RETURN_VALUES={returnCode=-1, returnResult=, STDERR=, STDOUT=Untagged: busybox:latest Deleted: 4986bf8c15363d1c5d15512d5266f8777bfba4974ac56e3270e7760f6f0a8125 Deleted: ea13149945cb6b1e746bf28032f02e9b5a793523481a0a18645fc77ad53c4ea2}
Same issue. Timeout is 6000000, flow reaches SSHActionCommand, it actually removes images, but returnCode=-1 so step fails => parent flow can't bind outputs anymore.
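To make the mismatch concrete, here is a minimal Python sketch that counts the `Deleted:` entries in the STDOUT quoted above. The helper name `count_deleted_images` is hypothetical and not part of the content; it only illustrates that the command did its work even though returnCode was -1.

```python
def count_deleted_images(stdout):
    """Count 'Deleted:' markers in the STDOUT of the image removal command.

    Hypothetical helper, assuming the output format seen in the log above.
    """
    return sum(1 for token in stdout.split() if token == "Deleted:")


# STDOUT copied from the RETURN_VALUES in the comment above.
stdout = (
    "Untagged: busybox:latest "
    "Deleted: 4986bf8c15363d1c5d15512d5266f8777bfba4974ac56e3270e7760f6f0a8125 "
    "Deleted: ea13149945cb6b1e746bf28032f02e9b5a793523481a0a18645fc77ad53c4ea2"
)
deleted = count_deleted_images(stdout)
```

For this log, `deleted` is 2, while the step still reports returnCode=-1.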
It seems it failed with the higher timeout as well; still the SSH though.
What do you mean by "still the SSH though"? Is it the "session is down" issue?
@meirwah from what I observed, the SSH sometimes fails without the "session is down" error. In the comment I wrote above you can see that the STDOUT contains the deleted docker images, while returnCode=-1 fails the flow. So SSH can fail in three ways: with the "session is down" message, with no error message at all but a valid STDOUT, or with no information besides returnCode=-1. The thing is, every time @Bonczidai posted a flow failure, I looked over the logs and the SSHCommandAction had returnCode=-1 while everything else was valid.
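The three failure modes described above could be told apart roughly like this. This is only a sketch: `classify_ssh_result` and its labels are hypothetical, and it assumes the RETURN_VALUES are available as a dict with the `returnCode` and `STDOUT` keys shown in the logs, plus the surrounding log text as a string.

```python
def classify_ssh_result(return_values, log_text=""):
    """Label an SSH action result using the fields seen in the logs.

    Hypothetical helper; key names follow the RETURN_VALUES in this thread.
    """
    if str(return_values.get("returnCode", "")) == "0":
        return "success"
    if "session is down" in log_text.lower():
        return "session_down"
    if return_values.get("STDOUT", "").strip():
        # The command produced output, yet the action reported failure.
        return "failed_with_valid_stdout"
    return "failed_without_information"


# Example: the shape of the result posted earlier in this thread.
result = classify_ssh_result(
    {"returnCode": "-1", "returnResult": "", "STDERR": "",
     "STDOUT": "Untagged: busybox:latest"}
)
```

Logging such a label next to the flow failure might make it easier to see which mode a sporadic run hit.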
@tudorlesan, please try to debug the Java action and/or consult the content team about this.
I think I understand the issue. In the flow "get_used_images_flow", the flow fails because of the SSH connection (validate_linux_machine_ssh_access step, on -1). It then tries to resolve the parent flow (clear_docker_images_flow) output (amount_of_images_deleted): '0 if 'images_list_safe_to_delete' in locals() and images_list_safe_to_delete == '' else amount_of_images'
But amount_of_images does not exist, since we never reached the task that deletes the images. We need to either verify in the expression that amount_of_images is in locals, or look for these outputs only if result=0.
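A minimal Python sketch of the first option (checking that the variable exists before using it). The function names and the `scope` dict are hypothetical stand-ins for the variables visible when the outputs are bound; the real fix would live in the flow's output expressions. It also assumes the amounts are numbers, which may not hold if the flow passes them as strings.

```python
def amount_of_images_deleted(scope):
    """Guarded version of the output expression quoted above.

    `scope` is a hypothetical dict of the variables available at binding time.
    """
    if scope.get("images_list_safe_to_delete") == "":
        return 0
    # Fall back to 0 when the deleting task never ran and the name is missing.
    return scope.get("amount_of_images", 0)


def total_amount(deleted, dangling_deleted):
    """Sum the two counters, treating a missing (None) counter as 0."""
    return (deleted or 0) + (dangling_deleted or 0)


# A failed run: the SSH step returned -1 before amount_of_images was set.
partial = amount_of_images_deleted({})
total = total_amount(partial, None)
```

This only avoids the binding error (the `'int' and 'NoneType'` TypeError); it does not make the SSH step succeed, and a total of 0 could be mistaken for a clean run with nothing to delete.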
This happens in more places (not only in "get_used_images_flow"), depending on where the SSH fails, as I wrote above. I realized that these errors can be avoided with more validation. But the problem is that the flow will fail either way: the images will not be deleted because SSH fails. And without those errors, it might be unclear why the flow actually failed.
Either way, unless I find the problem with the SSH, we will not be sure how many runs of the flow will actually do what we expect it to do.
Not reproducible
sporadic failure is back:
CLI output:
part of the log file: