Build appears to succeed but is reported as a failure

clippermadness commented 6 years ago

https://app.shippable.com/bitbucket/thetalake/portal/runs/1099/1/console https://app.shippable.com/download/jobConsoles?jobId=5ae27395a74b0e0800aa7a07

This project was using a specific build image from Shippable and was succeeding:

pre_ci_boot:
  image_name: drydock/u16ruball
  image_tag: v6.1.4

When I removed that section so that my builds would run faster on my node's default image (v6.3.4), the build started failing.

But every step of the build that I can see succeeded. What's happening?

manishas commented 6 years ago

@clippermadness we're investigating this.

clippermadness commented 6 years ago

@manishas Any update on this?

trriplejay commented 6 years ago

@clippermadness no update yet. it seems like something is happening in our script handler that is causing it to exit without marking successful, even though all steps have succeeded. How often does this happen? is it pretty regular?

next step might be for us to try to analyze the node right after this occurs, so maybe you could update here the next time you see this happen?

clippermadness commented 6 years ago

@trriplejay This happens every time in this project if I remove the pre_ci_boot section of shippable.yml. It is not intermittent.

clippermadness commented 6 years ago

@manishas @trriplejay ping :)

trriplejay commented 6 years ago

We haven't been able to find the root cause yet, but we did release v6.4.4. You could try changing your runtime version to this and see if it avoids the error.

The other workaround for now could be to just change your runtime version back to 6.1.4, since you know that version works. Then you at least wouldn't have to wait to pull the build image.

I notice that you're using ruby version 2.4.1. This version used to be available directly in our older images, but our 6.1.4 through 6.4.4 images have 2.4.3 instead, so if you were to specify this in your yml, you could avoid the ruby download/install that is happening now in your build. Perhaps this change would also avoid the strange failures. If you specifically need 2.4.1, then you could consider changing your runtime version to 5.8.2, which has ruby 2.4.1 pre-installed.

let me know if any of these suggestions work for you. We haven't been able to reproduce the error ourselves yet, so it's hard to say when we'll have more information. Thanks for your patience!

clippermadness commented 6 years ago

Ok cool - I'll give those ideas a shot and see if they work.

clippermadness commented 6 years ago

Ok I followed these steps and it's still failing. The only difference that I can find between the logs of a build that succeeds and one that fails is as follows:

Successful with pre_ci_boot: https://app.shippable.com/bitbucket/thetalake/portal/runs/1162/1/console

Booting up CEXEC
Running CEXEC script
sudo docker rm -fv $CONTAINER_NAME
c.exec.portal.1160.1

Failure without: https://app.shippable.com/bitbucket/thetalake/portal/runs/1160/1/console

Booting up CEXEC
Running CEXEC script
ERROR:script_runner - script_runner:Command failed : ssh-agent bash -c 'ssh-add /tmp/ssh/00_sub;ssh-add /tmp/ssh/01_deploy; cd /root && /root/5c0461c2-d85c-486e-987a-3f9e129b2bd4.sh'
Exception Invalid or no script tags received
ERROR:script_runner - script_runner:Command failed : ssh-agent bash -c 'ssh-add /tmp/ssh/00_sub;ssh-add /tmp/ssh/01_deploy; cd /root && /root/5c0461c2-d85c-486e-987a-3f9e129b2bd4.sh'
Exception Invalid or no script tags received
sudo docker rm -fv $CONTAINER_NAME
c.exec.portal.1162.1

trriplejay commented 6 years ago

Have you tried setting your runtime version back to 6.1.4? Since that image version seems to work, that might be the best way to avoid pulling the image.

That error you mention is definitely related. Normally our script handler sets a flag once all commands have completed successfully to indicate the overall success of the job. That's the "script tag" that the error is referring to. For some reason, the tag isn't being set in this case, even though everything is working exactly as normal. I'm still unable to reproduce, but am continuing to investigate.
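The mechanism described above can be sketched in a few lines of bash. This is an illustration only, not Shippable's actual script handler: the function names `run_job` and `check_job` and the marker string `__JOB_SUCCESS__` are hypothetical. The idea is that the job's status is decided by a marker that is written only after every command has finished, so a missing marker yields a failure even when each individual command exited 0.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the function names and the marker string are
# assumptions, not taken from Shippable's script handler.

MARKER="__JOB_SUCCESS__"

# Run the "build steps" and emit the marker at the very end.
# Passing "skip" simulates a run where the marker is never written,
# even though every step succeeded.
run_job() {
  local mode="$1"
  echo "step 1 ok"
  echo "step 2 ok"
  if [ "$mode" != "skip" ]; then
    echo "$MARKER"
  fi
  return 0
}

# Decide the job status from the captured output alone.
check_job() {
  local output="$1"
  if printf '%s\n' "$output" | grep -qxF "$MARKER"; then
    echo "success"
  else
    echo "failed: invalid or no script tags received"
  fi
}

check_job "$(run_job normal)"
check_job "$(run_job skip)"
```

The second call reports a failure although `run_job` returned 0, which mirrors the symptom in this issue: the steps look fine in the console, but the overall job is marked as failed.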

clippermadness commented 6 years ago

A couple more notes on this.

Switching back to 6.1.4 definitely works.

I also tried changing the underlying node in our subscription. We have been using a 14.04 node, but I changed that to 16.04. That didn't work with either 6.4.4 or 6.3.4: same error.

So at this point, this project builds using the 16.04 6.4.4 node with a pre_ci_boot section in shippable.yml that pulls the older 6.1.4 image.

If I get rid of the pre_ci_boot section and attempt to build on the 6.4.4 image, the build always fails with the above description.

Bit-Doctor commented 6 years ago

While upgrading to runtime 6.5.4 we noticed the same issue. https://app.shippable.com/github/thestorefront/tsf-api/runs/5331/1/console The Console tab doesn't show any problem while downloaded logs print:

Booting up CEXEC
Running CEXEC script
ERROR:script_runner - script_runner:Command failed : ssh-agent bash -c 'ssh-add /tmp/ssh/00_sub;ssh-add /tmp/ssh/01_deploy; cd /root && /root/f056c1bb-3dc4-4a47-8e97-63bf3415f385.sh'
Exception Invalid or no script tags received
ERROR:script_runner - script_runner:Command failed : ssh-agent bash -c 'ssh-add /tmp/ssh/00_sub;ssh-add /tmp/ssh/01_deploy; cd /root && /root/f056c1bb-3dc4-4a47-8e97-63bf3415f385.sh'
Exception Invalid or no script tags received
ERROR:script_runner - script_runner:Command failed : ssh-agent bash -c 'ssh-add /tmp/ssh/00_sub;ssh-add /tmp/ssh/01_deploy; cd /root && /root/c2349b56-a431-4fe6-9cbf-53f367adc770.sh'
Exception Invalid or no script tags received
ERROR:script_runner - script_runner:Command failed : ssh-agent bash -c 'ssh-add /tmp/ssh/00_sub;ssh-add /tmp/ssh/01_deploy; cd /root && /root/c2349b56-a431-4fe6-9cbf-53f367adc770.sh'
Exception Invalid or no script tags received

Also, when bumping to this version I had to add apt-get install libcurl4-openssl-dev in order to get libcurl. Another thing I noticed is that when rebuilding failed runs, we get an empty git_sync step and no build_ci. https://app.shippable.com/github/thestorefront/tsf-api/runs/5335/1/console

~/src/github.com/thestorefront/tsf-api ~
fatal: Not a git repository (or any of the parent directories): .git

We have cache enabled, and after resetting it every step runs properly.

aurelien-reeves commented 6 years ago

We have an issue with the cache too, but in our case nothing appears to succeed. The build_ci step is not even executed; the process fails at the "git_sync" step with the message "this is not a git repo".

manishas commented 6 years ago

We’re working on fixing this. @rageshkrishna @ric03uec can look into the issue reported for git sync and build_ci not being executed...

manishas commented 6 years ago

Ping @ric03uec

ric03uec commented 6 years ago

@clippermadness @Bit-Doctor this has been fixed and will be available in the next release sometime early next week. This error was caused by an underlying bug in rvm (https://github.com/rvm/rvm/issues/4416) that was closed recently. The bug was resetting the bash traps in a few of the Shippable scripts that are essential for their successful execution. Without the traps, the cleanup functions were not getting called, which resulted in failed builds without any actual errors.

We still haven't been able to test the rvm fix successfully (https://github.com/rvm/rvm/issues/4416#issuecomment-408830405), so we've added some custom logic to get around this issue, which should fix the builds that are failing for you.
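For readers who want to see the failure mode in isolation, here is a minimal bash sketch. Everything in it is hypothetical: `misbehaving_tool` merely stands in for the rvm call that cleared the traps, and the `rearm` mode shows one plausible kind of workaround (installing the trap again afterwards), not the custom logic that was actually shipped, which this thread does not describe.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: function names and the marker are assumptions.

# Simulates a build script whose EXIT trap writes the success marker
# during cleanup. Mode "reset" clears the trap mid-build; mode "rearm"
# clears it and then installs it again.
build_script() {
  local mode="$1"
  bash -c '
    cleanup() { echo "__JOB_SUCCESS__"; }
    install_traps() { trap cleanup EXIT; }
    misbehaving_tool() { trap - EXIT; }   # stands in for the buggy rvm call

    install_traps
    echo "bundle install ok"
    case "$1" in
      reset) misbehaving_tool ;;
      rearm) misbehaving_tool; install_traps ;;
    esac
    echo "tests ok"
  ' _ "$mode"
}

# Report the job status based only on whether the marker was written.
report() {
  if build_script "$1" | grep -qxF "__JOB_SUCCESS__"; then
    echo "$1: success"
  else
    echo "$1: failed"
  fi
}

report normal
report reset
report rearm
```

In `reset` mode every command in the build exits 0, yet the cleanup function never runs and the job is reported as failed, which matches the behaviour seen in this issue. Re-installing the trap after the offending call restores the expected result.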

I'll keep this issue open till we do the release and you can verify everything is good at your end.

clippermadness commented 6 years ago

Fix verified using Shippable base image 6.7.4, ruby 2.4.1 and rails 5.2.0. Build times now 6m faster without having to pull the old image. Thanks!

ric03uec commented 6 years ago

this is now fixed, closing

Shippable / support

Build appears to succeed but is reported as a failure #4297