cloudstateio / cloudstate

Distributed State Management for Serverless
https://cloudstate.io
Apache License 2.0

Add local cache images layer for docker images on build #514

Closed sleipnir closed 3 years ago

sleipnir commented 3 years ago

See discussion here https://github.com/cloudstateio/cloudstate/issues/512

Proposals to resolve this issue:

  1. Add calls to docker pull in the Travis script before running the tests via sbt.

  2. Increase the TCK test timeouts.
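As a rough illustration of the first proposal, the following is a minimal sketch of a pre-pull step that a CI script could run before invoking sbt. The function name prefetch_images, the PULL_CMD override, and any image names passed to it are assumptions made for illustration; none of them come from the Cloudstate build.

```shell
#!/bin/sh
# Sketch only: pull every image up front so the later test run finds it on local disk.
# PULL_CMD is a hypothetical override (default: "docker pull") so the loop can be
# dry-run without a Docker daemon, e.g. PULL_CMD="echo would pull".
PULL_CMD="${PULL_CMD:-docker pull}"

prefetch_images() {
  for image in "$@"; do
    echo "prefetching $image"
    # Intentionally unquoted so "docker pull" splits into command + argument.
    $PULL_CMD "$image" || {
      echo "failed to pull $image" >&2
      return 1
    }
  done
}
```

A CI script would call something like prefetch_images example/image-a:latest example/image-b:latest (placeholder names) and only then start the tests, so any slow layer download happens outside the test timeouts.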

marcellanz commented 3 years ago

See also: https://github.com/cloudstateio/cloudstate/issues/512#issuecomment-763311736

marcellanz commented 3 years ago

@sleipnir CI just happened to work, by accident, after I restarted the jobs twice during European daytime: https://travis-ci.com/github/cloudstateio/cloudstate/jobs/473092960#L1322

I agree that the fs-layer timeouts probably happen because some layers are fetched very frequently. One I saw multiple times is 4f4fb700ef54. How can one find the origin of a layer identified by its hash? docker inspect/history does not reveal all of them by hash, although a simple Google search for this hash turns up a good number of references in log file snippets from various unrelated places that pull it. I saw some Go-related projects. It might come from one of its base images, golang and/or alpine.

marcellanz commented 3 years ago

@sleipnir I'm surprised by the persistence of this timeout, exactly for this image. Also, the Travis documentation states: https://docs.travis-ci.com/user/caching/#things-not-to-cache

I'm not sure how to proceed with that.

sleipnir commented 3 years ago

Hi @marcellanz I think the key is here:

"Docker images are not cached, because we provide a new virtual machine for each build."

And this is exactly what we need. Let me explain:

When a job is launched, Travis creates a virtual machine, installs everything it needs, and runs the tasks defined in the CI file.

In our case, those tasks run the tests, and it is the tests that download the images. The download counts toward the test execution time itself, which leads to timeout errors.

We are not interested, yet, in speeding up build times. Our primary interest is that builds run without errors that are unrelated to the tests themselves. That said, what we need Travis to do is:

When a job is launched, Travis creates a virtual machine, installs everything it needs, and runs the tasks defined in the CI file. One of these tasks would be to download the images to the virtual machine before running the tests via sbt. The key here is that the tests get the images from the local disk instead of the network. If you look at the logs of the jobs that failed, you will see that in the end all layers of the images are downloaded successfully; unfortunately, this happens after the test has already been aborted with a timeout error. Bringing the images to disk before running the tests (no matter how long that takes) will solve the problem. It is not exactly a cache that we need, which is why this documentation is more confusing than helpful.

I think that would be it. What do you think?

pvlugter commented 3 years ago

We should be able to just change the command to sh -c "docker pull ... && docker run ..." so that it all stays in the TCK configuration and the pull runs before it starts waiting with the timeout. But I'll look at adding support for running a preparation command, which the TCK waits on, and doing the pull first automatically for Docker images. I'll add this in the TCK so we don't have to change Docker images in multiple places.
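The preparation-command idea can be sketched roughly as follows. This is not the TCK implementation, which is not shown in this thread. It is a minimal shell illustration under the assumption that the time limit should cover only the main command. The function name run_with_prepare and its arguments are hypothetical, and it relies on the coreutils timeout utility being available.

```shell
#!/bin/sh
# Sketch only: run a preparation command to completion, then start the main
# command under a time limit. Function and argument names are hypothetical.
run_with_prepare() {
  limit="$1"    # seconds allowed for the main command
  prepare="$2"  # e.g. an image pull; waited on with no time limit
  main="$3"     # e.g. starting the container; subject to the limit

  # Stop early if preparation fails; the main command is never started.
  sh -c "$prepare" || return 1
  # `timeout` is the coreutils utility; it returns non-zero if the limit is hit.
  timeout "$limit" sh -c "$main"
}
```

For example, run_with_prepare 60 "docker pull example/image" "docker run example/image" (placeholder image name) would let the pull take as long as it needs and apply the 60-second limit only to the run step.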