cloudstateio / cloudstate

Distributed State Management for Serverless
https://cloudstate.io
Apache License 2.0

Upgrade to akka-persistence-spanner 1.0.0-RC5 #512

Closed hseeberger closed 3 years ago

hseeberger commented 3 years ago

See https://github.com/lightbend/cloudstatemachine/issues/2534 for details.

sleipnir commented 3 years ago

Following good OSS development practices and common sense, there should be an open description here. Otherwise, this should be opened in a private LB repository instead.

hseeberger commented 3 years ago

@sleipnir sorry for that!

Actually, everything should be covered in the open here: https://github.com/akka/akka-persistence-spanner/issues/99

TL;DR: a bug in akka-persistence-spanner, fixed in the RC5 release.

sleipnir commented 3 years ago

@hseeberger Thank you very much for the clarifications

hseeberger commented 3 years ago

@sleipnir, you're welcome. BTW, the PR does not pass CI and I cannot see how this should be related to my tiny change. If you have some time to take a look, that would be highly appreciated.

sleipnir commented 3 years ago

It looks like some layers of the 'cloudstateio/cloudstate-go-tck:latest' docker image were not downloaded in time.

sleipnir commented 3 years ago

@hseeberger To prevent this type of error ([info] java.lang.AssertionError: timeout 60 seconds expired), I suggest doubling the current test timeout.
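To illustrate, a minimal sketch of the kind of change I mean, assuming the 60-second limit comes from a duration constant somewhere in the TCK test sources (the object and value names below are hypothetical):

```scala
import scala.concurrent.duration._

object TckTimeouts {
  // Hypothetical stand-in for wherever the TCK defines its 60-second expectation limit;
  // doubling it, as suggested, gives slow CI environments more headroom.
  val expectTimeout: FiniteDuration = 120.seconds // previously 60.seconds
}
```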

hseeberger commented 3 years ago

Thanks, @sleipnir. I am unsure if I should increase the timeout here in this innocent little PR.

marcellanz commented 3 years ago

> Thanks, @sleipnir. I am unsure if I should increase the timeout here in this innocent little PR.

Obviously, this change is orthogonal to what has been happening with the TCK for some time now. Checking the last couple of PRs shows the same issue over and over again, with changes unrelated to the failure in the CI run. As @sleipnir explained, the possible reasons for the CI run breaking have nothing to do with this PR.

> Thanks, @sleipnir. I am unsure if I should increase the timeout here in this innocent little PR.

The suggestion by @sleipnir to increase a timeout is a general improvement, and certainly not something to do in this PR, but rather by opening another one and trying it out there.

I took some time, like @sleipnir did, to investigate this a bit:

Looking at the PR CI runs of the last few days, many failed like https://travis-ci.com/github/cloudstateio/cloudstate/jobs/473092960#L1211 while one, in between, succeeded: https://travis-ci.com/github/cloudstateio/cloudstate/jobs/471982238#L1316

Together with a short test in a devcontainer, where the TCK runs with no issues, I assume there is no functional issue here. That makes the timeout suggestion by @sleipnir a good one to try, though I'd prefer the option of not racing against a timeout but pulling the images before running the TCK.
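Roughly, what I have in mind would look something like this in .travis.yml (a sketch only; the pre-pull step and its placement are assumptions, and only the cloudstate-go-tck image is confirmed by the log above):

```yaml
# Hypothetical fragment: pull the TCK image before the build starts,
# so the TCK run itself does not race slow layer downloads against its timeout.
# Other language TCK images would be added here the same way.
before_script:
  - docker pull cloudstateio/cloudstate-go-tck:latest
```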

What is not clear to me is that, despite a few re-runs of the Travis job, it was always the filesystem layers of the same image (cloudstate-go-tck) that were retried, which I think is no coincidence.


sleipnir commented 3 years ago

Thank you @marcellanz for your time analyzing this innocent TCK problem.

I opened an issue to address this, where I suggested, as you also suggested here, downloading the images locally before the tests run. Unfortunately I don't have time to put together a PR right now, but I think we have left all the necessary information here on which way to go to solve this problem, right!? I believe this should resolve the issue without having to change the test timeouts, which should make other Cloudstate contributors more comfortable with their PRs.

Thank you @marcellanz once again for your detailed analysis.

sleipnir commented 3 years ago

@marcellanz About the layers that always undergo retries: this seems to me to be a problem with the Docker Hub registry; these layers are probably common to many repositories.

marcellanz commented 3 years ago

@sleipnir Thanks, see my comment in your issue, https://github.com/cloudstateio/cloudstate/issues/514#issuecomment-763605809; we can track this there. The CI run for this PR went green after a few retries.

marcellanz commented 3 years ago

@sleipnir Additionally, I have seen the TCK time out in a similar way when a language TCK did not respond properly to the initial discovery call from the proxy (with all expected entities). This happened while implementing one of those language TCK images and might be a bug; I'll check if this is still the case. That said, the main cloudstate branch can always have failing TCK runs in CI, as the :latest tags potentially bring in incompatibilities when the main project and the language supports diverge. Currently the nightly TCK runs are no longer running, as Travis decided to stop running them due to free-time-credit issues. We might run them elsewhere.

sleipnir commented 3 years ago

Great job @marcellanz

marcellanz commented 3 years ago

I did nothing 🤷‍♂️ I only pushed CSS buttons 😆

pvlugter commented 3 years ago

I'll close this issue now, since #513 is merged and there's #514 for the docker pull timeout.

hseeberger commented 3 years ago

Thanks, @pvlugter, I forgot to close it 😬