Server start fails due to Elasticsearch not ready yet?

SEPIA-Framework / sepia-docs

Documentation and Wiki for SEPIA. Please post your questions and bug-reports here in the issues section! Thank you :-)

https://sepia-framework.github.io/

Server start fails due to Elasticsearch not ready yet? #131

Closed fquirin closed 2 years ago

fquirin commented 2 years ago

Describe the bug

When running inside a Docker container, the server startup fails with the assist-server complaining about account validation (core accounts). An additional wait after the Elasticsearch start seems to fix this. It seems Elasticsearch isn't 100% ready yet.

SEPIA client and server versions

SEPIA-Server version: SEPIA-Home v2.6.0

Solution: Maybe we can improve the wait script that checks if Elasticsearch is ready.

fquirin commented 2 years ago

@BieJay93 can you try this modification to your run-sepia.sh instead of the extended sleep please?

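# Block until Elasticsearch reports at least "yellow" (or until the 30s timeout expires), then count matching status lines
es_yellow_or_green=$(curl --silent -XGET 'http://localhost:20724/_cluster/health?pretty=true&wait_for_status=yellow&timeout=30s' | grep -cE "status.*(green|yellow)")
if [ "$es_yellow_or_green" -eq 1 ]; then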
    echo 'Status YELLOW or GREEN: true'
else
    echo 'Status RED or unknown! Abort.'
    exit 1
fi

BieJay93 commented 2 years ago

Sure...I created a new docker container with this modification and it's starting up. Status of ES is yellow. But the first curl-response takes around 30s. I also added a second curl of this URL, and it's responding immediately. So it seems ES really needs some more time to start up completely.

fquirin commented 2 years ago

Hey, thanks for testing.

Are you saying this curl --silent -XGET 'http://localhost:20724/_cluster/health?pretty=true&wait_for_status=yellow&timeout=30s' is blocking for almost 30s? If that's the case, then it seems to do exactly what it's intended to, but I should increase the max timeout to 60s to be safe :thinking:.

I always thought the call that follows in the run-sepia.sh would ensure that ES is ok (it reads the user index mapping), but it seems this can return successfully without actually being in "yellow" or "green" state. Maybe it is still reading files from the shared folder in this state.
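A readiness check along these lines could also retry instead of relying on one long request. The following is only a minimal sketch, not the actual run-sepia.sh code: the names `ES_URL`, `MAX_TRIES`, `es_status_ok` and `wait_for_es`, as well as the retry counts and timeouts, are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: poll the Elasticsearch cluster health until it is yellow or green.
# ES_URL, MAX_TRIES and both function names are illustrative assumptions.

ES_URL="${ES_URL:-http://localhost:20724}"
MAX_TRIES="${MAX_TRIES:-6}"

# Reads a cluster health JSON response on stdin.
# Succeeds only if the reported status is green or yellow.
es_status_ok() {
    grep -qE '"status"[[:space:]]*:[[:space:]]*"(green|yellow)"'
}

# Lets ES wait up to 10s per request and retries up to MAX_TRIES times.
# A refused connection (ES not listening yet) counts as a failed attempt.
wait_for_es() {
    try=1
    while [ "$try" -le "$MAX_TRIES" ]; do
        if curl --silent --max-time 15 \
            "$ES_URL/_cluster/health?wait_for_status=yellow&timeout=10s" | es_status_ok; then
            echo "Elasticsearch ready after $try attempt(s)"
            return 0
        fi
        try=$((try + 1))
        sleep 2
    done
    echo "Elasticsearch not ready after $MAX_TRIES attempts" >&2
    return 1
}
```

A start script would then call `wait_for_es || exit 1` before launching the SEPIA servers. Retrying covers the case where the HTTP port is not open yet, which a single request cannot wait for.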

Btw did you see the notes on Docker containers, especially the part about virtual memory? It ensures that ES will run stable and might increase performance as well.

BieJay93 commented 2 years ago

Yup, that's right. Not exactly 30s but around 20-30sec, it differs a bit from run to run. Okay perfect, the only disadvantage of this method is, that the startup is taking some more time. But I think that's acceptable.

And yes, vm.max_map_count is set. In my two days of struggling around with this problem I really tried everything 😅
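For anyone checking the same setting: the Elasticsearch documentation asks for a `vm.max_map_count` of at least 262144 on the host. A small sketch of such a check, where the helper name `map_count_ok` is made up for this example:

```shell
#!/bin/sh
# Minimum vm.max_map_count recommended by the Elasticsearch documentation.
REQUIRED_MAP_COUNT=262144

# Hypothetical helper: succeeds if the given value meets the minimum.
map_count_ok() {
    [ "$1" -ge "$REQUIRED_MAP_COUNT" ]
}

# Example usage on a Linux host (left commented out so the sketch has no side effects):
# map_count_ok "$(cat /proc/sys/vm/max_map_count)" || echo "vm.max_map_count is too low" >&2
```

The value has to be set on the Docker host, not inside the container, because the container shares the host kernel.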

fquirin commented 2 years ago

> Okay perfect, the only disadvantage of this method is, that the startup is taking some more time

That's the weird part somehow. The cluster health request should return immediately when the status becomes yellow or green ... or throw an error if it takes longer than '&timeout=30s'. I'll do some additional testing tomorrow.

> In my two days of struggling around with this problem I really tried everything

What host machine do you use? I've heard some people complaining about Docker on Mac, for example, because the shared file system can be extremely slow :-|

BieJay93 commented 2 years ago

Okay, that's strange, because I also tested &timeout=3, without any errors..

I'm using Debian as host machine. With an Intel i7 6700 and 48GB RAM, so performance shouldn't be the problem.

fquirin commented 2 years ago

> Okay, that's strange, because I also tested &timeout=3, without any errors..

If you used exactly timeout=3, then it might have recognized the value as invalid and used the default (=30s). The value has to be given together with a unit: 3000ms, 3s, 1m, 1h, ...
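To rule out a missing unit before sending the request, the value can be checked in the script itself. This is only a sketch: the helper name `is_valid_es_timeout` is hypothetical, and the unit list follows the time units named in the Elasticsearch documentation (d, h, m, s, ms, micros, nanos).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the value is a whole number followed by
# one of the time units documented for Elasticsearch.
is_valid_es_timeout() {
    printf '%s\n' "$1" | grep -qE '^[0-9]+(d|h|m|s|ms|micros|nanos)$'
}

# Example usage (commented out; TIMEOUT is an assumed variable name):
# TIMEOUT="3s"
# is_valid_es_timeout "$TIMEOUT" || { echo "timeout needs a unit, e.g. 3s" >&2; exit 1; }
```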

> With an Intel i7 6700 and 48GB RAM, so performance shouldn't be the problem.

That should indeed be pretty fast :rocket:

BieJay93 commented 2 years ago

New day, new try. I wasn't really sure if I used it with the unit, so I tested again. This time definitely with the unit. I tried 3s, 6s, 10s & 20s. No timeouts and no big differences in the startup time. Best value was 10s, with a response time of 15s. Maybe coincidence.

I hope it helps you.

fquirin commented 2 years ago

Thanks for the info. It's really confusing and totally not what the health check promises :sweat_smile:, but at least it seems to start up reliably now, right? I'll keep an eye on that and see how it behaves when I build the next test version.

BieJay93 commented 2 years ago

Yes, server is always starting up with this check. I'm already using this in my main sepia container. Many thanks :)

fquirin commented 2 years ago

The changes are integrated into SEPIA-Home v2.6.1 :slightly_smiling_face: - New Docker containers are not there yet but the old ones can simply be updated.