Open nijel opened 3 years ago
Good point... I can see the main pod has a volume attached, so indeed this wouldn't scale if we increase the replicas unless the volume is mounted ReadWriteMany, so it seems the right way would indeed be to use a StatefulSet with each pod having its own volume.
However, when I look at what's persisted in this volume, it seems to be essentially the static result of the compilation step at startup... In that case, this volume doesn't really need to be persisted; it could very well be an emptyDir (just so it survives container crashes)...?
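For illustration, roughly something like this (a minimal sketch, not the chart's actual template; I'm assuming the compiled assets live under /app/cache):

```yaml
# Minimal sketch: keep the compiled static assets in an emptyDir so they
# survive container restarts within the pod without needing a persistent volume.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: weblate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: weblate
  template:
    metadata:
      labels:
        app: weblate
    spec:
      containers:
        - name: weblate
          image: weblate/weblate
          volumeMounts:
            - name: static-cache
              mountPath: /app/cache   # assumed location of the compiled statics
      volumes:
        - name: static-cache
          emptyDir: {}
```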
I think the only files that really need persistence are the secret file and the ssh directory... but even then, they seem to be essentially read-only after an initial setup. This means they could be set up once with a pre-install Helm hook, and then mounted as a single ReadOnlyMany volume by all pods of the same Deployment (see the sketch below)...
In general I think we should try to use Deployments as much as possible over StatefulSets, as every new volume has a cost...
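To sketch that second idea (names and paths here are placeholders, not the chart's actual values; the claim would be populated once, e.g. by a pre-install hook Job):

```yaml
# A claim shared read-only by every replica of the Deployment.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: weblate-ssh
spec:
  accessModes:
    - ReadOnlyMany            # mountable by many pods, read-only
  resources:
    requests:
      storage: 16Mi
---
apiVersion: v1
kind: Pod                     # stands in for the Deployment's pod template
metadata:
  name: weblate-example
spec:
  containers:
    - name: weblate
      image: weblate/weblate
      volumeMounts:
        - name: ssh
          mountPath: /app/data/ssh   # hypothetical location of the ssh directory
          readOnly: true
  volumes:
    - name: ssh
      persistentVolumeClaim:
        claimName: weblate-ssh
        readOnly: true
```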
It really contains data which is supposed to persist: user-uploaded content (screenshots, fonts) or VCS data; see https://docs.weblate.org/en/latest/admin/config.html#data-dir for documentation. Some more insight into what is stored there is also available in https://github.com/WeblateOrg/weblate/issues/2984#issuecomment-526252515.
Like @nijel pointed out above, the data directory does hold user-generated data, so a StatefulSet would be more than appropriate. However, I'm not sure if Weblate can run multiple instances while maintaining consistency in user-generated data right now. In other words, would it correctly replicate all screenshots? Would each instance correctly synchronize all commits in a timely manner? I don't think we're quite there yet, but please correct me if I'm wrong, @nijel. That's the main reason why this transition is on hold, I'd say.
Yes, the filesystem has to be kept in sync across Weblate instances.
Ah yes, of course... indeed, in this case, if replication is expected, switching to a StatefulSet will not be enough unless the application manages it.
I would suggest simply mentioning that scaling (setting more than 1 replica) is only possible with ReadWriteMany volumes. I see this accessMode and the storage class are already configurable in values.yaml. I'm not sure that using more than 1 replica would be a very common use case anyway, as one instance should probably already be able to sustain a fair workload...
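For illustration, something along these lines (the exact key names are assumptions on my part; the chart's own values.yaml is authoritative):

```yaml
# Hypothetical values.yaml for running more than one replica.
replicaCount: 2

persistence:
  enabled: true
  accessMode: ReadWriteMany   # required once replicaCount > 1
  storageClass: efs-sc        # example of an RWX-capable storage class
  size: 10Gi
```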
RWM volumes tend to be more expensive, so in this case we might want to limit them to the strict minimum of files that indeed have to be replicated. Auto-generated statics probably shouldn't belong there (?).
Hi, we tested running multiple replicas but ran into an issue: the CSS file is not always found (depending on which container the traffic is directed to). I suppose it could be worked around with session affinity, but it seems to be related to these CSS files being located in /app/cache/static/CACHE/css, which is not synced between the containers like the /app/data directory is.
You should run the same version in all replicas; otherwise things will break. /app/cache/static/CACHE/css is filled during container startup and does not need to be synced.
What version do you mean? AFAIK all versions are equal between containers.
FYI, the CSS file names are different between containers. When I run ls /app/cache/static/CACHE/css, I get e.g. an output like output.82205c8x9f76.css output.79f6539f66c2.css, which is different in both containers; restarting a container will again generate different names. Meanwhile, in the browser I get a 404 on the path "/static/CACHE/css/output.79f6539f66c2.css" if traffic is directed to the other container.
Hmm, I thought that django-compressor generates stable names. This should be fixed...
This particular issue should be addressed by https://github.com/WeblateOrg/weblate/commit/90fbea8a41019ccd992f44f87c8268d42494823a.
Thanks for the quick fix! Just to confirm, are you sure that it is safe to run multiple replicas (with RWX pv)? Nothing bad can happen with concurrent writes or file locks for instance?
Yes, it's safe. All file system accesses are lock-protected using Redis; no file locks are used for that.
Hi @nijel
Thanks for the fix.
Do you maybe have an ETA for when this commit will be part of a new release? Currently we cannot run Weblate in HA on EKS because of this issue. With an active ChaosKube that randomly kills pods, this is a nightmare. 😅
I've backported the patch to the Docker image in https://github.com/WeblateOrg/docker/commit/ec9086952ded6eb74394321086eefd89610f5a24; it will be available later today in the bleeding and edge tags.
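If you want to pick it up from the Helm chart before a tagged release, something like this should work (the key names are my assumption about the chart's values layout, please double-check against values.yaml):

```yaml
# Pull one of the rolling tags that carry the backported fix.
image:
  repository: weblate/weblate
  tag: edge            # or "bleeding"
  pullPolicy: Always   # these tags are moving targets, so always re-pull
```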
Is there any chance that this patch will also be released in a build with a version tag?
It was released in Weblate 4.18.2, so it's already there.
Sorry, I missed that. Awesome, thanks a lot!
FYI: I switched to the most recent Helm chart and Weblate version and it seems to work now. We can run multiple replicas without hitting this CSS bug anymore.
I saw the official Helm chart the other day and one thing stood out: it models Weblate as a stateless Deployment rather than a StatefulSet, which is what's meant for stateful services. As far as I know, Weblate is currently a stateful service and can't be scaled horizontally. We started using Weblate on Kubernetes way before the official Helm chart was released, and we first modelled it as a Deployment too, but upgrades were somewhat problematic: they kept failing when trying to re-attach the persistent disk to the newly spun-up container. We used the "Recreate" rollout strategy, but it would still fail; we then switched over to a StatefulSet and this issue has been gone ever since.
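Roughly what we switched to, as a sketch (sizes, storage class and labels below are placeholders, not our actual manifests): with volumeClaimTemplates each replica gets its own claim, so there is no fight over re-attaching a single disk during rollouts.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: weblate
spec:
  serviceName: weblate        # headless Service assumed to exist
  replicas: 1
  selector:
    matchLabels:
      app: weblate
  template:
    metadata:
      labels:
        app: weblate
    spec:
      containers:
        - name: weblate
          image: weblate/weblate
          volumeMounts:
            - name: data
              mountPath: /app/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]   # one claim per replica
        resources:
          requests:
            storage: 10Gi
```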
Anyway, the idea is: should we remodel Weblate as a StatefulSet? Is there any specific reason why we're using the Deployment object? I'm assuming that you've already considered it and that there are some reasons I might not have thought of.
Originally posted by @mareksuscak in https://github.com/WeblateOrg/weblate/discussions/4806