galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes

Startup time for galaxy web container can take very long #391

Open pcm32 opened 1 year ago

pcm32 commented 1 year ago

First, let me say how great this setup has become; you have done a fantastic job, guys!

On a more or less unmodified setup, I often observe that the Galaxy web, job and workflow containers can take quite a while (more than about 8 minutes) to start. Looking at the logs, I often see them stuck in a couple of places:

sqlitedict INFO 2022-12-01 18:35:34,188 [pN:main,p:8,tN:MainThread] opening Sqlite table 'unnamed' in '/galaxy/server/database/cache/tool_cache/cache.sqlite'
galaxy.tools.data DEBUG 2022-12-01 18:42:43,645 [pN:main,p:8,tN:MainThread] Loaded tool data table 'q2view_display' from file '/galaxy/server/config/mutable/shed_tool_data_table_conf.xml'

to the point where the container is killed, I presume because it runs out of the time allowed by the probes.

Would startup improve if fewer tools were installed? Or could this be an issue with a slow shared file system? Or should the container be given more RAM? Do you see this as well? Thanks!
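In case the RAM angle matters, this is roughly how I would try giving the web handler more memory via a values override; the webHandlers.resources keys here are an assumption on my part and would need to be checked against the chart's values.yaml:

# values-override.yaml -- illustrative only; key names assumed, verify against the chart
webHandlers:
  resources:
    requests:
      memory: 4Gi
    limits:
      memory: 8Gi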

afgane commented 1 year ago

This is an unfortunate known issue, although I just now realized we never documented it in this repo. And while we are experiencing the same slow-start scenario, we've not figured out the root cause yet. One hypothesis is that s3fs or S3-CSI is doing a GET for each reference file (or a cache validation), and hence startup takes so long. We've tried mounting the ref data bucket with additional mount options to see if that might affect Galaxy startup performance, but it has not. Without doing a deeper dive into s3fs, I don't think we'll get to the bottom of this. An alternative to s3fs is to go back to CVMFS-CSI, but that means we'll have to update it to work with K8s 1.21+, which means we're basically taking on maintenance of that project because it seems unmaintained upstream.
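For reference, s3fs does expose metadata-caching options along these lines; this is only an illustrative invocation (bucket name and mount path are placeholders), and in our tests extra options made no observable difference:

s3fs <refdata-bucket> /mnt/refdata \
  -o ro \
  -o use_cache=/tmp/s3fs-cache \
  -o stat_cache_expire=3600 \
  -o enable_noobj_cache \
  -o parallel_count=20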

This could possibly be partially relieved within Galaxy startup, because it seems Galaxy is waiting on something from the reference folder (possibly inspecting all the .loc files). A question is whether it has to do this during startup. We've created a bucket that contains only Galaxy's .loc files to test this hypothesis, but there seems to be a bug in the Helm chart that we haven't tracked down yet, and it is preventing us from actually testing this.

If you have any other ideas or make any discoveries, please share.

pcm32 commented 1 year ago

Do you see it stopping at the same places as I do, or does it vary for you? Thanks for the insights @afgane.

pcm32 commented 1 year ago

You are using s3fs for tools and reference data, I'm guessing? Anything else?

pcm32 commented 1 year ago

The job and workflow pods normally make it after one or two restarts, but today the web pod is not getting through even after 4 or 5 restarts. Do you alter the current readiness/liveness probe defaults to allow web to get through? I think it is being killed every 8 minutes or so.
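What I have in mind is loosening the probes with a values override along these lines; the exact keys (webHandlers.startupProbe / readinessProbe here) are a guess and would need to be checked against the chart's values.yaml:

# values-override.yaml -- illustrative only; probe key names assumed
webHandlers:
  startupProbe:
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 40    # allow ~20 minutes before the pod is killed
  readinessProbe:
    periodSeconds: 30
    failureThreshold: 10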

pcm32 commented 1 year ago

Seen from the inside, the web container seems completely idle, with not even any I/O waits:

top - 21:46:07 up 15 days,  6:33,  0 users,  load average: 1.59, 1.29, 1.15
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.0 us,  1.5 sy,  0.3 ni, 97.1 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  16008.3 total,   1846.8 free,   2509.2 used,  11652.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  13136.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 galaxy    20   0    2380    576    504 S   0.0   0.0   0:00.04 tini
      7 galaxy    20   0    2484    592    512 S   0.0   0.0   0:00.00 sh
      8 galaxy    20   0  903180 243924  51296 D   0.0   1.5   0:05.64 gunicorn
     22 galaxy    20   0    7164   3972   3368 S   0.0   0.0   0:00.03 bash
     31 galaxy    20   0    9984   3728   3176 R   0.0   0.0   0:00.00 top
pcm32 commented 1 year ago

There is always at least one web, job or workflow container that systematically fails to satisfy its probes and eventually gets killed with exit code 143 (with nothing in the Galaxy logs showing an error). If I keep doing helm upgrades on the running instance, then as the older pods go away, the intermediate one (which was trying to make it through) succeeds, but one of the newer-revision pods starts to struggle instead. Could it be a race for some resource?
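(The 143 is what shows up in the container's last termination state; something like the following surfaces it, with the pod, namespace and container names as placeholders:)

kubectl describe pod <galaxy-web-pod> -n <namespace>
# or just the last termination state of the web container:
kubectl get pod <galaxy-web-pod> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[?(@.name=="<galaxy-web-container>")].lastState.terminated}'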

pcm32 commented 1 year ago

Have you tried changing the S3 endpoint to something geographically closer? I see it is set by default to Asia Pacific (https://github.com/galaxyproject/galaxy-helm/blob/e5d9830f407239f0864c8ae8d467914a4e7a0c28/galaxy/values.yaml#L324). Can I use just any regional endpoint, or is the data available only there?
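(For anyone checking: the bucket's home region can be confirmed with the AWS CLI before pointing the mount at another endpoint; the bucket name is a placeholder here.)

aws s3api get-bucket-location --bucket <refdata-bucket-name>
# the mount endpoint/region has to match whatever this returns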

pcm32 commented 1 year ago

Also, when bringing it down, the CSI S3 PV doesn't come down cleanly; it stays in Terminating for a very long time:

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM                                                     STORAGECLASS           REASON   AGE
galaxy-dev-refdata-gxy-pv                  30Gi       ROX            Delete           Terminating   default/galaxy-dev-refdata-gxy-data-pvc                   refdata-gxy-data                15d
pcm32 commented 1 year ago

So with eu-west-2 in the mount options and the secret, I get CrashLoopBackOff on the three Galaxy containers, so I guess the data is not available there.

afgane commented 1 year ago

The bucket with reference data is available only in ap-southeast, so it won't work with any other endpoint. And it's only the ref data that is being fetched from that bucket; tool definitions come from a different (Google) bucket (via an Action defined here: https://github.com/anvilproject/cvmfs-cloud-clone). Re. data locality: we've launched instances in the same region as the data, but with no observable difference in startup time.

A reason the S3 CSI volume may not be coming down on delete is the finalizers added, by Helm I believe, to the PVC. If you remove those, it will come down.
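Roughly like this, using the names from your output above (adjust the namespace as needed):

kubectl patch pvc galaxy-dev-refdata-gxy-data-pvc -n default \
  --type=merge -p '{"metadata":{"finalizers":null}}'
# and if the PV itself stays in Terminating:
kubectl patch pv galaxy-dev-refdata-gxy-pv \
  --type=merge -p '{"metadata":{"finalizers":null}}'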

And re. the places where the Galaxy pods pause: it's the same spot you captured. What you describe is what we've experienced as well: after about the 3rd restart, all the Galaxy pods come up.

ksuderman commented 1 year ago

The job and workflow pods normally make it after one or two restarts, but today the web pod is not getting through even after 4 or 5 restarts. Do you alter the current readiness/liveness probe defaults to allow web to get through? I think it is being killed every 8 minutes or so.

Are you still seeing this? One or two restarts is expected (not ideal, but we see the same thing), but something else is going wrong if the web handler doesn't come up after a few restarts. Can you tell if the initContainers have completed? ~8 minutes sounds about right for the readiness probe to time out.
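Something like this will list the init container states (pod name and namespace are placeholders); each should report a terminated state with exit code 0:

kubectl get pod <galaxy-web-pod> -n <namespace> \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'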

pcm32 commented 1 year ago

I left it all night and the web container kept restarting. Then I turned off s3csi (I don't really need reference data on that instance). I will give it another go and report back.

pcm32 commented 1 year ago

But yes, the init containers had finished. The only error I could find was the 143 exit code visible in the container section of kubectl describe, but I suspect that comes from the SIGTERM that the probe must be triggering. Thanks for taking the time to go through this, guys.

pcm32 commented 1 year ago

I'm seeing the need for 2 restarts on an installation (and upgrades) that has no S3 CSI usage... always stopping on the sqlite reads. So maybe this is being caused by something else?

sqlitedict INFO 2022-12-05 14:10:09,371 [pN:main,p:9,tN:MainThread] opening Sqlite table 'unnamed' in '/galaxy/server/database/cache/tool_cache/cache.sqlite'

Could it be something performance-related from opening SQLite databases on a shared file system? Are those used concurrently by all the web, job and workflow handlers? Or could we put them on local filesystems individually instead?

nuwang commented 1 year ago

@pcm32 You're using NFS Ganesha, right? To test this hypothesis, could you try mapping in a local host mount for /galaxy/server/database/cache/tool_cache/ using extraVolumes and extraVolumeMounts?

pcm32 commented 1 year ago

Sure, I will give it a try and report back. Yes, I'm using NFS Ganesha.

nuwang commented 1 year ago

@pcm32 This PR: https://github.com/galaxyproject/galaxy-helm/pull/396 should solve the container startup speed issue, but it's awaiting a merge of: https://github.com/CloudVE/galaxy-cvmfs-csi-helm/pull/16

pcm32 commented 1 year ago

@pcm32 You're using NFS Ganesha, right? To test this hypothesis, could you try mapping in a local host mount for /galaxy/server/database/cache/tool_cache/ using extraVolumes and extraVolumeMounts?

I noticed just now that I haven't tried this yet. It would require a different mount per container though (web, workflow and job at least). Can this be expressed in extraVolumeMounts? So far I have only seen shared file systems used there.

nuwang commented 1 year ago

Yes, it should be possible to use an emptyDir volume to create a container-local mount.
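A minimal sketch of what that could look like as a values override, assuming the chart's extraVolumes/extraVolumeMounts get applied to each handler deployment (each pod then gets its own empty tool_cache directory rather than the shared one):

# values-override.yaml -- illustrative only
extraVolumes:
  - name: tool-cache
    emptyDir: {}
extraVolumeMounts:
  - name: tool-cache
    mountPath: /galaxy/server/database/cache/tool_cache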