JRaviLab / molevolvr2.0

WIP new molevolvr app
https://molevolvr.netlify.app/

SLURM Integration #11

Closed · falquaddoomi closed this 1 month ago

falquaddoomi commented 1 month ago

This PR builds on #5, with two main objectives:

In addition to the main objectives, this PR includes a few tweaks to API response handling and the testing framework. A new field, reason, has been added to the analyses table to capture why a status was set. Currently it gets set to an exception traceback if the analysis throws an error.

The PR includes a skeleton of how analyses would be processed in backend/api/dispatch/submit.R (called via dispatchAnalysis() in backend/api/endpoints/analyses.R), but it doesn't actually do any real work yet.
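
At the shell level, the dispatch this skeleton is building toward amounts to something like the following (a hypothetical sketch: the job name, paths, and worker script below are made up, and the real logic lives in submit.R):

```shell
# hypothetical illustration of dispatching one analysis to SLURM; nothing
# below is actual repo code
ANALYSIS_ID="abc123"
sbatch \
  --job-name "analysis-${ANALYSIS_ID}" \
  --output "/opt/results/${ANALYSIS_ID}/slurm-%j.log" \
  --wrap "Rscript /app/process_analysis.R ${ANALYSIS_ID}"
```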

Things to try:

netlify[bot] commented 1 month ago

Deploy Preview for molevolvr ready!

Latest commit: df876b2f1aefebe19b6547b960fa99a34a056d4a
Latest deploy log: https://app.netlify.com/sites/molevolvr/deploys/66feb1f3bfca46000865fbcf
Deploy Preview: https://deploy-preview-11--molevolvr.netlify.app

vincerubinetti commented 1 month ago

Is this a good time/place to rename /app to /frontend, or should I make a new PR for that? Can we make this change soon, before I continue work on the frontend, to avoid merge-conflict difficulties?

vincerubinetti commented 1 month ago

While looking at the run_stack.sh script, I went down an unnecessary rabbit hole of tweaking the comments and logic. I had some thoughts for improvement:

Please peruse the following as inspiration for changes:

run_stack re-arrange

```shell
#!/usr/bin/env bash
set -e

# script to conveniently launch parts or all of the molevolvr stack in various ways
# run like:
#   ./run_stack.sh [TARGET ENV] [COMPOSE ARGS]
#   TARGET ENV = target environment
#   COMPOSE ARGS = arguments passed through to docker compose

# available envs:
#   prod
#     production env.
#     for running full stack, exactly as it will be deployed.
#     runs everything, including X, Y, Z.
#     most resource-intensive.
#   dev
#     development env.
#     for local development and testing.
#     runs X, Y, Z.
#   app
#     app development env.
#     mostly for frontend development, where you don't need to submit/query jobs.
#     runs frontend and backend, no job scheduler or database.

# envs differ in a variety of ways, including:
# - which services they run (e.g. prod runs 'nginx', but dev doesn't)
# - cores and memory constraints applied to SLURM containers (in envs
#   where job scheduler is enabled)
# - what external resources they mount as volumes into container. e.g. every env
#   mounts a different job results folder, but envs that process jobs use same
#   blast and iprscan folders, since they're gigantic

# these differences are implemented by using different docker compose files.
# the "root" docker-compose.yml is used first, followed by other compose files
# to start whatever services are needed for the env.
# learn more:
# https://docs.docker.com/compose/multiple-compose-files/merge/

# initialize variables to defaults

# target environment
TARGET_ENV=dev
# whether to build docker images from scratch before starting stack
BUILD_IMAGES=true
# base docker compose command
COMPOSE_CMD=""
# docker compose arguments
COMPOSE_ARGS="up -d logs -f"
# command to run after stack has launched
POST_LAUNCH_CMD=open_frontend

# read .env file and export its contents as current env vars
set -a
source .env
set +a

# check if first script arg is a valid env
if [[ $1 =~ ^(prod|dev|app)$ ]]; then
  TARGET_ENV=$1
  # consume first script arg, making second script arg (compose args) first
  shift
fi

echo "Selected target env: ${TARGET_ENV}"

case ${TARGET_ENV} in
  "prod")
    COMPOSE_CMD="docker compose -f docker-compose.yml -f docker-compose.slurm.yml -f docker-compose.prod.yml"
    ;;
  "dev")
    COMPOSE_CMD="docker compose -f docker-compose.yml -f docker-compose.slurm.yml -f docker-compose.override.yml"
    ;;
  "app")
    COMPOSE_CMD="docker compose -f docker-compose.yml -f docker-compose.override.yml"
    ;;
  *)
    echo "ERROR: Unknown target env: ${TARGET_ENV}"
    exit 1
    ;;
esac

if [ "$1" == "shell" ]; then
  COMPOSE_ARGS="exec backend /bin/bash"
  POST_LAUNCH_CMD=""
# get docker compose args from script args
elif [[ $1 =~ .+ ]]; then
  COMPOSE_ARGS="$@"
  POST_LAUNCH_CMD=""
fi

# export vars so docker compose can see them, so it can e.g. namespace hosts to their env
export TARGET_ENV=${TARGET_ENV}

if [ "${BUILD_IMAGES}" == "true" ]; then
  echo "Building images"
  ${COMPOSE_CMD} build
fi

# func to open frontend in browser tab, cross-platform
function open_frontend() {
  # URL to open when we invoke browser
  FRONTEND_URL=${FRONTEND_URL:-"http://localhost:5713"}
  if [[ "$OSTYPE" == "linux-gnu"* ]]; then
    xdg-open $FRONTEND_URL
  elif [[ "$OSTYPE" == "darwin"* ]]; then
    open $FRONTEND_URL
  elif [[ "$OSTYPE" == "msys" || "$OSTYPE" == "win32" ]]; then
    explorer $FRONTEND_URL
  else
    echo "WARNING: Unsupported OS: $OSTYPE, unable to open browser"
  fi
}

# run commands
echo "Running ${COMPOSE_CMD} ${COMPOSE_ARGS}"
${COMPOSE_CMD} ${COMPOSE_ARGS}
echo "Running ${POST_LAUNCH_CMD}"
${POST_LAUNCH_CMD}
```
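
For reference, the invocations described in the header comments look like:

```shell
# example invocations, following the usage notes in the script above
./run_stack.sh                  # dev env with default compose args
./run_stack.sh prod             # full production-like stack
./run_stack.sh dev shell        # open a bash shell in the backend container
./run_stack.sh app up --build   # remaining args pass straight to docker compose
```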

Caveats:

You may also consider splitting up and/or naming the compose files differently. E.g., it's not clear (just from the name) which services "override" will run, or what scenario it applies to. Is it possible/prudent to have a single compose file for each service (slurm, caddy, frontend, db, backend, etc.), then a compose file for each env that calls whichever ones are needed? Something like the sketch below. Again, I'm speaking far outside my wheelhouse here; just something to consider.
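
For instance (a hypothetical sketch; these per-service file names don't exist in the repo):

```shell
# hypothetical layout: one compose file per service, stacked per environment;
# an "env" is then just a particular combination of -f flags
docker compose \
  -f compose/frontend.yml \
  -f compose/backend.yml \
  -f compose/db.yml \
  -f compose/slurm.yml \
  up -d
```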

vincerubinetti commented 1 month ago

Very minor thing: I noticed the frontend port is 5713, but Vite's default port is 5173 (which I just recently learned is leet-speak for "site"). Maybe change it to either match the default or be something completely different, like 8001.

falquaddoomi commented 1 month ago

Frankly, yes, I realize there's a lot about run_stack.sh that could be rewritten, and a lot of it is out of date; it kind of grew organically and I haven't had time to update or reorganize it. I agree that a lot of the comments can be cut or simplified, and I'll use your comments for inspiration. Regarding your code changes, there are quite a few things that I assume you simplified for stylistic reasons but that change the functionality of the code. Again, I'll use your changes for inspiration, but I'm going to leave in things that are 1) necessary for the stack's functionality, and 2) familiar to people who write shell scripts.

I do think it's a good idea to do a refactor, though, and perhaps now's the time for it. I'll add my changes to this PR, FYI.

> You may also consider splitting up and/or naming the compose files differently. E.g., it's not clear (just from the name) which services "override" will run, or what scenario it applies to. Is it possible/prudent to have a single compose file for each service (slurm, caddy, frontend, db, backend, etc.), then a compose file for each env that calls whichever ones are needed? Again, I'm speaking far outside my wheelhouse here; just something to consider.

Typically compose files are organized by environment; I'm diverging from that a bit with "slurm" not being an environment, but generally I'm following the standard practice of putting core services in a base docker-compose.yml and then using compose files to further specify the services for different environments. What you described about putting individual services in individual compose files and then "calling" them from environment files isn't something I've seen done.

Regarding "override", I agree that it's a confusing name, and I think it's a good idea to change it. It's named that because the compose file predates the run_stack.sh script, which explicitly specifies the compose files: when no compose file is specified, Docker Compose uses docker-compose.yml by default, plus docker-compose.override.yml if it's present, so the override file is often used as a "dev" compose file that, e.g., exposes ports for debugging. Since we expect people to use run_stack.sh (I assume?), we don't need to rely on that default behavior; I'll rename it docker-compose.dev.yml, since it's for the dev environment.
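
Concretely, here's the default behavior I'm referring to, and what selecting the dev stack explicitly looks like after the rename:

```shell
# with no -f flags, docker compose automatically merges docker-compose.yml
# with docker-compose.override.yml, if the latter is present:
docker compose up

# run_stack.sh passes the files explicitly instead, so after the rename the
# dev stack is selected like this:
docker compose -f docker-compose.yml -f docker-compose.slurm.yml -f docker-compose.dev.yml up
```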

Also, if you'd prefer not to see so many compose files in the repo root, I could put them in a subfolder, e.g. compose or compose-envs to get them out of the way.

vincerubinetti commented 1 month ago

All that sounds good to me. No need to hide the compose files in a folder.

Regarding the test failure, I'm looking into it. Somehow several of the latest actions were failing, and I didn't see that in past PRs.

vincerubinetti commented 1 month ago

Okay, I forgot that the tests for this repo include Firefox and Safari, but the "install playwright" action I made only installed Chromium.

You can either ignore the failing test (and I'll fix this in an upcoming PR), or you can copy the latest action from this issue comment into /actions/install-playwright/action.yaml.

falquaddoomi commented 1 month ago

@vincerubinetti FYI, I just submitted the build cache fixes. run_stack.sh will now first do a pull, which will take a little while the first time to download the images but should be nearly immediate on subsequent runs. It'll also run a build, which should exploit the layer cache that's now inlined in the images and should complete very quickly. If it doesn't, and it appears to be doing a full build rather than using the cache, let me know and I can look into it.
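
For anyone following along, this is the standard BuildKit inline-cache pattern (a rough sketch; the registry path and tag below are placeholders, and in the repo this is wired through docker compose rather than raw docker commands):

```shell
# publish an image with its layer-cache metadata inlined (done once, e.g. in CI)
docker build --build-arg BUILDKIT_INLINE_CACHE=1 -t registry.example.com/molevolvr/backend:latest .
docker push registry.example.com/molevolvr/backend:latest

# on a dev machine: the pull is slow the first time, then local builds can
# reuse the layers advertised by the pulled image instead of rebuilding them
docker pull registry.example.com/molevolvr/backend:latest
docker build --cache-from registry.example.com/molevolvr/backend:latest -t backend:latest .
```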

I made some minor changes to run_stack.sh, but not nearly all the things you mentioned; I do think they're good ideas, but this PR's getting a little overloaded IMHO. I think I'll save the run_stack.sh refactors for a future PR if that's ok with you.

vincerubinetti commented 1 month ago

Re-running it now, from a clean slate (Docker reset). It does appear to be building some stuff still. IIRC, you said somewhere that the Slurm part will still have to be built from scratch?

Here's where I'm at so far:

(base) Vincents-MacBook-Pro:molevolvr2.0 vincerubinetti$ ./run_stack.sh 
* Inferred target environment: dev (via DEFAULT_ENV)
* Pulling images for dev (tag: dev)
[+] Pulling 85/7
 ✔ db Skipped - Image is already being pulled by dev-db                                        0.0s 
 ✔ worker Skipped - Image is already being pulled by master                                    0.0s 
 ✔ master Pulled                                                                              66.2s 
 ✔ dev-db Pulled                                                                              10.4s 
 ✔ backend Pulled                                                                             83.7s 
 ✔ frontend Pulled                                                                            24.4s 
 ✔ accounting Pulled                                                                           7.1s 
* Building images for dev (tag: dev)
[+] Building 295.3s (40/54)                                                    docker:desktop-linux
 => => transferring context: 2B                                                                0.0s
 => [backend] importing cache manifest from us-central1-docker.pkg.dev/cuhealthai-foundations  0.0s
 => [backend internal] load build context                                                      0.0s
 => => transferring context: 47.60kB                                                           0.0s
 => CACHED [backend backend-base 2/7] RUN apt-get update && apt-get install -y ccache          0.0s
 => CACHED [backend backend-base 3/7] RUN apt-get update && apt-get install -y curl            0.0s
 => CACHED [backend backend-base 4/7] RUN mkdir -p /tmp/software/ &&     wget -L -O /tmp/soft  0.0s
 => CACHED [backend backend-base 5/7] RUN curl -sSf https://atlasgo.sh | sh                    0.0s
 => CACHED [backend backend-base 6/7] COPY ./docker/install.R /tmp/install.r                   0.0s
 => CACHED [backend backend-base 7/7] RUN   Rscript /tmp/install.r                             0.2s
 => [backend backend-slurm 1/7] RUN curl -L -o envsubst     "https://github.com/a8m/envsubst/  0.9s
 => [backend backend-slurm 2/7] RUN groupadd -g 981 munge     && useradd  -m -c "MUNGE Uid 'N  0.3s
 => [backend backend-slurm 3/7] RUN apt-get update                                             5.0s
 => [backend backend-slurm 4/7] RUN DEBIAN_FRONTEND=noninteractive apt-get install -y     mu  30.9s
 => [backend backend-slurm 5/7] RUN apt-get install -y wget gcc make bzip2     && cd /tmp    256.2s
 => => # libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -DNUMA_VERSION1_C
 => => # OMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-alias
 => => # ing -MT signal.lo -MD -MP -MF .deps/signal.Tpo -c signal.c  -fPIC -DPIC -o .libs/signal.o 
 => => # libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -DNUMA_VERSION1_C
 => => # OMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-alias
 => => # ing -MT signal.lo -MD -MP -MF .deps/signal.Tpo -c signal.c -o signal.o >/dev/null 2>&1    

EDIT more logs now that it's finished:

dev-db-1      | The files belonging to this database system will be owned by user "postgres".
dev-db-1      | This user must also own the server process.
dev-db-1      | 
dev-db-1      | The database cluster will be initialized with locale "en_US.utf8".
dev-db-1      | The default database encoding has accordingly been set to "UTF8".
dev-db-1      | The default text search configuration will be set to "english".
dev-db-1      | 
dev-db-1      | Data page checksums are disabled.
dev-db-1      | 
dev-db-1      | fixing permissions on existing directory /var/lib/postgresql/data ... ok
dev-db-1      | creating subdirectories ... ok
dev-db-1      | selecting dynamic shared memory implementation ... posix
dev-db-1      | selecting default max_connections ... 100
dev-db-1      | selecting default shared_buffers ... 128MB
dev-db-1      | selecting default time zone ... Etc/UTC
dev-db-1      | creating configuration files ... ok
dev-db-1      | running bootstrap script ... ok
dev-db-1      | performing post-bootstrap initialization ... ok
dev-db-1      | initdb: warning: enabling "trust" authentication for local connections
dev-db-1      | initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
dev-db-1      | syncing data to disk ... ok
dev-db-1      | 
dev-db-1      | 
dev-db-1      | Success. You can now start the database server using:
dev-db-1      | 
dev-db-1      |     pg_ctl -D /var/lib/postgresql/data -l logfile start
dev-db-1      | 
dev-db-1      | waiting for server to start....2024-10-03 15:58:40.254 UTC [48] LOG:  starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
dev-db-1      | 2024-10-03 15:58:40.255 UTC [48] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
dev-db-1      | 2024-10-03 15:58:40.262 UTC [51] LOG:  database system was shut down at 2024-10-03 15:58:39 UTC
dev-db-1      | 2024-10-03 15:58:40.265 UTC [48] LOG:  database system is ready to accept connections
dev-db-1      |  done
dev-db-1      | server started
dev-db-1      | CREATE DATABASE
dev-db-1      | 
dev-db-1      | 
dev-db-1      | /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
dev-db-1      | 
dev-db-1      | waiting for server to shut down....2024-10-03 15:58:40.454 UTC [48] LOG:  received fast shutdown request
dev-db-1      | 2024-10-03 15:58:40.455 UTC [48] LOG:  aborting any active transactions
dev-db-1      | 2024-10-03 15:58:40.457 UTC [48] LOG:  background worker "logical replication launcher" (PID 54) exited with exit code 1
dev-db-1      | 2024-10-03 15:58:40.457 UTC [49] LOG:  shutting down
dev-db-1      | 2024-10-03 15:58:40.458 UTC [49] LOG:  checkpoint starting: shutdown immediate
dev-db-1      | 2024-10-03 15:58:40.525 UTC [49] LOG:  checkpoint complete: wrote 922 buffers (5.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.013 s, sync=0.048 s, total=0.069 s; sync files=301, longest=0.008 s, average=0.001 s; distance=4255 kB, estimate=4255 kB; lsn=0/1912080, redo lsn=0/1912080
dev-db-1      | 2024-10-03 15:58:40.528 UTC [48] LOG:  database system is shut down
dev-db-1      |  done
dev-db-1      | server stopped
dev-db-1      | 
dev-db-1      | PostgreSQL init process complete; ready for start up.
dev-db-1      | 
dev-db-1      | 2024-10-03 15:58:40.570 UTC [1] LOG:  starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
dev-db-1      | 2024-10-03 15:58:40.570 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
dev-db-1      | 2024-10-03 15:58:40.570 UTC [1] LOG:  listening on IPv6 address "::", port 5432
dev-db-1      | 2024-10-03 15:58:40.572 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
dev-db-1      | 2024-10-03 15:58:40.575 UTC [64] LOG:  database system was shut down at 2024-10-03 15:58:40 UTC
dev-db-1      | 2024-10-03 15:58:40.578 UTC [1] LOG:  database system is ready to accept connections
db-1          | The files belonging to this database system will be owned by user "postgres".
db-1          | This user must also own the server process.
db-1          | 
db-1          | The database cluster will be initialized with locale "en_US.utf8".
db-1          | The default database encoding has accordingly been set to "UTF8".
db-1          | The default text search configuration will be set to "english".
db-1          | 
backend-1     | * Slurm enabled, configuring...
db-1          | Data page checksums are disabled.
db-1          | 
db-1          | fixing permissions on existing directory /var/lib/postgresql/data ... ok
db-1          | creating subdirectories ... ok
db-1          | selecting dynamic shared memory implementation ... posix
db-1          | selecting default max_connections ... 100
db-1          | selecting default shared_buffers ... 128MB
db-1          | selecting default time zone ... Etc/UTC
db-1          | creating configuration files ... ok
db-1          | running bootstrap script ... ok
db-1          | performing post-bootstrap initialization ... ok
db-1          | syncing data to disk ... ok
db-1          | 
db-1          | 
db-1          | Success. You can now start the database server using:
db-1          | 
db-1          |     pg_ctl -D /var/lib/postgresql/data -l logfile start
db-1          | initdb: warning: enabling "trust" authentication for local connections
db-1          | initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
db-1          | 
db-1          | waiting for server to start....2024-10-03 15:58:40.254 UTC [48] LOG:  starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
db-1          | 2024-10-03 15:58:40.255 UTC [48] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db-1          | 2024-10-03 15:58:40.261 UTC [51] LOG:  database system was shut down at 2024-10-03 15:58:39 UTC
db-1          | 2024-10-03 15:58:40.265 UTC [48] LOG:  database system is ready to accept connections
db-1          |  done
db-1          | server started
db-1          | CREATE DATABASE
db-1          | 
db-1          | 
db-1          | /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
db-1          | 
db-1          | waiting for server to shut down....2024-10-03 15:58:40.454 UTC [48] LOG:  received fast shutdown request
db-1          | 2024-10-03 15:58:40.455 UTC [48] LOG:  aborting any active transactions
db-1          | 2024-10-03 15:58:40.457 UTC [48] LOG:  background worker "logical replication launcher" (PID 54) exited with exit code 1
db-1          | 2024-10-03 15:58:40.458 UTC [49] LOG:  shutting down
db-1          | 2024-10-03 15:58:40.459 UTC [49] LOG:  checkpoint starting: shutdown immediate
db-1          | 2024-10-03 15:58:40.525 UTC [49] LOG:  checkpoint complete: wrote 922 buffers (5.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.013 s, sync=0.049 s, total=0.067 s; sync files=301, longest=0.005 s, average=0.001 s; distance=4255 kB, estimate=4255 kB; lsn=0/1912080, redo lsn=0/1912080
db-1          | 2024-10-03 15:58:40.528 UTC [48] LOG:  database system is shut down
db-1          |  done
db-1          | server stopped
db-1          | 
db-1          | PostgreSQL init process complete; ready for start up.
db-1          | 
db-1          | 2024-10-03 15:58:40.570 UTC [1] LOG:  starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
db-1          | 2024-10-03 15:58:40.570 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
db-1          | 2024-10-03 15:58:40.570 UTC [1] LOG:  listening on IPv6 address "::", port 5432
db-1          | 2024-10-03 15:58:40.572 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db-1          | 2024-10-03 15:58:40.574 UTC [64] LOG:  database system was shut down at 2024-10-03 15:58:40 UTC
db-1          | 2024-10-03 15:58:40.577 UTC [1] LOG:  database system is ready to accept connections
master-1      | total 12
master-1      | -rw-r--r-- 1 root root  271 Sep 25 14:55 cgroup.conf.template
master-1      | -rw-r--r-- 1 root root 3099 Sep 25 14:55 slurm.conf.template
master-1      | -rw-r--r-- 1 root root 1010 Sep 25 14:55 slurmdbd.conf.template
master-1      |  * Starting system message bus dbus
master-1      |    ...done.
master-1      |  * Starting MUNGE munged
master-1      |    ...done.
master-1      |  * Starting periodic command scheduler cron
master-1      |    ...done.
master-1      |  * Starting slurm-wlm database server interface
master-1      |    ...done.
master-1      | slurmdbd (pid 91) is running...
master-1      | slurmdbd is running.
master-1      |  * Starting slurm central management daemon slurmctld
master-1      |    ...done.
master-1      | slurmdbd (pid 91) is running...
master-1      | slurmdbd is running.
master-1      | slurmctld is not running. Checking again in 5 seconds...
master-1      |  * Starting slurm central management daemon slurmctld
master-1      |    ...done.
master-1      | slurmctld is not running. Checking again in 5 seconds...
master-1      |  * Starting slurm central management daemon slurmctld
master-1      |    ...done.
master-1      | slurmctld (pid 177) is running...
accounting-1  | 2024-10-03 15:58:39+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.5+maria~ubu2204 started.
accounting-1  | 2024-10-03 15:58:39+00:00 [Warn] [Entrypoint]: /sys/fs/cgroup///memory.pressure not writable, functionality unavailable to MariaDB
master-1      | slurmctld is running.
master-1      | 
master-1      | ===================================================================
master-1      | === Slurm setup complete! monitoring logs forever...
master-1      | ===================================================================
accounting-1  | 2024-10-03 15:58:39+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
accounting-1  | 2024-10-03 15:58:39+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.5+maria~ubu2204 started.
accounting-1  | 2024-10-03 15:58:39+00:00 [Note] [Entrypoint]: Initializing database files
accounting-1  | 2024-10-03 15:58:40+00:00 [Note] [Entrypoint]: Database files initialized
accounting-1  | 2024-10-03 15:58:40+00:00 [Note] [Entrypoint]: Starting temporary server
master-1      | 
accounting-1  | 2024-10-03 15:58:40+00:00 [Note] [Entrypoint]: Waiting for server startup
master-1      | [2024-10-03T15:58:47.278] error: Could not open trigger state file /var/spool/slurmctld/trigger_state: No such file or directory
master-1      | [2024-10-03T15:58:47.278] error: NOTE: Trying backup state save file. Triggers may be lost!
master-1      | [2024-10-03T15:58:47.278] No trigger state file (/var/spool/slurmctld/trigger_state.old) to recover
master-1      | [2024-10-03T15:58:47.278] read_slurm_conf: backup_controller not specified
master-1      | [2024-10-03T15:58:47.278] Reinitializing job accounting state
master-1      | [2024-10-03T15:58:47.279] accounting_storage/slurmdbd: acct_storage_p_flush_jobs_on_cluster: Ending any jobs in accounting that were running when controller went down on
accounting-1  | 2024-10-03 15:58:40 0 [Note] Starting MariaDB 11.2.5-MariaDB-ubu2204 source revision dced6cbdb6932738c3a0a1fb435f3f64cb63851a server_uid exbDbOtyxu6in/wvzCArLuMuNus= as process 92
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
master-1      | [2024-10-03T15:58:47.279] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
master-1      | [2024-10-03T15:58:47.279] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
master-1      | [2024-10-03T15:58:47.280] Running as primary controller
master-1      | [2024-10-03T15:58:47.285] error: No fed_mgr state file (/var/spool/slurmctld/fed_mgr_state) to recover
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: Number of transaction pools: 1
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: Using ARMv8 crc32 + pmull instructions
accounting-1  | 2024-10-03 15:58:40 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: Using liburing
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: Initializing buffer pool, total size = 1.000GiB, chunk size = 16.000MiB
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: Completed initialization of buffer pool
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: End of log at LSN=46300
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: Opened 3 undo tablespaces
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
accounting-1  | 2024-10-03 15:58:40 0 [Note] InnoDB: log sequence number 46300; transaction id 14
accounting-1  | 2024-10-03 15:58:40 0 [Note] Plugin 'FEEDBACK' is disabled.
accounting-1  | 2024-10-03 15:58:40 0 [Note] Plugin 'wsrep-provider' is disabled.
accounting-1  | 2024-10-03 15:58:40 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
accounting-1  | 2024-10-03 15:58:40 0 [Note] mariadbd: ready for connections.
accounting-1  | Version: '11.2.5-MariaDB-ubu2204'  socket: '/run/mysqld/mysqld.sock'  port: 0  mariadb.org binary distribution
accounting-1  | 2024-10-03 15:58:41+00:00 [Note] [Entrypoint]: Temporary server started.
accounting-1  | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Creating database slurm_acct_db
accounting-1  | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Creating user slurmdbd
accounting-1  | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Giving user slurmdbd access to schema slurm_acct_db
accounting-1  | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Securing system users (equivalent to running mysql_secure_installation)
accounting-1  | 
accounting-1  | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Stopping temporary server
accounting-1  | 2024-10-03 15:58:42 0 [Note] mariadbd (initiated by: unknown): Normal shutdown
accounting-1  | 2024-10-03 15:58:42 0 [Note] InnoDB: FTS optimize thread exiting.
accounting-1  | 2024-10-03 15:58:42 0 [Note] InnoDB: Starting shutdown...
accounting-1  | 2024-10-03 15:58:42 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool
accounting-1  | 2024-10-03 15:58:42 0 [Note] InnoDB: Buffer pool(s) dump completed at 241003 15:58:42
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Shutdown completed; log sequence number 47875; transaction id 15
accounting-1  | 2024-10-03 15:58:43 0 [Note] mariadbd: Shutdown complete
accounting-1  | 
accounting-1  | 2024-10-03 15:58:43+00:00 [Note] [Entrypoint]: Temporary server stopped
accounting-1  | 
accounting-1  | 2024-10-03 15:58:43+00:00 [Note] [Entrypoint]: MariaDB init process done. Ready for start up.
accounting-1  | 
accounting-1  | 2024-10-03 15:58:43 0 [Note] Starting MariaDB 11.2.5-MariaDB-ubu2204 source revision dced6cbdb6932738c3a0a1fb435f3f64cb63851a server_uid exbDbOtyxu6in/wvzCArLuMuNus= as process 1
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Number of transaction pools: 1
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Using ARMv8 crc32 + pmull instructions
accounting-1  | 2024-10-03 15:58:43 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Using liburing
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Initializing buffer pool, total size = 1.000GiB, chunk size = 16.000MiB
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Completed initialization of buffer pool
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: End of log at LSN=47875
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Opened 3 undo tablespaces
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: log sequence number 47875; transaction id 16
accounting-1  | 2024-10-03 15:58:43 0 [Note] Plugin 'FEEDBACK' is disabled.
accounting-1  | 2024-10-03 15:58:43 0 [Note] Plugin 'wsrep-provider' is disabled.
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
accounting-1  | 2024-10-03 15:58:43 0 [Note] InnoDB: Buffer pool(s) load completed at 241003 15:58:43
accounting-1  | 2024-10-03 15:58:43 0 [Note] Server socket created on IP: '0.0.0.0'.
accounting-1  | 2024-10-03 15:58:43 0 [Note] Server socket created on IP: '::'.
accounting-1  | 2024-10-03 15:58:43 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
accounting-1  | 2024-10-03 15:58:43 0 [Note] mariadbd: ready for connections.
accounting-1  | Version: '11.2.5-MariaDB-ubu2204'  socket: '/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
worker-1      | total 12
worker-1      | -rw-r--r-- 1 root root  271 Sep 25 14:55 cgroup.conf.template
worker-1      | -rw-r--r-- 1 root root 3099 Sep 25 14:55 slurm.conf.template
worker-1      | -rw-r--r-- 1 root root 1010 Sep 25 14:55 slurmdbd.conf.template
worker-1      |  * Starting system message bus dbus
worker-1      |    ...done.
worker-1      |  * Starting MUNGE munged
worker-1      |    ...done.
worker-1      |  * Starting periodic command scheduler cron
worker-1      |    ...done.
worker-1      | PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
worker-1      | LocalQ*      up   infinite      1   idle worker-dev
worker-1      |  * Starting slurm compute node daemon slurmd
worker-1      |    ...done.
worker-1      | slurmd (pid 95) is running...
worker-1      | slurmd is running.
worker-1      | 
worker-1      | ===================================================================
worker-1      | === Slurm setup complete! monitoring logs forever...
worker-1      | ===================================================================
worker-1      | 
worker-1      | [2024-10-03T15:58:47.958] error: Controller cpuset is not enabled!
worker-1      | [2024-10-03T15:58:47.960] error: Controller memory is not enabled!
worker-1      | [2024-10-03T15:58:47.960] error: Controller cpu is not enabled!
worker-1      | [2024-10-03T15:58:47.966] error: Controller cpuset is not enabled!
worker-1      | [2024-10-03T15:58:47.966] error: Controller memory is not enabled!
worker-1      | [2024-10-03T15:58:47.966] error: Controller cpu is not enabled!
worker-1      | [2024-10-03T15:58:47.980] CPU frequency setting not configured for this node
worker-1      | [2024-10-03T15:58:47.988] slurmd version 24.05.1 started
worker-1      | [2024-10-03T15:58:47.993] slurmd started on Thu, 03 Oct 2024 15:58:47 +0000
worker-1      | [2024-10-03T15:58:47.996] CPUs=10 Boards=1 Sockets=10 Cores=1 Threads=1 Memory=7840 TmpDisk=59767 Uptime=1803 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
backend-1     |  * Starting MUNGE munged
frontend-1    | Re-optimizing dependencies because vite config has changed
frontend-1    | 
frontend-1    |   VITE v5.4.8  ready in 278 ms
frontend-1    | 
frontend-1    |   ➜  Local:   http://localhost:5173/
frontend-1    |   ➜  Network: http://172.18.0.8:5173/
backend-1     |    ...done.
backend-1     | * Running schema migrations, if any are available...
backend-1     | Migrating to version 20240911014316 (3 migrations in total):
backend-1     | 
backend-1     |   -- migrating version 20240715182613
backend-1     |     -> CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
backend-1     |   -- ok (15.599625ms)
backend-1     | 
backend-1     |   -- migrating version 20240718152036
backend-1     |     -> CREATE TYPE "status" AS ENUM ('submitted', 'analyzing', 'complete', 'error');
backend-1     |     -> CREATE TABLE "analyses" ("id" character varying NOT NULL DEFAULT "right"(((uuid_generate_v4())::character varying)::text, 6), "name" character varying NOT NULL, "type" character varying NOT NULL, "info" json NULL, "created" timestamptz NOT NULL DEFAULT now(), "started" timestamptz NULL, "completed" timestamptz NULL, "status" "status" NOT NULL DEFAULT 'submitted', PRIMARY KEY ("id"));
backend-1     |     -> CREATE TABLE "users" ("id" bigserial NOT NULL, "name" character varying NOT NULL, "created" timestamptz NOT NULL DEFAULT now());
backend-1     |     -> CREATE TABLE "analysis_event" ("id" bigserial NOT NULL, "analysis_id" character varying NOT NULL, "event" character varying NOT NULL, "info" text NULL, "created" timestamptz NOT NULL DEFAULT now(), PRIMARY KEY ("id"), CONSTRAINT "analysis_fk" FOREIGN KEY ("analysis_id") REFERENCES "analyses" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION);
backend-1     |   -- ok (17.227ms)
backend-1     | 
backend-1     |   -- migrating version 20240911014316
backend-1     |     -> ALTER TABLE "analyses" ADD COLUMN "reason" text NULL;
backend-1     |   -- ok (1.036125ms)
backend-1     | 
backend-1     |   -------------------------
backend-1     |   -- 98.953334ms
backend-1     |   -- 3 migrations
backend-1     |   -- 6 sql statements
backend-1     | [app] plumbing... 
backend-1     | [app] running: Rscript /app/entrypoint.R 
backend-1     | [app] watching... 
backend-1     | Running plumber API at http://0.0.0.0:9050
backend-1     | Running swagger Docs at http://127.0.0.1:9050/__docs__/
db-1          | 2024-10-03 16:03:40.615 UTC [62] LOG:  checkpoint starting: time
dev-db-1      | 2024-10-03 16:03:40.620 UTC [62] LOG:  checkpoint starting: time
dev-db-1      | 2024-10-03 16:03:44.999 UTC [62] LOG:  checkpoint complete: wrote 45 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.355 s, sync=0.011 s, total=4.380 s; sync files=12, longest=0.009 s, average=0.001 s; distance=260 kB, estimate=260 kB; lsn=0/19534A0, redo lsn=0/1953468
master-1      | [2024-10-03T16:03:47.007] error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
master-1      | [2024-10-03T16:03:47.008] error: NOTE: Trying backup state save file. Jobs may be lost!
master-1      | [2024-10-03T16:03:47.008] No job state file (/var/spool/slurmctld/job_state.old) found
db-1          | 2024-10-03 16:03:55.066 UTC [62] LOG:  checkpoint complete: wrote 142 buffers (0.9%); 0 WAL file(s) added, 0 removed, 0 recycled; write=14.426 s, sync=0.011 s, total=14.451 s; sync files=85, longest=0.004 s, average=0.001 s; distance=590 kB, estimate=590 kB; lsn=0/19A5AB0, redo lsn=0/19A5A78
falquaddoomi commented 1 month ago

@vincerubinetti Thanks for the logs; it seems to be using some of the build cache, but not as much as I was hoping. Building SLURM is specifically what I'm trying to avoid by using the build cache, so the fact that it's still happening is concerning. I'll have to keep looking into it. (On a side note, the runtime logs look like what I expect, so that's good at least.)

Did it seem like an infeasibly long time to build it? I ask because now that you've built it once locally, your build cache is definitely populated, so future builds will be quick. If the initial build time isn't too bad, it might not be worth spending the time to fix it. (Although, frankly, I'm curious now so I'll probably still look into it...)

Regarding this PR, it seems like things are mostly working, and the things you've requested can IMO be pushed to a future refactor PR. Do you think it's ready to merge, or were there things you wanted to see in this PR specifically that aren't in it?

vincerubinetti commented 1 month ago

I'd definitely like to see the Slurm build be skipped in the future, but for now it's definitely fine. I think it took like 10 min? I was doing other stuff so I'm not sure; maybe it could be determined from the logs I pasted.

Feel free to merge.