Closed falquaddoomi closed 1 month ago
Name | Link |
---|---|
Latest commit | df876b2f1aefebe19b6547b960fa99a34a056d4a |
Latest deploy log | https://app.netlify.com/sites/molevolvr/deploys/66feb1f3bfca46000865fbcf |
Deploy Preview | https://deploy-preview-11--molevolvr.netlify.app |
Is this a good time/place to rename `/app` to `/frontend`? Or should I make a new PR for that? Can we make this change soon, before I continue work on the frontend, to avoid conflict difficulties?
While looking at the run stack script, I went down an unnecessary rabbit hole of tweaking the comments and logic. I had some thoughts for improvement:
Please peruse the following as inspiration for changes:
Caveats:
You may also consider splitting up and/or naming the compose files differently. E.g., it's not clear which services "override" will run (just from the name), or what scenario it applies to. Is it possible/prudent to have a single compose file for each service (slurm, caddy, frontend, db, backend, etc), then a compose file for each env that calls whichever ones are needed? Again, I'm speaking far outside my wheelhouse here, just something to consider.
Very minor thing, I noticed the frontend port is `5713`. Vite's default port is `5173` (which I just recently learned is leet-speak for "site"). Maybe change it to either match the default, or be something completely different like `8001`.
Frankly, yes, I realize there's a lot that can be rewritten about `run_stack.sh` and a lot of it is out of date; it kind of grew organically and I haven't had time to update or reorganize it. I agree that a lot of the comments can be cut or simplified, and I'll use your comments for inspiration. Regarding your code changes, there are quite a few things that I assume you are simplifying for stylistic reasons that change the functionality of the code. Again, I'll use your changes for inspiration, but I'm going to leave in things that are 1) necessary for the stack's functionality, and 2) familiar to people who write shell scripts.
I do think it's a good idea to do a refactor, though, and perhaps now's the time for it. I'll add my changes to this PR, FYI.
> You may also consider splitting up and/or naming the compose files differently. E.g., it's not clear which services "override" will run (just from the name), or what scenario it applies to. Is it possible/prudent to have a single compose file for each service (slurm, caddy, frontend, db, backend, etc), then a compose file for each env that calls whichever ones are needed? Again, I'm speaking far outside my wheelhouse here, just something to consider.
Typically compose files are organized by environment; I'm diverging from that a bit with "slurm" not being an environment, but generally I'm following the standard practice of putting core services in a base `docker-compose.yml` and then using compose files to further specify the services for different environments. What you described about putting individual services in individual compose files and then "calling" them from environment files isn't something I've seen done.
Regarding "override", I agree that it's a confusing name and I think it's a good idea to change it. The reason it's named that is because the compose file predates the `run_stack.sh` script which explicitly specifies the compose files; Docker Compose will first use `docker-compose.yml` by default and then `docker-compose.override.yml` if it's present and no compose file has been specified, so it's often used as a "dev" compose file that, e.g., exposes ports for debugging. Since we expect people to use `run_stack.sh` (I assume?) we don't need to rely on it being used by default; I'll name it `docker-compose.dev.yml`, since it's for the dev environment.
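To illustrate the point about explicit compose files, here is a minimal sketch of how a script could build the `-f` arguments per environment. The helper name `compose_args_for_env` and the `docker-compose.<env>.yml` naming pattern are assumptions for illustration, not the actual `run_stack.sh` logic. Once `-f` is passed explicitly, Compose no longer picks up `docker-compose.override.yml` on its own.

```shell
# Hypothetical helper (not taken from run_stack.sh): print the -f arguments
# for a given environment, assuming files are named docker-compose.<env>.yml.
compose_args_for_env() {
    # $1 is the environment name, e.g. "dev"
    echo "-f docker-compose.yml -f docker-compose.$1.yml"
}

# Example usage (left as a comment so nothing runs when this is sourced):
#   docker compose $(compose_args_for_env dev) up
```

Naming the file after the environment makes it obvious from the command line which scenario is being started, which was the original concern with "override".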
Also, if you'd prefer not to see so many compose files in the repo root, I could put them in a subfolder, e.g. `compose` or `compose-envs`, to get them out of the way.
All that sounds good to me. No need to hide the compose files in a folder.
Regarding the test failure, I'm looking into it. Somehow several of the latest actions were failing, and I didn't see that in the past PRs.
Okay, I forgot that the tests for this repo included Firefox and Safari, but the "install playwright" action I made only installed Chromium.
You can either ignore the failing test (and I'll fix this in an upcoming PR), or you can copy the latest action from this issue comment into `/actions/install-playwright/action.yaml`.
@vincerubinetti FYI I just submitted the build cache fixes. `run_stack.sh` will now first do a pull, which will take a little while to download the images, but should be nearly immediate when you run it again. It'll also run a build which should exploit the layer cache that's inlined in the images now and should complete very quickly. If it doesn't, and appears to be doing a full build rather than using the cache, let me know and I can look into it.
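The pull-then-build order described above could look roughly like the following sketch. The function names `run` and `pull_then_build`, the `DRY_RUN` variable, and the compose file names are assumptions for illustration and are not copied from `run_stack.sh`; only `docker compose pull` and `docker compose build` are real commands.

```shell
# Hypothetical sketch of the pull-then-build sequence; not the real script.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pull_then_build() {
    env_name="$1"
    # Assumed file naming: a base file plus one file per environment.
    set -- -f docker-compose.yml -f "docker-compose.${env_name}.yml"
    # Pull first so the published images are available locally...
    run docker compose "$@" pull
    # ...then build, which can reuse cached layers where they match.
    run docker compose "$@" build
}
```

Running it with `DRY_RUN=1` set prints the two commands instead of executing them, which is a cheap way to check what the stack would do for a given environment.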
I made some minor changes to `run_stack.sh`, but not nearly all the things you mentioned; I do think they're good ideas, but this PR's getting a little overloaded IMHO. I think I'll save the `run_stack.sh` refactors for a future PR if that's ok with you.
Re-running it now, from a clean slate (Docker reset). It does appear to be building some stuff still. IIRC, you said somewhere that the Slurm part will still have to be built from scratch?
Here's where I'm at so far:
(base) Vincents-MacBook-Pro:molevolvr2.0 vincerubinetti$ ./run_stack.sh
* Inferred target environment: dev (via DEFAULT_ENV)
* Pulling images for dev (tag: dev)
[+] Pulling 85/7
✔ db Skipped - Image is already being pulled by dev-db 0.0s
✔ worker Skipped - Image is already being pulled by master 0.0s
✔ master Pulled 66.2s
✔ dev-db Pulled 10.4s
✔ backend Pulled 83.7s
✔ frontend Pulled 24.4s
✔ accounting Pulled 7.1s
* Building images for dev (tag: dev)
[+] Building 295.3s (40/54) docker:desktop-linux
=> => transferring context: 2B 0.0s
=> [backend] importing cache manifest from us-central1-docker.pkg.dev/cuhealthai-foundations 0.0s
=> [backend internal] load build context 0.0s
=> => transferring context: 47.60kB 0.0s
=> CACHED [backend backend-base 2/7] RUN apt-get update && apt-get install -y ccache 0.0s
=> CACHED [backend backend-base 3/7] RUN apt-get update && apt-get install -y curl 0.0s
=> CACHED [backend backend-base 4/7] RUN mkdir -p /tmp/software/ && wget -L -O /tmp/soft 0.0s
=> CACHED [backend backend-base 5/7] RUN curl -sSf https://atlasgo.sh | sh 0.0s
=> CACHED [backend backend-base 6/7] COPY ./docker/install.R /tmp/install.r 0.0s
=> CACHED [backend backend-base 7/7] RUN Rscript /tmp/install.r 0.2s
=> [backend backend-slurm 1/7] RUN curl -L -o envsubst "https://github.com/a8m/envsubst/ 0.9s
=> [backend backend-slurm 2/7] RUN groupadd -g 981 munge && useradd -m -c "MUNGE Uid 'N 0.3s
=> [backend backend-slurm 3/7] RUN apt-get update 5.0s
=> [backend backend-slurm 4/7] RUN DEBIAN_FRONTEND=noninteractive apt-get install -y mu 30.9s
=> [backend backend-slurm 5/7] RUN apt-get install -y wget gcc make bzip2 && cd /tmp 256.2s
=> => # libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -DNUMA_VERSION1_C
=> => # OMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-alias
=> => # ing -MT signal.lo -MD -MP -MF .deps/signal.Tpo -c signal.c -fPIC -DPIC -o .libs/signal.o
=> => # libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -DNUMA_VERSION1_C
=> => # OMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-alias
=> => # ing -MT signal.lo -MD -MP -MF .deps/signal.Tpo -c signal.c -o signal.o >/dev/null 2>&1
EDIT: more logs now that it's finished:
dev-db-1 | The files belonging to this database system will be owned by user "postgres".
dev-db-1 | This user must also own the server process.
dev-db-1 |
dev-db-1 | The database cluster will be initialized with locale "en_US.utf8".
dev-db-1 | The default database encoding has accordingly been set to "UTF8".
dev-db-1 | The default text search configuration will be set to "english".
dev-db-1 |
dev-db-1 | Data page checksums are disabled.
dev-db-1 |
dev-db-1 | fixing permissions on existing directory /var/lib/postgresql/data ... ok
dev-db-1 | creating subdirectories ... ok
dev-db-1 | selecting dynamic shared memory implementation ... posix
dev-db-1 | selecting default max_connections ... 100
dev-db-1 | selecting default shared_buffers ... 128MB
dev-db-1 | selecting default time zone ... Etc/UTC
dev-db-1 | creating configuration files ... ok
dev-db-1 | running bootstrap script ... ok
dev-db-1 | performing post-bootstrap initialization ... ok
dev-db-1 | initdb: warning: enabling "trust" authentication for local connections
dev-db-1 | initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
dev-db-1 | syncing data to disk ... ok
dev-db-1 |
dev-db-1 |
dev-db-1 | Success. You can now start the database server using:
dev-db-1 |
dev-db-1 | pg_ctl -D /var/lib/postgresql/data -l logfile start
dev-db-1 |
dev-db-1 | waiting for server to start....2024-10-03 15:58:40.254 UTC [48] LOG: starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
dev-db-1 | 2024-10-03 15:58:40.255 UTC [48] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
dev-db-1 | 2024-10-03 15:58:40.262 UTC [51] LOG: database system was shut down at 2024-10-03 15:58:39 UTC
dev-db-1 | 2024-10-03 15:58:40.265 UTC [48] LOG: database system is ready to accept connections
dev-db-1 | done
dev-db-1 | server started
dev-db-1 | CREATE DATABASE
dev-db-1 |
dev-db-1 |
dev-db-1 | /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
dev-db-1 |
dev-db-1 | waiting for server to shut down....2024-10-03 15:58:40.454 UTC [48] LOG: received fast shutdown request
dev-db-1 | 2024-10-03 15:58:40.455 UTC [48] LOG: aborting any active transactions
dev-db-1 | 2024-10-03 15:58:40.457 UTC [48] LOG: background worker "logical replication launcher" (PID 54) exited with exit code 1
dev-db-1 | 2024-10-03 15:58:40.457 UTC [49] LOG: shutting down
dev-db-1 | 2024-10-03 15:58:40.458 UTC [49] LOG: checkpoint starting: shutdown immediate
dev-db-1 | 2024-10-03 15:58:40.525 UTC [49] LOG: checkpoint complete: wrote 922 buffers (5.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.013 s, sync=0.048 s, total=0.069 s; sync files=301, longest=0.008 s, average=0.001 s; distance=4255 kB, estimate=4255 kB; lsn=0/1912080, redo lsn=0/1912080
dev-db-1 | 2024-10-03 15:58:40.528 UTC [48] LOG: database system is shut down
dev-db-1 | done
dev-db-1 | server stopped
dev-db-1 |
dev-db-1 | PostgreSQL init process complete; ready for start up.
dev-db-1 |
dev-db-1 | 2024-10-03 15:58:40.570 UTC [1] LOG: starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
dev-db-1 | 2024-10-03 15:58:40.570 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
dev-db-1 | 2024-10-03 15:58:40.570 UTC [1] LOG: listening on IPv6 address "::", port 5432
dev-db-1 | 2024-10-03 15:58:40.572 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
dev-db-1 | 2024-10-03 15:58:40.575 UTC [64] LOG: database system was shut down at 2024-10-03 15:58:40 UTC
dev-db-1 | 2024-10-03 15:58:40.578 UTC [1] LOG: database system is ready to accept connections
db-1 | The files belonging to this database system will be owned by user "postgres".
db-1 | This user must also own the server process.
db-1 |
db-1 | The database cluster will be initialized with locale "en_US.utf8".
db-1 | The default database encoding has accordingly been set to "UTF8".
db-1 | The default text search configuration will be set to "english".
db-1 |
backend-1 | * Slurm enabled, configuring...
db-1 | Data page checksums are disabled.
db-1 |
db-1 | fixing permissions on existing directory /var/lib/postgresql/data ... ok
db-1 | creating subdirectories ... ok
db-1 | selecting dynamic shared memory implementation ... posix
db-1 | selecting default max_connections ... 100
db-1 | selecting default shared_buffers ... 128MB
db-1 | selecting default time zone ... Etc/UTC
db-1 | creating configuration files ... ok
db-1 | running bootstrap script ... ok
db-1 | performing post-bootstrap initialization ... ok
db-1 | syncing data to disk ... ok
db-1 |
db-1 |
db-1 | Success. You can now start the database server using:
db-1 |
db-1 | pg_ctl -D /var/lib/postgresql/data -l logfile start
db-1 | initdb: warning: enabling "trust" authentication for local connections
db-1 | initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
db-1 |
db-1 | waiting for server to start....2024-10-03 15:58:40.254 UTC [48] LOG: starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
db-1 | 2024-10-03 15:58:40.255 UTC [48] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db-1 | 2024-10-03 15:58:40.261 UTC [51] LOG: database system was shut down at 2024-10-03 15:58:39 UTC
db-1 | 2024-10-03 15:58:40.265 UTC [48] LOG: database system is ready to accept connections
db-1 | done
db-1 | server started
db-1 | CREATE DATABASE
db-1 |
db-1 |
db-1 | /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
db-1 |
db-1 | waiting for server to shut down....2024-10-03 15:58:40.454 UTC [48] LOG: received fast shutdown request
db-1 | 2024-10-03 15:58:40.455 UTC [48] LOG: aborting any active transactions
db-1 | 2024-10-03 15:58:40.457 UTC [48] LOG: background worker "logical replication launcher" (PID 54) exited with exit code 1
db-1 | 2024-10-03 15:58:40.458 UTC [49] LOG: shutting down
db-1 | 2024-10-03 15:58:40.459 UTC [49] LOG: checkpoint starting: shutdown immediate
db-1 | 2024-10-03 15:58:40.525 UTC [49] LOG: checkpoint complete: wrote 922 buffers (5.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.013 s, sync=0.049 s, total=0.067 s; sync files=301, longest=0.005 s, average=0.001 s; distance=4255 kB, estimate=4255 kB; lsn=0/1912080, redo lsn=0/1912080
db-1 | 2024-10-03 15:58:40.528 UTC [48] LOG: database system is shut down
db-1 | done
db-1 | server stopped
db-1 |
db-1 | PostgreSQL init process complete; ready for start up.
db-1 |
db-1 | 2024-10-03 15:58:40.570 UTC [1] LOG: starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
db-1 | 2024-10-03 15:58:40.570 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
db-1 | 2024-10-03 15:58:40.570 UTC [1] LOG: listening on IPv6 address "::", port 5432
db-1 | 2024-10-03 15:58:40.572 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db-1 | 2024-10-03 15:58:40.574 UTC [64] LOG: database system was shut down at 2024-10-03 15:58:40 UTC
db-1 | 2024-10-03 15:58:40.577 UTC [1] LOG: database system is ready to accept connections
master-1 | total 12
master-1 | -rw-r--r-- 1 root root 271 Sep 25 14:55 cgroup.conf.template
master-1 | -rw-r--r-- 1 root root 3099 Sep 25 14:55 slurm.conf.template
master-1 | -rw-r--r-- 1 root root 1010 Sep 25 14:55 slurmdbd.conf.template
master-1 | * Starting system message bus dbus
master-1 | ...done.
master-1 | * Starting MUNGE munged
master-1 | ...done.
master-1 | * Starting periodic command scheduler cron
master-1 | ...done.
master-1 | * Starting slurm-wlm database server interface
master-1 | ...done.
master-1 | slurmdbd (pid 91) is running...
master-1 | slurmdbd is running.
master-1 | * Starting slurm central management daemon slurmctld
master-1 | ...done.
master-1 | slurmdbd (pid 91) is running...
master-1 | slurmdbd is running.
master-1 | slurmctld is not running. Checking again in 5 seconds...
master-1 | * Starting slurm central management daemon slurmctld
master-1 | ...done.
master-1 | slurmctld is not running. Checking again in 5 seconds...
master-1 | * Starting slurm central management daemon slurmctld
master-1 | ...done.
master-1 | slurmctld (pid 177) is running...
accounting-1 | 2024-10-03 15:58:39+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.5+maria~ubu2204 started.
accounting-1 | 2024-10-03 15:58:39+00:00 [Warn] [Entrypoint]: /sys/fs/cgroup///memory.pressure not writable, functionality unavailable to MariaDB
master-1 | slurmctld is running.
master-1 |
master-1 | ===================================================================
master-1 | === Slurm setup complete! monitoring logs forever...
master-1 | ===================================================================
accounting-1 | 2024-10-03 15:58:39+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
accounting-1 | 2024-10-03 15:58:39+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.5+maria~ubu2204 started.
accounting-1 | 2024-10-03 15:58:39+00:00 [Note] [Entrypoint]: Initializing database files
accounting-1 | 2024-10-03 15:58:40+00:00 [Note] [Entrypoint]: Database files initialized
accounting-1 | 2024-10-03 15:58:40+00:00 [Note] [Entrypoint]: Starting temporary server
master-1 |
accounting-1 | 2024-10-03 15:58:40+00:00 [Note] [Entrypoint]: Waiting for server startup
master-1 | [2024-10-03T15:58:47.278] error: Could not open trigger state file /var/spool/slurmctld/trigger_state: No such file or directory
master-1 | [2024-10-03T15:58:47.278] error: NOTE: Trying backup state save file. Triggers may be lost!
master-1 | [2024-10-03T15:58:47.278] No trigger state file (/var/spool/slurmctld/trigger_state.old) to recover
master-1 | [2024-10-03T15:58:47.278] read_slurm_conf: backup_controller not specified
master-1 | [2024-10-03T15:58:47.278] Reinitializing job accounting state
master-1 | [2024-10-03T15:58:47.279] accounting_storage/slurmdbd: acct_storage_p_flush_jobs_on_cluster: Ending any jobs in accounting that were running when controller went down on
accounting-1 | 2024-10-03 15:58:40 0 [Note] Starting MariaDB 11.2.5-MariaDB-ubu2204 source revision dced6cbdb6932738c3a0a1fb435f3f64cb63851a server_uid exbDbOtyxu6in/wvzCArLuMuNus= as process 92
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
master-1 | [2024-10-03T15:58:47.279] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
master-1 | [2024-10-03T15:58:47.279] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
master-1 | [2024-10-03T15:58:47.280] Running as primary controller
master-1 | [2024-10-03T15:58:47.285] error: No fed_mgr state file (/var/spool/slurmctld/fed_mgr_state) to recover
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: Number of transaction pools: 1
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: Using ARMv8 crc32 + pmull instructions
accounting-1 | 2024-10-03 15:58:40 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: Using liburing
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: Initializing buffer pool, total size = 1.000GiB, chunk size = 16.000MiB
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: Completed initialization of buffer pool
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: End of log at LSN=46300
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: Opened 3 undo tablespaces
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
accounting-1 | 2024-10-03 15:58:40 0 [Note] InnoDB: log sequence number 46300; transaction id 14
accounting-1 | 2024-10-03 15:58:40 0 [Note] Plugin 'FEEDBACK' is disabled.
accounting-1 | 2024-10-03 15:58:40 0 [Note] Plugin 'wsrep-provider' is disabled.
accounting-1 | 2024-10-03 15:58:40 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
accounting-1 | 2024-10-03 15:58:40 0 [Note] mariadbd: ready for connections.
accounting-1 | Version: '11.2.5-MariaDB-ubu2204' socket: '/run/mysqld/mysqld.sock' port: 0 mariadb.org binary distribution
accounting-1 | 2024-10-03 15:58:41+00:00 [Note] [Entrypoint]: Temporary server started.
accounting-1 | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Creating database slurm_acct_db
accounting-1 | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Creating user slurmdbd
accounting-1 | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Giving user slurmdbd access to schema slurm_acct_db
accounting-1 | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Securing system users (equivalent to running mysql_secure_installation)
accounting-1 |
accounting-1 | 2024-10-03 15:58:42+00:00 [Note] [Entrypoint]: Stopping temporary server
accounting-1 | 2024-10-03 15:58:42 0 [Note] mariadbd (initiated by: unknown): Normal shutdown
accounting-1 | 2024-10-03 15:58:42 0 [Note] InnoDB: FTS optimize thread exiting.
accounting-1 | 2024-10-03 15:58:42 0 [Note] InnoDB: Starting shutdown...
accounting-1 | 2024-10-03 15:58:42 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool
accounting-1 | 2024-10-03 15:58:42 0 [Note] InnoDB: Buffer pool(s) dump completed at 241003 15:58:42
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Shutdown completed; log sequence number 47875; transaction id 15
accounting-1 | 2024-10-03 15:58:43 0 [Note] mariadbd: Shutdown complete
accounting-1 |
accounting-1 | 2024-10-03 15:58:43+00:00 [Note] [Entrypoint]: Temporary server stopped
accounting-1 |
accounting-1 | 2024-10-03 15:58:43+00:00 [Note] [Entrypoint]: MariaDB init process done. Ready for start up.
accounting-1 |
accounting-1 | 2024-10-03 15:58:43 0 [Note] Starting MariaDB 11.2.5-MariaDB-ubu2204 source revision dced6cbdb6932738c3a0a1fb435f3f64cb63851a server_uid exbDbOtyxu6in/wvzCArLuMuNus= as process 1
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Number of transaction pools: 1
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Using ARMv8 crc32 + pmull instructions
accounting-1 | 2024-10-03 15:58:43 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Using liburing
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Initializing buffer pool, total size = 1.000GiB, chunk size = 16.000MiB
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Completed initialization of buffer pool
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: End of log at LSN=47875
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Opened 3 undo tablespaces
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: log sequence number 47875; transaction id 16
accounting-1 | 2024-10-03 15:58:43 0 [Note] Plugin 'FEEDBACK' is disabled.
accounting-1 | 2024-10-03 15:58:43 0 [Note] Plugin 'wsrep-provider' is disabled.
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
accounting-1 | 2024-10-03 15:58:43 0 [Note] InnoDB: Buffer pool(s) load completed at 241003 15:58:43
accounting-1 | 2024-10-03 15:58:43 0 [Note] Server socket created on IP: '0.0.0.0'.
accounting-1 | 2024-10-03 15:58:43 0 [Note] Server socket created on IP: '::'.
accounting-1 | 2024-10-03 15:58:43 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
accounting-1 | 2024-10-03 15:58:43 0 [Note] mariadbd: ready for connections.
accounting-1 | Version: '11.2.5-MariaDB-ubu2204' socket: '/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
worker-1 | total 12
worker-1 | -rw-r--r-- 1 root root 271 Sep 25 14:55 cgroup.conf.template
worker-1 | -rw-r--r-- 1 root root 3099 Sep 25 14:55 slurm.conf.template
worker-1 | -rw-r--r-- 1 root root 1010 Sep 25 14:55 slurmdbd.conf.template
worker-1 | * Starting system message bus dbus
worker-1 | ...done.
worker-1 | * Starting MUNGE munged
worker-1 | ...done.
worker-1 | * Starting periodic command scheduler cron
worker-1 | ...done.
worker-1 | PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
worker-1 | LocalQ* up infinite 1 idle worker-dev
worker-1 | * Starting slurm compute node daemon slurmd
worker-1 | ...done.
worker-1 | slurmd (pid 95) is running...
worker-1 | slurmd is running.
worker-1 |
worker-1 | ===================================================================
worker-1 | === Slurm setup complete! monitoring logs forever...
worker-1 | ===================================================================
worker-1 |
worker-1 | [2024-10-03T15:58:47.958] error: Controller cpuset is not enabled!
worker-1 | [2024-10-03T15:58:47.960] error: Controller memory is not enabled!
worker-1 | [2024-10-03T15:58:47.960] error: Controller cpu is not enabled!
worker-1 | [2024-10-03T15:58:47.966] error: Controller cpuset is not enabled!
worker-1 | [2024-10-03T15:58:47.966] error: Controller memory is not enabled!
worker-1 | [2024-10-03T15:58:47.966] error: Controller cpu is not enabled!
worker-1 | [2024-10-03T15:58:47.980] CPU frequency setting not configured for this node
worker-1 | [2024-10-03T15:58:47.988] slurmd version 24.05.1 started
worker-1 | [2024-10-03T15:58:47.993] slurmd started on Thu, 03 Oct 2024 15:58:47 +0000
worker-1 | [2024-10-03T15:58:47.996] CPUs=10 Boards=1 Sockets=10 Cores=1 Threads=1 Memory=7840 TmpDisk=59767 Uptime=1803 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
backend-1 | * Starting MUNGE munged
frontend-1 | Re-optimizing dependencies because vite config has changed
frontend-1 |
frontend-1 | VITE v5.4.8 ready in 278 ms
frontend-1 |
frontend-1 | ➜ Local: http://localhost:5173/
frontend-1 | ➜ Network: http://172.18.0.8:5173/
backend-1 | ...done.
backend-1 | * Running schema migrations, if any are available...
backend-1 | Migrating to version 20240911014316 (3 migrations in total):
backend-1 |
backend-1 | -- migrating version 20240715182613
backend-1 | -> CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
backend-1 | -- ok (15.599625ms)
backend-1 |
backend-1 | -- migrating version 20240718152036
backend-1 | -> CREATE TYPE "status" AS ENUM ('submitted', 'analyzing', 'complete', 'error');
backend-1 | -> CREATE TABLE "analyses" ("id" character varying NOT NULL DEFAULT "right"(((uuid_generate_v4())::character varying)::text, 6), "name" character varying NOT NULL, "type" character varying NOT NULL, "info" json NULL, "created" timestamptz NOT NULL DEFAULT now(), "started" timestamptz NULL, "completed" timestamptz NULL, "status" "status" NOT NULL DEFAULT 'submitted', PRIMARY KEY ("id"));
backend-1 | -> CREATE TABLE "users" ("id" bigserial NOT NULL, "name" character varying NOT NULL, "created" timestamptz NOT NULL DEFAULT now());
backend-1 | -> CREATE TABLE "analysis_event" ("id" bigserial NOT NULL, "analysis_id" character varying NOT NULL, "event" character varying NOT NULL, "info" text NULL, "created" timestamptz NOT NULL DEFAULT now(), PRIMARY KEY ("id"), CONSTRAINT "analysis_fk" FOREIGN KEY ("analysis_id") REFERENCES "analyses" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION);
backend-1 | -- ok (17.227ms)
backend-1 |
backend-1 | -- migrating version 20240911014316
backend-1 | -> ALTER TABLE "analyses" ADD COLUMN "reason" text NULL;
backend-1 | -- ok (1.036125ms)
backend-1 |
backend-1 | -------------------------
backend-1 | -- 98.953334ms
backend-1 | -- 3 migrations
backend-1 | -- 6 sql statements
backend-1 | [app] plumbing...
backend-1 | [app] running: Rscript /app/entrypoint.R
backend-1 | [app] watching...
backend-1 | Running plumber API at http://0.0.0.0:9050
backend-1 | Running swagger Docs at http://127.0.0.1:9050/__docs__/
db-1 | 2024-10-03 16:03:40.615 UTC [62] LOG: checkpoint starting: time
dev-db-1 | 2024-10-03 16:03:40.620 UTC [62] LOG: checkpoint starting: time
dev-db-1 | 2024-10-03 16:03:44.999 UTC [62] LOG: checkpoint complete: wrote 45 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.355 s, sync=0.011 s, total=4.380 s; sync files=12, longest=0.009 s, average=0.001 s; distance=260 kB, estimate=260 kB; lsn=0/19534A0, redo lsn=0/1953468
master-1 | [2024-10-03T16:03:47.007] error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
master-1 | [2024-10-03T16:03:47.008] error: NOTE: Trying backup state save file. Jobs may be lost!
master-1 | [2024-10-03T16:03:47.008] No job state file (/var/spool/slurmctld/job_state.old) found
db-1 | 2024-10-03 16:03:55.066 UTC [62] LOG: checkpoint complete: wrote 142 buffers (0.9%); 0 WAL file(s) added, 0 removed, 0 recycled; write=14.426 s, sync=0.011 s, total=14.451 s; sync files=85, longest=0.004 s, average=0.001 s; distance=590 kB, estimate=590 kB; lsn=0/19A5AB0, redo lsn=0/19A5A78
@vincerubinetti Thanks for the logs; it seems to be using some of the build cache, but not as much as I was hoping it would. Building SLURM is what I'm specifically trying to avoid by using the build cache, so the fact that it's still doing it is concerning. I'll have to keep looking into it. (On a side note, the runtime logs look like what I expect, so that's good at least.)
Did it seem like an infeasibly long time to build it? I ask because now that you've built it once locally, your build cache is definitely populated, so future builds will be quick. If the initial build time isn't too bad, it might not be worth spending the time to fix it. (Although, frankly, I'm curious now so I'll probably still look into it...)
Regarding this PR, it seems like things are mostly working, and the things you've requested can IMO be pushed to a future refactor PR. Do you think it's ready to merge, or were there things you wanted to see in this PR specifically that aren't in it?
I'd definitely like to see the Slurm build be skipped in the future, but for now it's definitely fine. I think it took like 10 min? I was doing other stuff so I'm not sure; maybe it could be determined from the logs I pasted.
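On determining the time from the pasted logs: the build output above was captured mid-build (`Building 295.3s (40/54)`, with the `backend-slurm 5/7` step at 256.2s), so it only gives a lower bound of roughly five minutes. A rough way to pull the slowest step out of a saved build log is sketched below; the function name `slowest_step` is made up for illustration, and it assumes each step line ends with a duration such as `256.2s`.

```shell
# Hypothetical helper: read BuildKit-style progress lines on stdin and print
# the largest trailing "<seconds>s" value (prints 0 if no line matches).
slowest_step() {
    awk '$NF ~ /^[0-9]+(\.[0-9]+)?s$/ {
             t = $NF
             sub(/s$/, "", t)              # drop the trailing "s"
             if (t + 0 > max) max = t + 0  # keep the largest duration seen
         }
         END { print max + 0 }'
}

# Example usage, assuming the build output was saved to build.log:
#   slowest_step < build.log
```

Because the log here is a snapshot rather than the final summary, the number it reports is the time elapsed on that step when the log was copied, not the step's total.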
Feel free to merge.
This PR builds on #5, with two main objectives:
In addition to the main objectives, this PR includes a few tweaks to API response handling and the testing framework. A new field, `reason`, has been added to the `analyses` table to capture why a status was set. Currently it gets set to an exception traceback if the analysis throws an error.
The PR includes a skeleton of how analyses would be processed in `backend/api/dispatch/submit.R` (called via `dispatchAnalysis()` in `backend/api/endpoints/analyses.R`), but it doesn't actually do any real work yet.
Things to try:
`./run_stack.sh shell` in a separate window, then run `./run_tests.sh`, which will perform a few integration tests