Open cmoussa1 opened 4 years ago
What's the dbpath you are using?
dbpath is /run/flux/jobs.db
I'm confident that the data can persist in the db after the flux instance is brought down. As an experiment, I tried this:
--- a/t/t2220-job-archive.t
+++ b/t/t2220-job-archive.t
@@ -8,7 +8,7 @@ export FLUX_CONF_DIR=$(pwd)
test_under_flux 4 job
ARCHIVEDIR=`pwd`
-ARCHIVEDB="${ARCHIVEDIR}/jobarchive.db"
+ARCHIVEDB="/tmp/achu/jobarchive.db"
So basically, instead of creating a new file each time the tests are run, the test now uses a fixed path where data from a prior run will be loaded. The first test run succeeds; the second run sees the prior data and the tests fail.
For this issue, the question is whether the db is somehow removed when the instance stops running, or whether the instance is shut down in a way that leaves some data unflushed to disk.
I asked @cmoussa1 if he could try storing the db someplace "safe" as a first test. Given it's Docker and systemd, I'm not sure the path /run/flux is a safe place to store a db (i.e. the directory could be wiped after the instance stops running).
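One quick way to judge whether a directory is "safe" in this sense is to look at the filesystem type behind it. The sketch below assumes GNU coreutils `stat` is available; `fs_type` and `is_volatile_fs` are hypothetical helper names, not Flux commands. On systemd-based systems /run is typically a tmpfs, which would explain data disappearing, but that is an assumption to verify in the container.

```shell
#!/bin/sh
# Print the filesystem type backing a directory (assumes GNU coreutils stat).
fs_type() {
    stat -f -c %T "$1" 2>/dev/null
}

# Hypothetical helper: succeed if the type name is a memory-backed filesystem.
is_volatile_fs() {
    case "$1" in
        tmpfs|ramfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Directory to inspect; defaults to /run, the parent of /run/flux.
dir=${1:-/run}
if [ -d "$dir" ]; then
    fstype=$(fs_type "$dir")
    if [ -z "$fstype" ]; then
        echo "could not determine the filesystem type of $dir"
    elif is_volatile_fs "$fstype"; then
        echo "$dir is on $fstype (memory-backed): a db stored here can be lost"
    else
        echo "$dir is on $fstype (not memory-backed)"
    fi
fi
```

Even on a persistent filesystem, a directory can still be removed by whatever manages it, so this only rules out the most obvious cause.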
Ah, yeah, content-store is kept under an alternate path in this environment (I think I hinted to @cmoussa1 that the job-archive db should go in rundir, but that probably wasn't correct).
content.backing-path /usr/var/lib/flux/content.sqlite
Sorry for the misdirection!
Our docker image probably needs to be fixed though; it should be /var/lib, not /usr/var/lib. We probably need --localstatedir=/var added to the default configure args.
I was able to point the job-archive DB to /usr/var/lib/flux/, where content.sqlite is located. Jobs do in fact remain in the DB after a restart:
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------- ---------------- ---------------- ---------------- -----------------------------------------------------------------------------
2201 1553133993984 1594921654.80753 1594921654.82732 1594921654.89485 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-1"}}]}}
2001 1260606455808 1594921637.3711 1594921637.389 1594921637.44179 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001 1238141763584 1594921636.03251 1594921636.05075 1594921636.12109 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
Then after a systemctl restart flux:
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------- ---------------- ---------------- ---------------- -----------------------------------------------------------------------------
2201 1553133993984 1594921654.80753 1594921654.82732 1594921654.89485 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-1"}}]}}
2001 1260606455808 1594921637.3711 1594921637.389 1594921637.44179 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001 1238141763584 1594921636.03251 1594921636.05075 1594921636.12109 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
The one thing I noticed is that root privileges are required to access this database here. Maybe this is expected, and the behavior we want with access to the job-archive DB. With SQLite, I believe the user trying to open the database also needs access to the directory the database file is located in [1].
FWIW, I tried specifying a custom location elsewhere (i.e. I made a new directory under / called /new-dir and specified dbpath=/new-dir/jobs.db), but I get a cmb.insmod: No such file or directory error.
[root@394b0121e8cc ~]# flux module load job-archive dbpath=/new-dir/jobs.db
flux-module: cmb.insmod: No such file or directory
@cmoussa1, the newest fluxorama image has the change to --localstatedir=/var, so the sqlite.db content-cache will be found under /var/lib/flux. You might want to process the relevant attr in your rc1 script in order to always place the job-archive db co-located with the content cache.
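A rough sketch of what that rc1 step could look like is below. It assumes the relevant attribute is content.backing-path (the name shown earlier in this thread) and that it can be read with `flux getattr`; `archive_dbpath` is a hypothetical helper, and the jobs.db file name is simply the one used above.

```shell
#!/bin/sh
# Hypothetical helper: build a job-archive db path in the same directory
# as the content backing store file passed as $1.
archive_dbpath() {
    printf '%s/jobs.db\n' "$(dirname "$1")"
}

# Only talk to Flux when the command is available (e.g. when run from rc1).
if command -v flux >/dev/null 2>&1; then
    # Assumption: content.backing-path holds the backing store location.
    backing=$(flux getattr content.backing-path) &&
        flux module load job-archive dbpath="$(archive_dbpath "$backing")"
fi
```

With the fixed image, a backing path of /var/lib/flux/content.sqlite would give dbpath=/var/lib/flux/jobs.db, so the two files move together if the image's localstatedir changes again.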
Just confirmed this - thanks @grondo!
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------ ---------------- ---------------- ---------------- ---------------------------------------------------------------------------
2001 817520181248 1594931850.05394 1594931850.07371 1594931850.13204 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001 799014912000 1594931848.95024 1594931848.97049 1594931849.03598 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
sqlite> .exit
[root@98a7bf4b57b6 ~]# systemctl restart flux
[root@98a7bf4b57b6 ~]# sqlite3 /var/lib/flux/jobs.db
SQLite version 3.26.0 2018-12-01 12:34:55
Enter ".help" for usage hints.
sqlite> .mode columns
sqlite> .headers on
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------ ---------------- ---------------- ---------------- ---------------------------------------------------------------------------
2001 817520181248 1594931850.05394 1594931850.07371 1594931850.13204 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001 799014912000 1594931848.95024 1594931848.97049 1594931849.03598 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
sqlite>
Side question - do you think the bank/accounting database should also reside in /var/lib/flux? I think this would mean only root would be able to interact with both databases (the job-archive DB and the bank/accounting DB).
Good question. I'm not sure how the bank/accounting DBs work. The utilities need direct rw access for now?
Yes, both DBs need direct rw access, as well as ownership of the directory the database file resides in (this behavior is a result of using SQLite).
Then it does seem like those DBs need to go in a different directory. Perhaps with group permissions for a new fluxadmin group? At least for now...
I think so too. The problem I've been running into with trying to place the job-archive DB in a custom location, however, is that I get a cmb.insmod: No such file or directory error, even when I create a new directory as root.
[root@844cdba731d4 ~]# mkdir /new-dir
[root@394b0121e8cc ~]# flux module load job-archive dbpath=/new-dir/jobs.db
flux-module: cmb.insmod: No such file or directory
FWIW, I tried this by running a single instance on one of the LC machines, and got the same error. Maybe I am misunderstanding the dbpath option.
I wouldn't choose /new-dir, since it isn't a directory that is going to exist on any Linux system. In the standard filesystem hierarchy, this DB probably also should exist under /var/lib, so maybe /var/lib/flux-accounting? Maybe for the example docker image, the directory should have flux.fluxadmin permissions. Be sure to create the directory before you load the job-archive module.
More discussion probably needed on how this might work in a production situation.
FWIW, I tried this by running a single instance on one of the LC machines, and got the same error. Maybe I am misunderstanding the dbpath option.
Typically this error indicates a path issue, e.g. /new-dir doesn't exist. Although what you're doing above seems legit. Can you see what is in the flux broker logs?
Probably because the mkdir is being done as root, and flux runs as the flux user.
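To test that theory, a pre-flight check like the sketch below can be run as the flux user before loading the module. `check_db_dir` is a hypothetical helper, not part of Flux. It only checks that the directory holding the db exists and is writable and searchable by whoever runs it, which matters because SQLite creates its journal files next to the db.

```shell
#!/bin/sh
# Hypothetical helper: verify that the directory that will hold the db file
# given as $1 exists and is usable by the user running this script.
check_db_dir() {
    dir=$(dirname "$1")
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    # SQLite writes journal files beside the db, so the directory itself
    # must be writable and searchable, not just the db file.
    if [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
        echo "directory not writable by $(id -un): $dir"
        return 1
    fi
    echo "ok: $dir"
}
```

If the check fails for the flux user but passes for root, changing the directory's owner or group (for example to the proposed fluxadmin group) is the likely fix. Note that root passes the write test on almost any directory, so running this as root says little.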
While setting up my version of the fluxorama Docker container to load both flux-accounting and the job-archive module, I noticed that the inactive jobs that get written to the .db file are not persistent after a system instance restart of Flux.
My reproducer:
From within the container, I'll submit a couple of jobs:
These two jobs can be seen with both flux jobs -A and in the job-archive DB, whose location is /run/flux/jobs.db:
After a systemctl restart flux, the two jobs will still show with flux jobs -A, but not in the job-archive DB:
The .db file gets a new modification time after the Flux instance is restarted, but there are no previously completed jobs written there anymore.
IMHO, I don't think it is a blocker for the testing environment I was planning on working with Ryan Day, but it's just something I noticed.