ACCESS-NRI / accessdev-Trac-archive

Archive accessdev Trac contents as issues
Apache License 2.0

Cylc site DB errors #345

Open penguian opened 6 years ago

penguian commented 6 years ago

| by mrd599@nci.org.au


Jin has had several instances of a suite failing with this error message in the suite log:

2017-10-18T14:02:31Z INFO - Suite shutting down - ERROR: database disk image is malformed

The "database disk image is malformed" message comes from sqlite3 trying to write to log/suite/db on accessdev.

However, the suite can be restarted successfully, which means that the DB isn't really corrupted, unlike the problems we had with the rose-ana DB on raijin (https://github.com/metomi/rose/issues/1897).
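One way to confirm that a DB file is intact after a failure like this is to run SQLite's own integrity check on it. The following is a minimal sketch, assuming the file is an ordinary SQLite database; the helper name `check_sqlite_integrity` is illustrative and not part of cylc or rose.

```python
import sqlite3
from pathlib import Path


def check_sqlite_integrity(db_path):
    """Return the messages reported by PRAGMA integrity_check for db_path.

    A healthy database yields the single message "ok".
    """
    # Open read-only through a file: URI so the check cannot modify the
    # database and does not create a new file if the path is wrong.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # Severe damage can make the pragma itself fail; report the
        # SQLite error text instead of a list of problems.
        return [str(exc)]
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Calling `check_sqlite_integrity` on a copy of the suite's log/suite/db taken just after a failure should return `["ok"]` if the file on disk really is sound, which would point towards a transient I/O problem rather than real corruption.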


Issue migrated from trac:345 at 2024-01-31 18:32:27 +1100

penguian commented 6 years ago

@martin.dix@anu.edu.au commented


Could this be related to the occasional very slow disk access on accessdev, perhaps a misreported timeout problem?

However, the error occurred at about midnight local time. The NCI dashboard shows the load average was < 5 then, cf. a peak of 25 in the middle of the day, which suggests it is not load related.
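One way to probe the slow-disk hypothesis would be to time a series of small SQLite commits on the same filesystem as the suite log directory and look for outliers. This is a rough sketch only; the function name, the `probe` table, and the default sample count are illustrative assumptions, and it writes to a scratch file rather than the real suite DB.

```python
import os
import sqlite3
import tempfile
import time


def time_sqlite_commits(directory, samples=50):
    """Return the wall-clock duration in seconds of each of `samples` commits.

    Each commit writes one small row to a scratch SQLite file created in
    `directory`, so the timings reflect that filesystem's write latency.
    """
    fd, path = tempfile.mkstemp(suffix=".db", dir=directory)
    os.close(fd)
    durations = []
    try:
        conn = sqlite3.connect(path)
        try:
            conn.execute("CREATE TABLE probe (stamp REAL)")
            conn.commit()
            for _ in range(samples):
                start = time.perf_counter()
                conn.execute("INSERT INTO probe VALUES (?)", (time.time(),))
                conn.commit()
                durations.append(time.perf_counter() - start)
        finally:
            conn.close()
    finally:
        # Remove the scratch file whether or not the probe succeeded.
        os.remove(path)
    return durations
```

Running this periodically and logging `max(time_sqlite_commits(directory))` alongside a timestamp would show whether commit stalls line up with the times the suite reported the error.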

penguian commented 6 years ago

@martin.dix@anu.edu.au changed _comment0, which was not transferred by tractive

penguian commented 6 years ago

@martin.dix@anu.edu.au commented


cylc has profiling tests. To run them, copy the cylc directory and modify dev/profile-experiments/complex.json to use batch-system=background.

Run with

cylc profile-battery --experiments complex

On accessdev-test:

Version  Run            Elapsed Time (s)  CPU Time - Total (s)  Max Memory (kb)
HEAD     complex suite  3275.6            279.3                 83380.0       

Elapsed time is too large by a factor of 2 (effect of dual CPU machine?).

The DB reached about 6 MB, cf. Jin's suite, which reached 10 MB over a much longer period.

No DB problems on accessdev-test.