Not sure if related, but the discourse `preStart` script is insanely resource intensive: it seems to be doing an initialisation of sorts, the kind that I'd expect on the first run, but it does it every time the service is (re)started.
Is this intentional?
@ryantm I'm not able to reproduce this on master. It seems to fail during migration for you, but it's not clear to me why that is.
@rnhmjoj Not sure which step you mean. The heaviest thing `preStart` does is probably the database migration, which has to be done on every start. Other than that, it creates an admin account (if it doesn't exist), updates themes and regenerates missing optimized images. I guess any rake task in a rails app could be considered comparatively resource intensive, though.
> updates themes and regenerates missing optimized images
I think it was that: I remember seeing lots of imagemagick processes running in parallel. I had considered Discourse for a quick demo in a talk on NixOS, but because it was taking >20min to start the service, in the end I opted for Jitsi Meet.
Oh, I see. Sorry to hear that.
The image optimization shouldn't do anything at all with a clean installation, since it only optimizes uploaded images. If it does, that sounds like a bug.
It's normal that starting the service takes a while - database population and migration take time and Discourse is a rather heavy rails app. >20min sounds pretty excessive, though - the test completes in 6min 37s on my machine, and that runs much slower than my test instance, an EC2 t3.small. Do you remember which hardware/VM you tried it on?
It was running in the usual QEMU VM generated with `nixos-rebuild build-vm`, with a bit of tweaking to use more memory and 4 cores. I happen to still have the configuration for it, if you want to try. The host itself was an x86_64 machine with KVM, 4 cores, 8G RAM and no swap.
With the following simplified config, Discourse starts in 4min 20s on my machine. Can you try it out?
```nix
{ pkgs, lib, ... }:

let
  adminPassword = "eYAX85qmMJ5GZIHLaXGDAoszD7HSZp5d";
  secretKeyBase = "381f4ac6d8f5e49d804dae72aa9c046431d2f34c656a705c41cd52fed9b4f6f76f51549f0b55db3b8b0dded7a00d6a381ebe9a4367d2d44f5e743af6628b4d42";
in
{
  ##
  ## System Settings
  ##
  networking.hostName = "discourse-test";

  # Firewall
  networking.firewall.allowedTCPPorts = [ 80 ];

  services.postgresql.package = pkgs.postgresql_13;

  # Forum
  services.discourse = {
    enable = true;
    hostname = "localhost";
    secretKeyBaseFile = "${pkgs.writeText "secret-key-base" secretKeyBase}";
    enableACME = false;

    # Admin user
    admin = {
      email = "test@test.org";
      username = "admin";
      fullName = "Administrator";
      passwordFile = "${pkgs.writeText "admin-pass" adminPassword}";
    };

    # Forum settings
    siteSettings = {
      # Forum description
      required = {
        title = "HackLab Demo";
        site_description = "Discuss NixOS here (if the demo worked)";
      };
    };
  };

  # UNIX users
  users.users.rnhmjoj = {
    description = "My user";
    isNormalUser = true;
    createHome = true;
    extraGroups = [ "wheel" ]; # for sudo
    password = ""; # don't do this irl
  };

  # To resize the serial console automatically
  environment.systemPackages = [ pkgs.xterm ];
  environment.shellInit = "resize > /dev/null";

  ##
  ## VM settings
  ##
  imports = [ <nixpkgs/nixos/modules/virtualisation/qemu-vm.nix> ];

  virtualisation = {
    # Hardware
    graphics = false;
    cores = 4;
    memorySize = "4000"; # MB
    msize = 1048576;

    qemu.networkingOptions = [
      # Use SLIRP
      "-net nic,netdev=user.0,model=virtio"
      # Forward port 80/443
      "-netdev user,id=user.0,hostfwd=tcp::80-:80"
    ];
  };
}
```
To address both the original issue and possibly yours, @rnhmjoj: checking the timestamps in the log reveals that it's running extremely slowly:
my desktop:

```
discourse # [ 9.495618] systemd[1]: Finished discourse-postgresql.service.
discourse # [ 9.496474] systemd[1]: Starting discourse.service...
```

ofborg:

```
discourse # [ 17.131409] systemd[1]: Finished discourse-postgresql.service.
discourse # [ 17.136082] systemd[1]: Starting discourse.service...
```

@ryantm's log:

```
discourse # [ 78.778724] systemd[1]: Finished discourse-postgresql.service.
discourse # [ 78.802364] systemd[1]: Starting discourse.service...
```
This low performance points to virtualization extensions either being missing or disabled, unless the hardware is really old (but even my low-specced X1 Carbon from 2015 reaches this point in less than 30 seconds). That would explain why you're seeing the error from the log - it times out:
```
MiniRacer::ScriptTerminatedError: JavaScript was terminated (either by timeout or explicitly)
```
Just tested, ~18 minutes:
```
Sep 24 12:45:23 discourse-test systemd[1]: Starting discourse.service...
[...]
Sep 24 13:03:29 discourse-test systemd[1]: Started discourse.service.
```
The problem is not so much having to wait for a while, but that it starts all over every time you restart the service.
It doesn't seem to be CPU-bound at all: the rake process below is mostly in uninterruptible sleep (state D), so my guess is that it's slowed down by I/O through the 9pfs.
```
top - 13:20:19 up 35 min, 1 user, load average: 1.17, 1.41, 1.64
Tasks: 99 total, 1 running, 98 sleeping, 0 stopped, 0 zombie
%Cpu0 : 1.6/1.6 3[|| ]
%Cpu1 : 0.0/0.3 0[ ]
%Cpu2 : 0.0/0.4 0[ ]
%Cpu3 : 0.0/0.3 0[ ]
GiB Mem : 14.5/3.8 [|||||||||| ]
GiB Swap: 0.0/0.0 [ ]

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 162.7m 9.1m 6.8m S 0.0 0.2 0:01.07 systemd
525 root 20 0 37.3m 6.8m 6.1m S 0.0 0.2 0:00.26 `- systemd-journal
542 root 20 0 21.9m 6.6m 5.2m S 0.0 0.2 0:00.34 `- systemd-udevd
804 root 20 0 9.6m 5.6m 5.0m S 0.0 0.1 0:00.04 `- systemd-logind
834 message+ 20 0 7.4m 3.7m 3.2m S 0.0 0.1 0:00.05 `- dbus-daemon
889 root 20 0 217.3m 0.2m 0.2m S 0.0 0.0 0:00.01 `- agetty
890 root 20 0 219.4m 3.2m 2.8m S 0.0 0.1 0:00.02 `- login
1085 rnhmjoj 20 0 221.3m 6.0m 3.5m S 0.0 0.2 0:00.87 `- bash
2484 root 20 0 218.9m 3.4m 3.1m S 0.0 0.1 0:00.00 `- sudo
2485 root 20 0 223.7m 4.9m 4.4m S 0.0 0.1 0:00.00 `- systemctl
2486 root 20 0 8.7m 2.0m 1.6m S 0.0 0.1 0:00.00 `- syste+
2535 rnhmjoj 20 0 221.5m 3.5m 3.1m R 1.6 0.1 0:08.45 `- top
891 redis 20 0 271.6m 9.4m 7.1m S 0.3 0.2 0:16.98 `- redis-server
900 nginx 20 0 43.2m 9.3m 7.7m S 0.0 0.2 0:00.04 `- nginx
903 nginx 20 0 43.4m 2.5m 0.9m S 0.0 0.1 0:00.00 `- nginx
904 nginx 20 0 43.4m 2.6m 0.9m S 0.0 0.1 0:00.00 `- nginx
974 root 20 0 3.2m 1.9m 1.5m S 0.0 0.0 0:00.00 `- dhcpcd
1031 nscd 20 0 910.8m 2.6m 2.3m S 0.0 0.1 0:00.12 `- nscd
1047 postgres 20 0 285.5m 17.0m 15.9m S 0.0 0.4 0:00.08 `- postgres
1049 postgres 20 0 285.7m 13.9m 12.5m S 0.0 0.4 0:00.26 `- postgres
1050 postgres 20 0 285.6m 12.5m 11.2m S 0.0 0.3 0:00.30 `- postgres
1051 postgres 20 0 285.5m 4.5m 3.3m S 0.0 0.1 0:00.29 `- postgres
1052 postgres 20 0 286.0m 6.3m 4.8m S 0.0 0.2 0:00.03 `- postgres
1053 postgres 20 0 270.8m 3.9m 2.5m S 0.0 0.1 0:00.31 `- postgres
1054 postgres 20 0 286.0m 6.1m 4.7m S 0.0 0.2 0:00.00 `- postgres
1078 rnhmjoj 20 0 11.4m 7.2m 6.3m S 0.0 0.2 0:00.05 `- systemd
1079 rnhmjoj 20 0 164.3m 2.3m 0.0m S 0.0 0.1 0:00.00 `- (sd-pam)
2289 polkitu+ 20 0 2876.0m 23.3m 18.8m S 0.0 0.6 0:00.13 `- polkitd
2495 discour+ 20 0 218.2m 3.2m 2.7m S 0.0 0.1 0:00.00 `- discourse-pre-s
2533 discour+ 20 0 322.4m 47.7m 8.9m D 3.3 1.2 0:02.65 `- rake
2 root 20 0 0.0m 0.0m 0.0m S 0.0 0.0 0:00.00 kthreadd
```
@rnhmjoj Yeah, trying it out on my old laptop gave me a similar result: it took more than 20 minutes to start Discourse. Building a normal QEMU image using `make-disk-image.nix` brings that time down to 3min 30s on the laptop and 1min 10s on my desktop, so 9pfs is definitely to blame in your case. Handling many small files through 9pfs seems to be extremely CPU intensive: checking the load on the host, one core is pegged at 100%, and the guest shows the ruby process waiting on disk I/O. This is not a case of the service doing unnecessary or even duplicate work - it's just loading the ruby app and rake tasks through a very slow interface. When not using 9pfs, the postgresql process waiting on disk I/O becomes the main culprit in slowing down the service start, as expected.
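If you want to try the same thing, a minimal sketch of building such an image could look like the following. This is only an illustration: the `diskSize`/`format` values are placeholders, and the configuration would need a regular `fileSystems`/boot loader setup instead of the qemu-vm.nix import to actually boot from the resulting image.

```nix
# default.nix - sketch of building a qcow2 disk image for configuration.nix,
# so the VM boots from a real block device instead of the 9p-mounted store.
# (Assumptions: diskSize/format are placeholder values; configuration.nix must
# define its own fileSystems and boot loader for the image to be bootable.)
let
  nixos = import <nixpkgs/nixos> { configuration = ./configuration.nix; };
in
import <nixpkgs/nixos/lib/make-disk-image.nix> {
  inherit (nixos) pkgs config;
  lib = nixos.pkgs.lib;
  diskSize = 8192;   # MiB
  format = "qcow2";
}
```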
However, I also know that when the virtualization extensions are missing, it gets much slower still. My desktop's BIOS disables them by default, which causes massive slowdowns - enough to make tests that would otherwise pass fail. I think this is a much more likely explanation for @ryantm's issue, seeing as it takes ~10s for the VM to reach the point where it mounts `/`, which is before it even attempts to mount the nix store or does anything through 9pfs. For reference, my old laptop reaches that point in ~2s.
If 9pfs is really the bottleneck, increasing the `msize` should improve the performance a bit. This page suggests values as high as 100MiB. I haven't tried such large values yet.
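In the VM config above, that would be something like the following untested sketch, assuming `virtualisation.msize` takes a value in bytes, as with the 1MiB value used earlier:

```nix
{
  # Raise the 9p msize from 1 MiB to ~100 MiB (value in bytes).
  virtualisation.msize = 100 * 1024 * 1024;
}
```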
I've tried increasing `msize`, and it doesn't help at all, unfortunately. Setting it to 100MiB even made performance slightly worse.
@talyz is correct. My SVM virtualization option was disabled in the BIOS. I think this is because I recently upgraded the BIOS firmware and all the settings were reset to their defaults.
```
test script finished in 375.26s
```
Describe the bug
Discourse NixOS test hangs
Steps To Reproduce
Steps to reproduce the behavior:
log: https://gist.github.com/ryantm/2c60f2f041341afab0481fef56d30aa5
Expected behavior
Test completes with success exit code.
Notify maintainers
@talyz
Metadata
Please run `nix-shell -p nix-info --run "nix-info -m"` and paste the result.

Maintainer information: