NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License
18.05k stars 14.04k forks source link

Discourse NixOS test keeps restarting discourse #138844

Closed ryantm closed 3 years ago

ryantm commented 3 years ago

Describe the bug

Discourse NixOS test hangs

Steps To Reproduce

Steps to reproduce the behavior:

nix-build nixos/tests/discourse.nix

log: https://gist.github.com/ryantm/2c60f2f041341afab0481fef56d30aa5

Expected behavior

Test completes with success exit code.

Notify maintainers

@talyz

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[ryantm@home2:~/p/nixpkgs]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.14.3, NixOS, 21.11 (Porcupine)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.4pre20210908_3c56f62`
 - channels(ryantm): `"master"`
 - nixpkgs: `/nix/store/7ag892xlbbr0n07jxc2ipi7m93jqq34j-source`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
rnhmjoj commented 3 years ago

Not sure if related, but the discourse preStart script is insanely resource intensive: it seems to be doing an initialisation of sorts, the kind that I'd expect on the first run, but it does it every time the service is (re)started. Is this intentional?

talyz commented 3 years ago

@ryantm I'm not able to reproduce this on master. It seems to fail during migration for you, but it's not clear to me why that is.

@rnhmjoj Not sure which step you mean. The heaviest thing preStart does is probably the database migration, which has to be done on every start. Other than that, it creates an admin account (if it doesn't exist), updates themes and regenerates missing optimized images. I guess any rake task in a rails app could be considered comparatively resource intensive, though.

rnhmjoj commented 3 years ago

updates themes and regenerates missing optimized images

I think it was that: I remember seeing lots of imagemagick processes running in parallel. I had considered Discourse for a quick demo in a talk on NixOS, because it was taking >20min to start the service, in the end I opted for Jitsi Meet.

talyz commented 3 years ago

Oh, I see. Sorry to hear that.

The image optimization shouldn't do anything at all with a clean installation, since it only optimizes uploaded images. If it does, that sounds like a bug.

It's normal that it takes a while to start the service - database population and migration takes a while and Discourse is a rather heavy rails app. >20min sounds pretty excessive, though - the test completes in 6min 37s on my machine and that runs much slower than my test instance, an EC2 t3.small. Do you remember which hardware/VM you tried it on?

rnhmjoj commented 3 years ago

It was running in the usual Qemu VM generated with nixos-rebuild build-vm, with a bit of tweaking to use more memory and 4 cores. I happen to still have the configuration for it, if you want to try. The host itself was running on a x86_64 machine with KVM, 4 cores and 8G RAM and no swap.

talyz commented 3 years ago

Running the following simplified config on my machine, Discourse starts in 4min 20s. Can you try it out?

{ pkgs, lib, ... }:

let
  adminPassword = "eYAX85qmMJ5GZIHLaXGDAoszD7HSZp5d";
  secretKeyBase = "381f4ac6d8f5e49d804dae72aa9c046431d2f34c656a705c41cd52fed9b4f6f76f51549f0b55db3b8b0dded7a00d6a381ebe9a4367d2d44f5e743af6628b4d42";
in
{
  ##
  ## System Settings
  ##

  networking.hostName = "discourse-test";

  # Firewall
  networking.firewall.allowedTCPPorts = [ 80 ];

  services.postgresql.package = pkgs.postgresql_13;

  # Forum
  services.discourse = {
    enable = true;
    hostname = "localhost";
    secretKeyBaseFile = "${pkgs.writeText "secret-key-base" secretKeyBase}";
    enableACME = false;

    # Admin user
    admin =
      { email = "test@test.org";
        username = "admin";
        fullName = "Administrator";
        passwordFile = "${pkgs.writeText "admin-pass" adminPassword}";
      };

    # Forum settings
    siteSettings = {

      # Forum description
      required =
        { title = "HackLab Demo";
          site_description = "Discuss NixOS here (if the demo worked)";
        };

    };

  };

  # UNIX users
  users.users.rnhmjoj =
    { description = "My user";
      isNormalUser = true;
      createHome = true;
      extraGroups = [ "wheel" ]; # for sudo
      password = "";  # don't do this irl
    };

  # To resize the serial console automatically
  environment.systemPackages = [ pkgs.xterm ];
  environment.shellInit = "resize > /dev/null";

  ##
  ## VM settings
  ##

  imports = [ <nixpkgs/nixos/modules/virtualisation/qemu-vm.nix> ];

  virtualisation = {
    # Hardware
    graphics = false;
    cores = 4;
    memorySize = "4000"; # MB
    msize = 1048576;

    qemu.networkingOptions = [
      # Use SLIRP
      "-net nic,netdev=user.0,model=virtio"
      # Forward port 80/443
      "-netdev user,id=user.0,hostfwd=tcp::80-:80"
    ];
  };

}
talyz commented 3 years ago

To address both the original issue and possibly yours, @rnhmjoj: checking the timestamps in the log reveals that it's running extremely slow:

This low performance points to virtualization extensions either being missing or disabled, unless the hardware is really old (but even my low-specced X1 Carbon from 2015 reaches this point in less than 30 seconds). That would explain why you're seeing the error from the log - it times out:

MiniRacer::ScriptTerminatedError: JavaScript was terminated (either by timeout or explicitly)
rnhmjoj commented 3 years ago

Just tested, ~18 minutes:

Sep 24 12:45:23 discourse-test systemd[1]: Starting discourse.service...
[...]
Sep 24 13:03:29 discourse-test systemd[1]: Started discourse.service.

The problem is not much having to wait for a while, but that it starts all over when you restart the service.

rnhmjoj commented 3 years ago

It doesn't seem to be CPU-bound at all. My idea is that it's slowed down by I/O through the 9pfs.

top - 13:20:19 up 35 min,  1 user,  load average: 1.17, 1.41, 1.64
Tasks:  99 total,   1 running,  98 sleeping,   0 stopped,   0 zombie
%Cpu0  :   1.6/1.6     3[||                                                                     ]
%Cpu1  :   0.0/0.3     0[                                                                       ]
%Cpu2  :   0.0/0.4     0[                                                                       ]
%Cpu3  :   0.0/0.3     0[                                                                       ]
GiB Mem : 14.5/3.8      [||||||||||                                                             ]
GiB Swap:  0.0/0.0      [                                                                       ]

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                   
      1 root      20   0  162.7m   9.1m   6.8m S   0.0   0.2   0:01.07 systemd                   
    525 root      20   0   37.3m   6.8m   6.1m S   0.0   0.2   0:00.26  `- systemd-journal       
    542 root      20   0   21.9m   6.6m   5.2m S   0.0   0.2   0:00.34  `- systemd-udevd         
    804 root      20   0    9.6m   5.6m   5.0m S   0.0   0.1   0:00.04  `- systemd-logind        
    834 message+  20   0    7.4m   3.7m   3.2m S   0.0   0.1   0:00.05  `- dbus-daemon           
    889 root      20   0  217.3m   0.2m   0.2m S   0.0   0.0   0:00.01  `- agetty                
    890 root      20   0  219.4m   3.2m   2.8m S   0.0   0.1   0:00.02  `- login                 
   1085 rnhmjoj   20   0  221.3m   6.0m   3.5m S   0.0   0.2   0:00.87      `- bash              
   2484 root      20   0  218.9m   3.4m   3.1m S   0.0   0.1   0:00.00          `- sudo          
   2485 root      20   0  223.7m   4.9m   4.4m S   0.0   0.1   0:00.00              `- systemctl 
   2486 root      20   0    8.7m   2.0m   1.6m S   0.0   0.1   0:00.00                  `- syste+
   2535 rnhmjoj   20   0  221.5m   3.5m   3.1m R   1.6   0.1   0:08.45          `- top           
    891 redis     20   0  271.6m   9.4m   7.1m S   0.3   0.2   0:16.98  `- redis-server          
    900 nginx     20   0   43.2m   9.3m   7.7m S   0.0   0.2   0:00.04  `- nginx                 
    903 nginx     20   0   43.4m   2.5m   0.9m S   0.0   0.1   0:00.00      `- nginx             
    904 nginx     20   0   43.4m   2.6m   0.9m S   0.0   0.1   0:00.00      `- nginx             
    974 root      20   0    3.2m   1.9m   1.5m S   0.0   0.0   0:00.00  `- dhcpcd                
   1031 nscd      20   0  910.8m   2.6m   2.3m S   0.0   0.1   0:00.12  `- nscd                  
   1047 postgres  20   0  285.5m  17.0m  15.9m S   0.0   0.4   0:00.08  `- postgres              
   1049 postgres  20   0  285.7m  13.9m  12.5m S   0.0   0.4   0:00.26      `- postgres          
   1050 postgres  20   0  285.6m  12.5m  11.2m S   0.0   0.3   0:00.30      `- postgres          
   1051 postgres  20   0  285.5m   4.5m   3.3m S   0.0   0.1   0:00.29      `- postgres          
   1052 postgres  20   0  286.0m   6.3m   4.8m S   0.0   0.2   0:00.03      `- postgres          
   1053 postgres  20   0  270.8m   3.9m   2.5m S   0.0   0.1   0:00.31      `- postgres          
   1054 postgres  20   0  286.0m   6.1m   4.7m S   0.0   0.2   0:00.00      `- postgres          
   1078 rnhmjoj   20   0   11.4m   7.2m   6.3m S   0.0   0.2   0:00.05  `- systemd               
   1079 rnhmjoj   20   0  164.3m   2.3m   0.0m S   0.0   0.1   0:00.00      `- (sd-pam)          
   2289 polkitu+  20   0 2876.0m  23.3m  18.8m S   0.0   0.6   0:00.13  `- polkitd               
   2495 discour+  20   0  218.2m   3.2m   2.7m S   0.0   0.1   0:00.00  `- discourse-pre-s       
   2533 discour+  20   0  322.4m  47.7m   8.9m D   3.3   1.2   0:02.65      `- rake              
      2 root      20   0    0.0m   0.0m   0.0m S   0.0   0.0   0:00.00 kthreadd                  
talyz commented 3 years ago

@rnhmjoj Yeah, trying it out on my old laptop gave me a similar result where it took more than 20 minutes to start discourse. Building a normal qemu image using make-disk-image.nix brings that time down to 3min 30s on the laptop and 1min 10s on my desktop, so it seems 9pfs is definitely to blame in your case. Handling many small files through 9pfs seems to be extremely cpu intensive. Checking the load on the host, one core is pegged at 100% and the guest shows the ruby process waiting on disk io. This is not a case of the service doing unnecessary or even duplicate work - it's just loading the ruby app and rake tasks through a very slow interface. When not using 9pfs, the postgresql process waiting on disk io becomes the main culprit in slowing down the service start, as expected.

However, I also know that when the virtualization extensions are missing, it gets much slower still. My desktop's BIOS disables them by default, which causes massive slowdowns. These slowdowns are enough to make tests, that would otherwise pass, fail. I think this is a much more likely explanation for @ryantm's issue, seeing as it takes ~10s for it to reach the point where it mounts /, which is before it even attempts to mount the nix store or does anything through 9pfs. For reference, my old laptop reaches that point in ~2s.

rnhmjoj commented 3 years ago

If 9pfs is really the bottleneck increasing the msize should improve the performance a bit. This page suggests values as high as 100MiB. I haven't tried such large values, yet.

talyz commented 3 years ago

I've tried increasing msize, and it doesn't help at all, unfortunately. Setting it to 100MiB even made performance slightly worse..

ryantm commented 3 years ago

@talyz is correct. My SVM virtualization option was disabled in the BIOS. I think this is because I recently upgraded the BIOS firmware and all the settings were defaulted.

test script finished in 375.26s