NixOS / nixpkgs

Nix Packages collection & NixOS

paperless-ngx: classifier training hangs & times out #240591

Closed by benley 7 months ago

benley commented 1 year ago

Describe the bug

Directly related to this discussion: https://github.com/paperless-ngx/paperless-ngx/discussions/2373

paperless-ngx runs an hourly documents.tasks.train_classifier celery beat task. This is supposed to take a few minutes on most systems, but on NixOS (for some users, at least) the task runs forever and is eventually killed by the celery timeout. The affected worker process doesn't respond to SIGTERM or even SIGQUIT; only SIGKILL can interrupt it.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Set up paperless-ngx and add some documents to it
  2. Wait a while(?) or manually trigger the train_classifier task
  3. It times out after 30 minutes, probably.

Expected behavior

The classifier training task should finish successfully in a few minutes or less.

Additional context

See https://github.com/paperless-ngx/paperless-ngx/discussions/2373 for more background. This appears to be specific to paperless-ngx on NixOS.

Notify maintainers

@lukegb @gador @erikarvstedt @Flakebi

benley commented 1 year ago

a tiny bit more context:

The classifier task gets as far as printing these messages in /var/lib/paperless/log/paperless.log:

[2023-06-29 15:05:01,974] [DEBUG] [paperless.classifier] Gathering data from database...
[2023-06-29 15:05:02,341] [DEBUG] [paperless.classifier] 91 documents, 16 tag(s), 38 correspondent(s), 9 document type(s). 0 storage path(es)
[2023-06-29 15:05:02,341] [DEBUG] [paperless.classifier] Vectorizing data...
[2023-06-29 15:05:05,693] [DEBUG] [paperless.classifier] Training tags classifier...
[2023-06-29 15:05:23,120] [DEBUG] [paperless.classifier] Training correspondent classifier...

and then it gets stuck.

That last message comes from this spot in the code: https://github.com/paperless-ngx/paperless-ngx/blob/7a464d8a6eff11bcd0100330cb1687da50e196e6/src/documents/classifier.py#L279

benley commented 1 year ago

...which means it's not even getting stuck on the first call to MLPClassifier().fit(), as the tags classifier finishes. It's the second one.

ryane commented 1 year ago

I'm seeing the same behavior. The only difference is I am getting stuck at the tags classifier.

[2023-10-16 06:10:12,745] [DEBUG] [paperless.classifier] Gathering data from database...
[2023-10-16 06:10:13,159] [DEBUG] [paperless.classifier] 309 documents, 14 tag(s), 1 correspondent(s), 0 document type(s). 1 storage path(es)
[2023-10-16 06:10:13,160] [DEBUG] [paperless.classifier] Vectorizing data...
[2023-10-16 06:10:13,961] [DEBUG] [paperless.classifier] Training tags classifier...

It hangs forever here, pegging a single core.

 - system: `"x86_64-linux"`
 - host os: `Linux 6.1.57, NixOS, 23.11 (Tapir), 23.11.20231011.5e4c2ad`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.17.0`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Atemu commented 1 year ago

It appears it can get stuck on any classification task. A temporary workaround is to disable classification for some of the lesser-used tags/correspondents/doctypes until it works again.

Atemu commented 10 months ago

The likely cause of this bug has been found: OpenBLAS. Building numpy against a different BLAS implementation, e.g. the proprietary MKL, resolves the issue.

I'm not sure how to start approaching upstream in this as OpenBLAS is like 3 libraries down the dependency tree.

benley commented 9 months ago

> I'm not sure how to start approaching upstream in this as OpenBLAS is like 3 libraries down the dependency tree.

The nixpkgs manual BLAS/LAPACK section suggests using LD_LIBRARY_PATH to select a different BLAS implementation at runtime:

$ LD_LIBRARY_PATH=$(nix-build -A mkl)/lib${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH nix-shell -p octave --run octave

One could of course use an overlay to override BLAS, but that triggers a huge number of rebuilds on most systems. I'm going to see if I can get something working with LD_LIBRARY_PATH.
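
For reference, the overlay approach mentioned above looks roughly like the following sketch, modeled on the BLAS/LAPACK section of the nixpkgs manual. It swaps the provider behind the `blas` and `lapack` wrapper packages. MKL is unfree, so this additionally assumes unfree packages are allowed. Everything that links against BLAS gets rebuilt, which is the cost described above.

```nix
# Overlay sketch: point the blas/lapack wrappers at MKL instead of OpenBLAS.
# Assumes MKL is permitted by your unfree settings; expect a large rebuild.
final: prev: {
  blas = prev.blas.override {
    blasProvider = final.mkl;
  };

  lapack = prev.lapack.override {
    lapackProvider = final.mkl;
  };
}
```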

benley commented 9 months ago

> One could of course use an overlay to override BLAS, but that triggers a huge number of rebuilds on most systems. I'm going to see if I can get something working with LD_LIBRARY_PATH.

So far this seems to actually work! I wasn't able to use the amd-blis library due to missing symbols, but mkl worked:

services.paperless.extraConfig = {
  LD_LIBRARY_PATH = "${lib.getLib pkgs.mkl}/lib";
};

(that will be services.paperless.settings in nixos-unstable; I am running 23.11)
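
On a channel where the option has been renamed, the same workaround might look like the sketch below, written as a complete NixOS module. The `allowUnfreePredicate` line is an addition not mentioned in the thread; it is included because MKL is unfree and will not evaluate unless unfree packages are already allowed some other way.

```nix
{ lib, pkgs, ... }:
{
  # Assumption: allow only MKL rather than all unfree packages.
  nixpkgs.config.allowUnfreePredicate =
    pkg: builtins.elem (lib.getName pkg) [ "mkl" ];

  # Same setting as above, under the newer option name.
  services.paperless.settings = {
    LD_LIBRARY_PATH = "${lib.getLib pkgs.mkl}/lib";
  };
}
```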

Atemu commented 9 months ago

Another thing we could perhaps try is to prevent BLAS from loading at all: the upstream wheels for numpy apparently don't include any BLAS implementation, and therefore neither does the paperless-ngx docker image.

Lyndeno commented 7 months ago

> One could of course use an overlay to override BLAS, but that triggers a huge number of rebuilds on most systems. I'm going to see if I can get something working with LD_LIBRARY_PATH.
>
> So far this seems to actually work! I wasn't able to use the amd-blis library due to missing symbols, but mkl worked:
>
> services.paperless.extraConfig = {
>   LD_LIBRARY_PATH = "${lib.getLib pkgs.mkl}/lib";
> };
>
> (that will be services.paperless.settings in nixos-unstable; I am running 23.11)

How well is this working now? I might be experiencing the same issue.

benley commented 7 months ago

> How well is this working now? I might be experiencing the same issue.

I'm still running this config and the problem still has not returned in my setup, even after adding some new auto classifiers.

SuperSandro2000 commented 7 months ago

Should we add this to the paperless module?

martinholters commented 7 months ago

FWIW, I've experienced the same problem and upon finding this issue also set LD_LIBRARY_PATH to use MKL, which made the problem go away for me, too. With MKL being non-free I'm not 100% happy with it, though. Are there any chances this might also be solved by employing a newer OpenBLAS version? I lack the time to dig deeper into this, unfortunately.

Atemu commented 7 months ago

OpenBLAS is on the newest release.

Lyndeno commented 7 months ago

> How well is this working now? I might be experiencing the same issue.
>
> I'm still running this config and the problem still has not returned in my setup, even after adding some new auto classifiers.

Can confirm that adding MKL fixed the problem I was having ("Classifier file does not exist").

martinholters commented 7 months ago

> OpenBLAS is on the newest release.

I'm using nixos-23.11 where OpenBLAS is at 0.3.24. But if the problem persists with 0.3.26 from nixos-unstable, just waiting for the update to propagate to the next release obviously won't be the solution, unfortunately.

martinholters commented 7 months ago

Instead of using MKL I've tried the following, and it seems to work for me, too:

services.paperless.extraConfig = {
  OPENBLAS_NUM_THREADS = 1;
  OMP_NUM_THREADS = 1;
  GOTO_NUM_THREADS = 1;
};

(It's probably sufficient to set one of them, but I wanted to be sure the setting takes effect no matter what.) Can someone confirm this makes the problem go away with OpenBLAS? Or was it just a fluke for me?

Atemu commented 7 months ago

The culprit is OMP_NUM_THREADS. Setting it to 1 works around the issue.
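
Reduced to the one variable that matters, the workaround could be written as below. This is a sketch for 23.11; on nixos-unstable the option is `services.paperless.settings`, as noted earlier in the thread. The value is given as a string here, which is an assumption made to stay safe regardless of how the module converts values.

```nix
{
  # Limit OpenMP (and with it the OpenMP-enabled OpenBLAS) to one thread.
  # This only affects the paperless services, not other users of OpenBLAS.
  services.paperless.extraConfig = {
    OMP_NUM_THREADS = "1";
  };
}
```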

martinholters commented 7 months ago

Aha: "OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1." And in NixOS, OpenBLAS is built with USE_OPENMP=1 on many platforms. So that's why OMP_NUM_THREADS is the one to set. But I've also found this: https://github.com/NixOS/nixpkgs/blob/56528ee42526794d413d6f244648aaee4a7b56c0/pkgs/development/libraries/science/math/openblas/default.nix#L6-L13 So maybe forcing paperless to use an OpenBLAS built with singleThreaded = true might be the proper solution here?
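
As an untested sketch of that idea: `singleThreaded` is an argument of the OpenBLAS package in the linked file, so it can be set with `.override` in an overlay. Nobody in the thread reports trying this, and because it replaces `openblas` for the whole package set, it rebuilds every dependent, like the BLAS overlay discussed earlier.

```nix
# Overlay sketch (untested): build OpenBLAS single-threaded everywhere.
final: prev: {
  openblas = prev.openblas.override {
    singleThreaded = true;
  };
}
```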

Atemu commented 7 months ago

Paperless only uses OpenBLAS via scikit-learn via numpy. This is a generic library and paperless is not the exclusive user of any of these.

The proper solution is to figure out what causes OpenBLAS to spin on sched_yield() when multiple OMP threads are used; the reason might be anywhere in the stack. The next step is likely to find a more minimal reproducer as low in this stack as possible and go bother the relevant upstream with it.

At least FreeBSD appears to run into the same issue, so I don't think it's an obvious packaging issue on our end.

Upstream wheels of numpy do not appear to use OpenBLAS at all. Perhaps we could also look into whether our numpy using OpenBLAS is necessary, supported and desirable.

Atemu commented 7 months ago

Proposed https://github.com/NixOS/nixpkgs/pull/299008 as a workaround until a proper solution is found.