Closed benley closed 7 months ago
a tiny bit more context:
The classifier task gets as far as printing these messages in /var/lib/paperless/log/paperless.log:
[2023-06-29 15:05:01,974] [DEBUG] [paperless.classifier] Gathering data from database...
[2023-06-29 15:05:02,341] [DEBUG] [paperless.classifier] 91 documents, 16 tag(s), 38 correspondent(s), 9 document type(s). 0 storage path(es)
[2023-06-29 15:05:02,341] [DEBUG] [paperless.classifier] Vectorizing data...
[2023-06-29 15:05:05,693] [DEBUG] [paperless.classifier] Training tags classifier...
[2023-06-29 15:05:23,120] [DEBUG] [paperless.classifier] Training correspondent classifier...
and then it gets stuck.
That last message comes from this spot in the code: https://github.com/paperless-ngx/paperless-ngx/blob/7a464d8a6eff11bcd0100330cb1687da50e196e6/src/documents/classifier.py#L279
...which means it's not even getting stuck on the first call to `MLPClassifier().fit()`, as the tags classifier finishes. It's the second one.
I'm seeing the same behavior. The only difference is I am getting stuck at the tags classifier.
[2023-10-16 06:10:12,745] [DEBUG] [paperless.classifier] Gathering data from database...
[2023-10-16 06:10:13,159] [DEBUG] [paperless.classifier] 309 documents, 14 tag(s), 1 correspondent(s), 0 document type(s). 1 storage path(es)
[2023-10-16 06:10:13,160] [DEBUG] [paperless.classifier] Vectorizing data...
[2023-10-16 06:10:13,961] [DEBUG] [paperless.classifier] Training tags classifier...
It hangs forever here, pegging a single core.
- system: `"x86_64-linux"`
- host os: `Linux 6.1.57, NixOS, 23.11 (Tapir), 23.11.20231011.5e4c2ad`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.17.0`
- nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
It appears it can get stuck on any classification task. A temporary workaround is to disable classification for some of the lesser-used tags/correspondents/doctypes until it works again.
The likely cause of this bug has been found: OpenBLAS. Building numpy against a different BLAS implementation, e.g. the proprietary mkl, resolves the issue.
I'm not sure how to start approaching upstream about this, as OpenBLAS is like 3 libraries down the dependency tree.
The nixpkgs manual BLAS/LAPACK section suggests using LD_LIBRARY_PATH to select a different BLAS implementation at runtime:
$ LD_LIBRARY_PATH=$(nix-build -A mkl)/lib${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH nix-shell -p octave --run octave
One could of course use an overlay to override BLAS, but that triggers a huge number of rebuilds on most systems. I'm going to see if I can get something working with LD_LIBRARY_PATH.
So far this seems to actually work! I wasn't able to use the `amd-blis` library due to missing symbols, but `mkl` worked:
services.paperless.extraConfig = {
  LD_LIBRARY_PATH = "${lib.getLib pkgs.mkl}/lib";
};
(That will be `services.paperless.settings` in nixos-unstable; I am running 23.11.)
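For comparison, the overlay route mentioned earlier would look roughly like the sketch below. It is based on the `blasProvider` / `lapackProvider` overrides described in the nixpkgs manual's BLAS/LAPACK section and has not been tested in this thread. MKL is unfree, so the sketch assumes unfree packages are allowed, and it triggers the mass rebuild noted above.

```nix
{ ... }:
{
  # Assumption: MKL is unfree, so unfree packages must be permitted.
  nixpkgs.config.allowUnfree = true;

  nixpkgs.overlays = [
    (final: prev: {
      # Swap the system-wide BLAS/LAPACK provider to MKL.
      # Everything that links against blas/lapack gets rebuilt.
      blas = prev.blas.override { blasProvider = final.mkl; };
      lapack = prev.lapack.override { lapackProvider = final.mkl; };
    })
  ];
}
```

The `LD_LIBRARY_PATH` approach avoids the rebuilds because it only changes which library the paperless services load at runtime.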
Another thing we could try is preventing BLAS from loading at all: the upstream wheels for numpy apparently don't include any BLAS implementation, and therefore neither does the paperless-ngx docker image.
How well is this working now? I might be experiencing the same issue.
I'm still running this config and the problem still has not returned in my setup, even after adding some new auto classifiers.
Should we add this to the paperless module?
FWIW, I've experienced the same problem and, upon finding this issue, also set `LD_LIBRARY_PATH` to use MKL, which made the problem go away for me, too.
With MKL being non-free I'm not 100% happy with it, though. Are there any chances this might also be solved by employing a newer OpenBLAS version? I lack the time to dig deeper into this, unfortunately.
OpenBLAS is on the newest release.
Can confirm that adding MKL fixed the problem I was having ("Classifier file does not exist").
I'm using nixos-23.11 where OpenBLAS is at 0.3.24. But if the problem persists with 0.3.26 from nixos-unstable, just waiting for the update to propagate to the next release obviously won't be the solution, unfortunately.
Instead of using MKL I've tried the following, and it seems to work for me, too:
services.paperless.extraConfig = {
  OPENBLAS_NUM_THREADS = 1;
  OMP_NUM_THREADS = 1;
  GOTO_NUM_THREADS = 1;
};
(It's probably sufficient to set one of them, but I wanted to be sure the setting takes effect no matter what.) Can someone confirm this makes the problem go away with OpenBLAS? Or was it just a fluke for me?
The culprit is `OMP_NUM_THREADS`. Setting it to 1 works around the issue.
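Reduced to the one variable that matters, the workaround might look like the sketch below. It uses the NixOS 23.11 option name from earlier in this thread; on nixos-unstable the option is reportedly `services.paperless.settings` instead.

```nix
{ ... }:
{
  # NixOS 23.11 option name; reportedly `services.paperless.settings`
  # on nixos-unstable.
  services.paperless.extraConfig = {
    # Limit OpenMP (and thus an OpenMP-built OpenBLAS) to one thread.
    OMP_NUM_THREADS = 1;
  };
}
```

This only affects the paperless services, so it needs no rebuilds, but classifier training then runs on a single thread.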
Aha:
"OpenBLAS ignores `OPENBLAS_NUM_THREADS` and `GOTO_NUM_THREADS` when compiled with `USE_OPENMP=1`."
And in NixOS, OpenBLAS is configured to set `USE_OPENMP=1` on many platforms. So that's why `OMP_NUM_THREADS` is the one to set. But I've also found this:
https://github.com/NixOS/nixpkgs/blob/56528ee42526794d413d6f244648aaee4a7b56c0/pkgs/development/libraries/science/math/openblas/default.nix#L6-L13
So maybe forcing the use of OpenBLAS with `singleThreaded = true` might be the proper solution here?
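As a rough sketch of that idea, assuming `singleThreaded` is an override argument of the `openblas` package as the linked file suggests, an overlay could look like this. It is untested here, whether it actually avoids the hang is unverified, and it rebuilds everything that depends on OpenBLAS.

```nix
{ ... }:
{
  nixpkgs.overlays = [
    (final: prev: {
      # Assumption: `singleThreaded` is accepted by openblas's default.nix
      # (see the linked lines). This changes OpenBLAS for the whole system.
      openblas = prev.openblas.override { singleThreaded = true; };
    })
  ];
}
```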
Paperless only uses OpenBLAS via scikit-learn via numpy. This is a generic library and paperless is not the exclusive user of any of these.
The proper solution is to figure out what causes OpenBLAS to spin on `sched_yield()` when multiple OMP threads are used, and the reason might be anywhere in the stack. The next step is likely to find a more minimal reproducer as low in this stack as possible and go bother the relevant upstream with it.
At least FreeBSD appears to run into the same issue, so I don't think it's an obvious packaging issue on our end.
Upstream wheels of numpy do not appear to use OpenBLAS at all. Perhaps we could also look into whether our numpy using OpenBLAS is necessary, supported and desirable.
Proposed https://github.com/NixOS/nixpkgs/pull/299008 as a workaround until a proper solution is found.
Describe the bug
Directly related to this discussion: https://github.com/paperless-ngx/paperless-ngx/discussions/2373
paperless-ngx runs an hourly `documents.tasks.train_classifier` celery beat task. This is supposed to take a few minutes on most systems, but on NixOS (for some users, at least) the task runs forever and is eventually killed by the celery timeout. The affected worker process doesn't respond to SIGTERM or even SIGQUIT; only SIGKILL can interrupt it.
Steps To Reproduce
Steps to reproduce the behavior:
Expected behavior
The classifier training task should finish successfully in a few minutes or less.
Screenshots
Additional context
See https://github.com/paperless-ngx/paperless-ngx/discussions/2373 for more background. This appears to be specific to paperless-ngx on NixOS.
Notify maintainers
@lukegb @gador @erikarvstedt @Flakebi
Metadata
Please run `nix-shell -p nix-info --run "nix-info -m"` and paste the result.
"x86_64-linux"
Linux 5.15.83, NixOS, 23.05 (Stoat), 23.05.git.9790f3242da2M
yes
yes
nix-env (Nix) 2.13.3
""
"home-manager-23.05.tar.gz, nixos-23.05"
/nix/var/nix/profiles/per-user/root/channels/nixos