LibrePhotos / librephotos

A self-hosted open source photo management service. This is the repository of the backend.
MIT License

torch causes Fatal Python error: Floating point exception #1198

Closed rw57 closed 4 months ago

rw57 commented 6 months ago

🐛 Bug Report

📝 Description of issue:

The log is filled with Python exception traces like the one below. I'm scanning in tens of thousands of photos on a fresh Docker install.

00:31:21 [Q] CRITICAL reincarnated worker Process-e59e78ff6711490fb016575816db4f62 after death
00:31:21 [Q] INFO Process-5affe1a61cf44377ab85d669f69acbb0 ready for work at 11707
00:31:21 [Q] INFO Process-5affe1a61cf44377ab85d669f69acbb0 processing coffee-uniform-ack-papa 'api.directory_watcher.handle_new_image'
INFO:ownphotos:job f61d95b4-fbe3-4bda-a5e9-3e591c2aefed: calculate aspect ratio: /data/XXXXXXPATHTOMYPHOTOXXXXX.jpg, elapsed: 1.269778
Fatal Python error: Floating point exception

Current thread 0x00007fa671ffd040 (most recent call first):
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/conv.py", line 456 in _conv_forward
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/conv.py", line 460 in forward
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1520 in _call_impl
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1511 in _wrapped_call_impl
  File "/code/api/places365/wideresnet.py", line 95 in forward
  File "/code/api/places365/places365.py", line 140 in inference_places365
  File "/code/api/models/photo.py", line 271 in _generate_captions
  File "/code/api/directory_watcher.py", line 168 in handle_new_image
  File "/usr/local/lib/python3.11/dist-packages/django_q/worker.py", line 97 in worker
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/usr/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
  File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 191 in spawn_process
  File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 198 in spawn_worker
  File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 227 in reincarnate
  File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 306 in guard
  File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 167 in start
  File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 158 in __init__
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/usr/lib/python3.11/multiprocessing/context.py", line 224 in _Popen
  File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/usr/local/lib/python3.11/dist-packages/django_q/cluster.py", line 66 in start
  File "/usr/local/lib/python3.11/dist-packages/django_q/management/commands/qcluster.py", line 37 in handle
  File "/usr/local/lib/python3.11/dist-packages/django/core/management/base.py", line 458 in execute
  File "/usr/local/lib/python3.11/dist-packages/django/core/management/base.py", line 412 in run_from_argv
  File "/usr/local/lib/python3.11/dist-packages/django/core/management/__init__.py", line 436 in execute
  File "/usr/local/lib/python3.11/dist-packages/django/core/management/__init__.py", line 442 in execute_from_command_line
  File "/code/manage.py", line 31 in

Extension modules: psutil._psutil_linux, psutil._psutil_posix, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, charset_normalizer.md, _cffi_backend, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imaging, PIL._imagingft, yaml._yaml, matplotlib._c_internal_utils, matplotlib._path, kiwisolver._cext, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, matplotlib._image, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, 
scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._cdflib, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, 
scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, scipy.cluster._vq, scipy.cluster._hierarchy, scipy.cluster._optimal_leaf_ordering, sklearn.__check_build._check_build, sklearn.utils._isfinite, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.neighbors._partition_nodes, sklearn.neighbors._ball_tree, sklearn.neighbors._kd_tree, sklearn.utils.arrayfuncs, sklearn.utils._random, sklearn.utils._seq_dataset, sklearn.linear_model._cd_fast, sklearn._loss._loss, sklearn.svm._liblinear, sklearn.svm._libsvm, 
sklearn.svm._libsvm_sparse, sklearn.utils._weight_vector, sklearn.linear_model._sgd_fast, sklearn.linear_model._sag_fast, sklearn.decomposition._online_lda_fast, sklearn.decomposition._cdnmf_fast, hdbscan.dist_metrics, hdbscan._hdbscan_linkage, hdbscan._hdbscan_tree, hdbscan._hdbscan_reachability, hdbscan._hdbscan_boruvka, sklearn._isotonic, sklearn.tree._utils, sklearn.tree._tree, sklearn.tree._splitter, sklearn.tree._criterion, sklearn.neighbors._quad_tree, sklearn.manifold._barnes_hut_tsne, sklearn.manifold._utils, hdbscan._prediction_utils, PIL._imagingmath, PIL._webp (total: 232)

🔁 How can we reproduce it:

Unsure. This happened on a fresh install. I reproduced it by deleting all the librephotos and database folders and running again. I'm running on podman instead of docker but the web interface is working well and I can see that it has found my photos. I don't think the torch library should cause the librephotos job to crash like this. Does it need some exception handling to fail more gracefully?
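
On the exception-handling question: "Fatal Python error: Floating point exception" is what Python prints when the process receives SIGFPE. That is a signal raised in native code, not a Python exception, so a try/except around the call most likely cannot catch it. The sketch below illustrates this with a stand-in crash (the child sends itself SIGFPE) and shows that a parent process can still detect the death from the child's return code. `run_isolated`, `CRASH` and `GUARDED` are hypothetical names for illustration only and are not part of LibrePhotos or django-q.

```python
import signal
import subprocess
import sys


def run_isolated(snippet: str) -> str:
    """Run a Python snippet in a child process and describe how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet], capture_output=True, text=True
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    if proc.returncode != 0:
        return f"exited with status {proc.returncode}"
    return "ok"


# Stand-in for the crashing native call (assumption: a real crash would come
# from inside a compiled extension, not from os.kill).
CRASH = "import os, signal; os.kill(os.getpid(), signal.SIGFPE)"

# The same stand-in wrapped in try/except: the handler never runs.
GUARDED = (
    "import os, signal\n"
    "try:\n"
    "    os.kill(os.getpid(), signal.SIGFPE)\n"
    "except Exception:\n"
    "    print('caught')\n"
)

print(run_isolated("print('hello')"))
print(run_isolated(CRASH))
print(run_isolated(GUARDED))
```

This matches what the log shows: the worker dies and the django-q cluster spawns a replacement ("reincarnated worker ... after death"), so the crash is contained at the process level rather than handled in Python.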

It's certainly possible this is an artifact of using podman. Here is the podman kube file I'm using with podman play kube (note that in podman Pods, all containers share an IP address and localhost):

# Save the output of this file and use kubectl create -f to import
# it into Kubernetes.
#
# Created with podman-4.9.3

# NOTE: If you generated this yaml from an unprivileged and rootless podman container on an SELinux
# enabled system, check the podman generate kube man page for steps to follow to ensure that your pod/container
# has the right permissions to access the volumes added.
---
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-04-08T09:10:59Z"
  labels:
    app: librephotos
  name: librephotos
spec:
  containers:
  - args:
    - postgres
    - -c
    - fsync=off
    - -c
    - synchronous_commit=off
    - -c
    - full_page_writes=off
    - -c
    - random_page_cost=1.0
    env:
    - name: POSTGRES_USER
      value: docker
    - name: POSTGRES_PASSWORD
      value: MYPASSWORDHERE
    - name: POSTGRES_DB
      value: librephotos
    image: docker.io/library/postgres:13
    name: db
    volumeMounts:
    - mountPath: /var/lib/postgresql/data
      name: storage-storage-librephotos-data-db-host-0
  - args:
    - nginx
    - -g
    - daemon off;
    image: docker.io/reallibrephotos/librephotos-proxy:latest
    name: proxy
    ports:
    - containerPort: 80
      hostPort: 3000
    volumeMounts:
    - mountPath: /data
      name: storage-pictures-host-0
      readOnly: true
    - mountPath: /protected_media
      name: storage-storage-librephotos-data-protected_media-host-1
  - image: docker.io/reallibrephotos/librephotos-frontend:latest
    name: frontend
    securityContext: {}
  - env:
    - name: DB_PORT
      value: "5432"
    - name: BACKEND_HOST
      value: backend
    - name: DB_NAME
      value: librephotos
    - name: DB_BACKEND
      value: postgresql
    - name: DB_PASS
      value: MYPASSWORDHERE
    - name: DB_USER
      value: docker
    - name: DB_HOST
      value: localhost
    - name: DEBUG
      value: "0"
    - name: WEB_CONCURRENCY
      value: "1"
    - name: ALLOW_UPLOAD
      value: "false"
    image: docker.io/reallibrephotos/librephotos:latest
    name: backend
    volumeMounts:
    - mountPath: /root/.cache
      name: storage-storage-librephotos-data-cache-host-0
    - mountPath: /data
      name: storage-pictures-host-1
      readOnly: true
    - mountPath: /protected_media
      name: storage-storage-librephotos-data-protected_media-host-2
    - mountPath: /logs
      name: storage-storage-librephotos-data-logs-host-3
  volumes:
  - hostPath:
      path: /storage/librephotos/data/db
      type: Directory
    name: storage-storage-librephotos-data-db-host-0
  - hostPath:
      path: /pictures
      type: Directory
    name: storage-pictures-host-0
  - hostPath:
      path: /storage/librephotos/data/protected_media
      type: Directory
    name: storage-storage-librephotos-data-protected_media-host-1
  - hostPath:
      path: /storage/librephotos/data/cache
      type: Directory
    name: storage-storage-librephotos-data-cache-host-0
  - hostPath:
      path: /pictures
      type: Directory
    name: storage-pictures-host-1
  - hostPath:
      path: /storage/librephotos/data/protected_media
      type: Directory
    name: storage-storage-librephotos-data-protected_media-host-2
  - hostPath:
      path: /storage/librephotos/data/logs
      type: Directory
    name: storage-storage-librephotos-data-logs-host-3

Please provide additional information:

derneuere commented 6 months ago

This seems to be related to PyTorch. Hard crashes of PyTorch usually involve a problem with one of the CPU's instruction sets. Can you tell me what kind of CPU you use and whether any virtualization is involved?

rw57 commented 6 months ago

It is an older computer but should have sufficient memory and storage. I'm running Fedora CoreOS on bare metal, so no virtualization. I didn't see a particular CPU requirement in the PyTorch documentation. Any idea what it needs? How do I bypass or disable PyTorch?

uname -a
Linux hostname 6.7.7-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Mar 1 16:53:59 UTC 2024 x86_64 GNU/Linux

cat /proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 5
model name      : AMD Athlon(tm) II X4 635 Processor
stepping        : 3
microcode       : 0x10000b6
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 5786.01
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 5
model name      : AMD Athlon(tm) II X4 635 Processor
stepping        : 3
microcode       : 0x10000b6
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 5786.01
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 5
model name      : AMD Athlon(tm) II X4 635 Processor
stepping        : 3
microcode       : 0x10000b6
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 5786.01
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 5
model name      : AMD Athlon(tm) II X4 635 Processor
stepping        : 3
microcode       : 0x10000b6
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips        : 5786.01
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
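
To make the flags output above easier to read against the instruction-set question, here is a minimal sketch that extracts the `flags` line from `/proc/cpuinfo` text and lists which SIMD extensions from a watch-list are absent. The watch-list `WATCHED` is an assumption chosen for illustration; it is not a documented PyTorch requirement, and a missing flag does not by itself prove the cause of the crash.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the flag set of the first processor in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Hypothetical watch-list of x86 SIMD extensions (an assumption, see above).
WATCHED = ("ssse3", "sse4_1", "sse4_2", "avx", "avx2", "fma")

# The flags line reported in this thread for the AMD Athlon II X4 635.
SAMPLE = (
    "flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca "
    "cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt "
    "pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc "
    "cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic "
    "cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt "
    "hw_pstate vmmcall npt lbrv svm_lock nrip_save"
)

flags = cpu_flags(SAMPLE)
missing = [name for name in WATCHED if name not in flags]
print("missing:", ", ".join(missing) or "none")
```

For the reported CPU, none of the watched extensions appear: the flags stop at SSE3 (`pni`) plus AMD's `sse4a`. To check a live machine, pass the contents of `/proc/cpuinfo` to `cpu_flags` instead of `SAMPLE`.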

derneuere commented 5 months ago

We upgraded to PyTorch 2.3; maybe this got fixed in that release :)