Closed rw57 closed 4 months ago
This seems to be related to PyTorch. Hard crashes of PyTorch usually involve a bug in some instruction set of the CPU. Can you give me more information what kind of CPU you use and if there is maybe any virtualization involved?
It is an older computer but should have sufficient memory and storage. I'm running Fedora CoreOS on bare metal so no virtualization. I didn't see a particular CPU requirement in the pyTorch documentation. Any idea what it needs? How do I bypass or disable pyTorch?
uname -a
Linux hostname 6.7.7-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Mar 1 16:53:59 UTC 2024 x86_64 GNU/Linux
cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 5
model name : AMD Athlon(tm) II X4 635 Processor
stepping : 3
microcode : 0x10000b6
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 5786.01
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 5
model name : AMD Athlon(tm) II X4 635 Processor
stepping : 3
microcode : 0x10000b6
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 5786.01
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 2
vendor_id : AuthenticAMD
cpu family : 16
model : 5
model name : AMD Athlon(tm) II X4 635 Processor
stepping : 3
microcode : 0x10000b6
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 5786.01
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 3
vendor_id : AuthenticAMD
cpu family : 16
model : 5
model name : AMD Athlon(tm) II X4 635 Processor
stepping : 3
microcode : 0x10000b6
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 5786.01
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
We upgraded to PyTorch 2.3, maybe this got fixed in that release :)
🐛 Bug Report
📝 Description of issue:
The log is filled with python exception traces like the below. I'm scanning in tens of thousands of photos on a fresh Docker install.
🔁 How can we reproduce it:
Unsure. This happened on a fresh install. I reproduced it by deleting all the librephotos and database folders and running again. I'm running on podman instead of docker but the web interface is working well and I can see that it has found my photos. I don't think the torch library should cause the librephotos job to crash like this. Does it need some exception handling to fail more gracefully?
It's certainly possible this is an artifact of using podman. Here is the podman kube file I'm using with podman play kube (note that in podman Pods, all containers share an IP address and localhost):
Please provide additional information: