chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
72 stars 19 forks

3137 illegal hardware instruction (core dumped) python #1121

Open Mnikley opened 2 months ago

Mnikley commented 2 months ago

Describe the bug

Calling open_soma() crashes the Python interpreter with "illegal hardware instruction (core dumped)".

To Reproduce

Setup

python3.11 -m venv venv
source venv/bin/activate
pip install -U cellxgene-census

Attempt to run

❯ python
Python 3.11.9 (main, Apr  6 2024, 17:59:24) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cellxgene_census
>>> soma = cellxgene_census.open_soma()
The "stable" release is currently 2023-12-15. Specify 'census_version="2023-12-15"' in future calls to open_soma() to ensure data consistency.
[1]    3766 illegal hardware instruction (core dumped)  python

Expected behavior

open_soma() should return without crashing.

Environment

ebezzi commented 2 months ago

Hey @Mnikley,

to debug this, it would be helpful to have more information on what hardware/configuration you're using. In particular, are you running this in Docker? If so, what is the CPU type of the host machine?

It would also be useful if you could run import tiledbsoma; tiledbsoma.show_package_versions() and paste the output. If you prefer, you can reach out to me via email at ebezzi@chanzuckerberg.com.

Mnikley commented 2 months ago

Hey @ebezzi,

thank you for your answer - it seems like the problem is indeed related to the CPU we are using, as we run a VM on a Proxmox server. The code works fine on another machine with direct access to the CPU (no virtualization). However, it would be great to run the code on our VM anyway.

On the VM:

❯ python -c "import tiledbsoma; tiledbsoma.show_package_versions()"
tiledbsoma.__version__        1.9.5
TileDB-Py tiledb.version()    (0, 27, 1)
TileDB core version           2.21.1
[1]    6305 illegal hardware instruction (core dumped)  python -c "import tiledbsoma; tiledbsoma.show_package_versions()"

More details about the CPU in the VM:

❯ lscpu
Architecture:           x86_64
  CPU op-mode(s):       32-bit, 64-bit
  Address sizes:        40 bits physical, 48 bits virtual
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              GenuineIntel
  Model name:           QEMU Virtual CPU version 2.5+
    CPU family:         15
    Model:              107
    Thread(s) per core: 1
    Core(s) per socket: 8
    Socket(s):          1
    Stepping:           1
    BogoMIPS:           6192.42
    Flags:              fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm cpuid_fault pti
Caches (sum of all):
  L1d:                  256 KiB (8 instances)
  L1i:                  256 KiB (8 instances)
  L2:                   32 MiB (8 instances)
  L3:                   16 MiB (1 instance)
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-7
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        KVM: Mitigation: VMX unsupported
  L1tf:                 Mitigation; PTE Inversion
  Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:             Mitigation; PTI
  Mmio stale data:      Unknown: No mitigations
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Vulnerable
  Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                Not affected
  Tsx async abort:      Not affected

and the CPU information on the host machine running Proxmox:

root@pve:~# lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          52 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel(R) Corporation
  Model name:             Intel(R) Xeon(R) w5-2465X
    BIOS Model name:      Intel(R) Xeon(R) w5-2465X  CPU @ 3.1GHz
    BIOS CPU family:      179
    CPU family:           6
    Model:                143
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             8
    CPU(s) scaling MHz:   27%
    CPU max MHz:          4700.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6192.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    768 KiB (16 instances)
  L1i:                    512 KiB (16 instances)
  L2:                     32 MiB (16 instances)
  L3:                     33.8 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-31
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Running the code directly on the host is unfortunately not an option for us.
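
The two "Flags:" lines from the lscpu outputs above can be compared directly to see which instruction-set extensions the guest does not expose. A minimal sketch, assuming the flags are pasted in as plain strings; the helper name `missing_in_guest` is hypothetical, and only an abbreviated subset of each flag list is reproduced here:

```python
def missing_in_guest(host_flags: str, guest_flags: str) -> set[str]:
    """Return CPU flags the host reports but the guest does not."""
    return set(host_flags.split()) - set(guest_flags.split())


# Abbreviated excerpts of the two "Flags:" lines from the lscpu output above.
guest = "fpu sse sse2 ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor"
host = (
    "fpu sse sse2 ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes "
    "xsave avx f16c avx2 bmi1 bmi2"
)

# Print the flags that exist on the host but are hidden from the VM.
print(sorted(missing_in_guest(host, guest)))
```

With these excerpts the difference includes avx and avx2, so the virtual CPU model hides instructions the physical CPU supports. Whether the prebuilt binaries actually rely on any of them is a separate question.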

ebezzi commented 2 months ago

Hey @Mnikley ,

since we can't easily reproduce your issue (we don't have the same virtualization setup available), it would be useful if you could run the debugger and provide a stack trace that could help us identify the problem.

gdb -- python
(gdb) r
>>> import cellxgene_census
>>> soma = cellxgene_census.open_soma()
# illegal instruction should be caught and printed here...
(gdb) bt

This should generate a stack trace that will hopefully show which library is executing the illegal instruction.
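
If gdb is not available, Python's standard-library faulthandler module can at least show which Python call was running when the fatal signal arrived. A minimal sketch; the wrapper name `call_with_faulthandler` is hypothetical, and the Python-level traceback it prints does not replace the native backtrace that gdb provides:

```python
import faulthandler


def call_with_faulthandler(func, *args, **kwargs):
    """Enable faulthandler, then call func with the given arguments.

    On a fatal signal such as SIGILL or SIGSEGV, faulthandler dumps the
    Python traceback of all threads to stderr before the process dies.
    """
    faulthandler.enable()
    return func(*args, **kwargs)


# Usage, assuming cellxgene_census is installed:
#   import cellxgene_census
#   soma = call_with_faulthandler(cellxgene_census.open_soma)
```

The same effect is available without code changes by starting the interpreter with `python -X faulthandler`.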

Mnikley commented 2 months ago

Dear @ebezzi ,

this is the stack trace:

>>> soma = cellxgene_census.open_soma()
The "stable" release is currently 2023-12-15. Specify 'census_version="2023-12-15"' in future calls to open_soma() to ensure data consistency.
[New Thread 0x7fffb4029640 (LWP 16613)]
[New Thread 0x7fffb3828640 (LWP 16614)]
[New Thread 0x7fffb3027640 (LWP 16615)]
[New Thread 0x7fffb2826640 (LWP 16616)]
[New Thread 0x7fffb2025640 (LWP 16617)]
[New Thread 0x7fffb1824640 (LWP 16618)]
[New Thread 0x7fffb1023640 (LWP 16619)]
[New Thread 0x7fffb0822640 (LWP 16620)]
[New Thread 0x7fffb0021640 (LWP 16621)]
[New Thread 0x7fffaf820640 (LWP 16622)]
[New Thread 0x7fffaf01f640 (LWP 16623)]
[New Thread 0x7fffae81e640 (LWP 16624)]
[New Thread 0x7fffae01d640 (LWP 16625)]
[New Thread 0x7fffad81c640 (LWP 16626)]
[New Thread 0x7fffad01b640 (LWP 16627)]
[New Thread 0x7fffac81a640 (LWP 16628)]
[New Thread 0x7fffac001640 (LWP 16629)]
[New Thread 0x7fffa3fff640 (LWP 16630)]
[New Thread 0x7fffab800640 (LWP 16631)]
[New Thread 0x7fffaafff640 (LWP 16632)]
[New Thread 0x7fffaa7fe640 (LWP 16633)]

Thread 1 "python" received signal SIGILL, Illegal instruction.
0x00007fffc0ac7e32 in void std::vector<std::string, std::allocator<std::string> >::_M_realloc_insert<std::string const&>(__gnu_cxx::__normal_iterator<std::string*, std::vector<std::string, std::allocator<std::string> > >, std::string const&) () from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledbsoma-4d9c0eb1.so
(gdb) bt
#0  0x00007fffc0ac7e32 in void std::vector<std::string, std::allocator<std::string> >::_M_realloc_insert<std::string const&>(__gnu_cxx::__normal_iterator<std::string*, std::vector<std::string, std::allocator<std::string> > >, std::string const&) ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledbsoma-4d9c0eb1.so
#1  0x00007fffbf760f62 in void Aws::Http::URI::AddPathSegments<std::string>(std::string) ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#2  0x00007fffbf78bcf3 in Aws::Http::URI::SetPath(std::string const&) ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#3  0x00007fffbf78beb7 in Aws::Http::URI::ExtractAndSetPath(std::string const&) ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#4  0x00007fffbf78c083 in Aws::Http::URI::ParseURIParts(std::string const&) ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#5  0x00007fffbf784c7d in Aws::Http::CreateHttpRequest(std::string const&, Aws::Http::HttpMethod, std::function<std::iostream* ()> const&) ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#6  0x00007fffbf79ede8 in Aws::Internal::EC2MetadataClient::GetCurrentRegion() const ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#7  0x00007fffbf8089aa in Aws::Client::ClientConfiguration::ClientConfiguration() ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#8  0x00007fffbed9bb2b in tiledb::sm::S3::init_client() const ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#9  0x00007fffbeda1c39 in tiledb::sm::S3::is_dir(tiledb::sm::URI const&, bool*) const ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#10 0x00007fffbedd7c92 in tiledb::sm::VFS::is_dir(tiledb::sm::URI const&, bool*) const ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#11 0x00007fffbf4fa99e in tiledb::sm::StorageManagerCanonical::is_array(tiledb::sm::URI const&) const ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#12 0x00007fffbf4fc183 in tiledb::sm::StorageManagerCanonical::object_type(tiledb::sm::URI const&, tiledb::sm::ObjectType*) const ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#13 0x00007fffbece08f1 in tiledb_object_type () from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledb-22504c42.so.2.21
#14 0x00007fffc0ae98e4 in tiledbsoma::SOMAObject::open(std::basic_string_view<char, std::char_traits<char> >, OpenMode, std::shared_ptr<tiledbsoma::SOMAContext>, std::optional<std::pair<unsigned long, unsigned long> >, std::optional<std::string>) ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledbsoma-4d9c0eb1.so
#15 0x00007fffc0cbbd4a in libtiledbsomacpp::load_soma_object(pybind11::module_&)::{lambda(std::basic_string_view<char, std::char_traits<char> >, OpenMode, std::shared_ptr<tiledbsoma::SOMAContext>, std::optional<std::pair<unsigned long, unsigned long> >, std::optional<std::string>)#1}::operator()(std::basic_string_view<char, std::char_traits<char> >, OpenMode, std::shared_ptr<tiledbsoma::SOMAContext>, std::optional<std::pair<unsigned long, unsigned long> >, std::optional<std::string>) const [clone .constprop.0] ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/pytiledbsoma.cpython-311-x86_64-linux-gnu.so
#16 0x00007fffc0cbda0f in pybind11::cpp_function::initialize<libtiledbsomacpp::load_soma_object(pybind11::module_&)::{lambda(std::basic_string_view<char, std::char_traits<char> >, OpenMode, std::shared_ptr<tiledbsoma::SOMAContext>, std::optional<std::pair<unsigned long, unsigned long> >, std::optional<std::string>)#1}, pybind11::object, std::basic_string_view<char, std::char_traits<char> >, OpenMode, std::shared_ptr<tiledbsoma::SOMAContext>, std::optional<std::pair<unsigned long, unsigned long> >, std::optional<std::string>, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::kw_only, pybind11::arg_v, pybind11::arg_v>(libtiledbsomacpp::load_soma_object(pybind11::module_&)::{lambda(std::basic_string_view<char, std::char_traits<char> >, OpenMode, std::shared_ptr<tiledbsoma::SOMAContext>, std::optional<std::pair<unsigned long, unsigned long> >, std::optional<std::string>)#1}&&, pybind11::object (*)(std::basic_string_view<char, std::char_traits<char> >, OpenMode, std::shared_ptr<tiledbsoma::SOMAContext>, std::optional<std::pair<unsigned long, unsigned long> >, std::optional<std::string>), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::kw_only const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/pytiledbsoma.cpython-311-x86_64-linux-gnu.so
#17 0x00007fffc0c3e4fe in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /home/myuser/cellxgenetest/venv/lib/python3.11/site-packages/tiledbsoma/pytiledbsoma.cpython-311-x86_64-linux-gnu.so
#18 0x000000000055539b in ?? ()
#19 0x000000000052f46c in _PyObject_MakeTpCall ()
#20 0x000000000053d620 in _PyEval_EvalFrameDefault ()
#21 0x0000000000613974 in ?? ()
#22 0x0000000000612fd7 in PyEval_EvalCode ()
#23 0x0000000000633deb in ?? ()
#24 0x0000000000630044 in ?? ()
#25 0x00000000004f24b2 in ?? ()
#26 0x00000000004f25b4 in _PyRun_InteractiveLoopObject ()
#27 0x000000000046b0f0 in ?? ()
#28 0x00000000004f2209 in PyRun_AnyFileExFlags ()
#29 0x0000000000465074 in ?? ()
#30 0x00000000006042bd in Py_BytesMain ()
#31 0x00007ffff7c7cd90 in __libc_start_call_main (main=main@entry=0x604210, argc=argc@entry=1, argv=argv@entry=0x7fffffffcb58) at ../sysdeps/nptl/libc_start_call_main.h:58
#32 0x00007ffff7c7ce40 in __libc_start_main_impl (main=0x604210, argc=1, argv=0x7fffffffcb58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffcb48) at ../csu/libc-start.c:392
#33 0x0000000000604145 in _start ()
(gdb)

best, Matthias

rcurrie commented 2 months ago

I'm experiencing "Illegal instruction (core dumped)" as well. Interestingly I'm getting it on 2 of 3 machines all running from the same venv. Different locations for the faults. I've tried earlier and later versions of tiledbsoma with the same outcomes.

Works:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
    CPU family:          6
    Model:               63

Core Dump:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD Opteron(tm) Processor 6380
    CPU family:          21
    Model:               2

 (gdb...)
 [Thread 0x7ffed6ffd640 (LWP 896641) exited]

Thread 1 "python" received signal SIGILL, Illegal instruction.
0x00007fff48c26def in std::__detail::_Compiler<std::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syntax_option_type) () from

Core Dump:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         44 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E7- 4830  @ 2.13GHz
    CPU family:          6
    Model:               47

 (gdb...)
 [New Thread 0x7ffdf5ffb640 (LWP 2977224)]

Thread 1 "python" received signal SIGILL, Illegal instruction.
0x00007ffea8b7be32 in void std::vector<std::string, std::allocator<std::string> >::_M_realloc_insert<std::string const&>(__gnu_cxx::__normal_iterator<std::string*, std::vector<std::string, std::allocator<std::string> > >, std::string const&) () from /public/home/rcurrie/cellxgene/venv/lib/python3.10/site-packages/tiledbsoma/../tiledbsoma.libs/libtiledbsoma-4d9c0eb1.so

Package Details:

$ python -c "import tiledbsoma; tiledbsoma.show_package_versions()"
tiledbsoma.__version__        1.9.5
TileDB-Py tiledb.version()    (0, 27, 1)
TileDB core version           2.21.1
libtiledbsoma version()       libtiledb=2.21.1
python version                3.10.12.final.0
OS version                    Linux 5.15.0-78-generic

eddelbuettel commented 2 months ago

@rcurrie Thanks for posting cpu details! With these pages at Wikipedia (Intel, AMD) I see that these are in fact old:

Intel(R) Xeon(R) CPU E5-2697 v3 -- Sep 2014
Intel(R) Xeon(R) CPU E7- 4830 -- Apr 2011
AMD Opteron(tm) Processor 6380 -- Nov 2012

We have seen similar issues with old Xeons at CRAN, where the R builds also hit the 'illegal instruction'. So we split the pre-made artifacts into 'normal' and 'non-AVX2' builds. One can check what the CPUs support: e.g. on my (only ~five or six years old) computer, cat /proc/cpuinfo | grep avx2 | head -1 returns a match, whereas on an older Xeon 6226 it does not. That Xeon machine, for example, also has issues with Arrow builds when pre-made blobs are integrated to speed the build up.

All this may suggest that the Python wheels might be too new for the hardware used. A quick workaround may be to build from source.
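
The flag check can also be scripted. A minimal sketch, assuming a Linux system where /proc/cpuinfo contains a "flags" line; the function names `cpu_flags` and `has_avx2` are hypothetical:

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found


def has_avx2(cpuinfo_text: str) -> bool:
    """True if the CPU advertises AVX2."""
    return "avx2" in cpu_flags(cpuinfo_text)


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # Linux only; the file is absent on macOS and Windows
    print("AVX2 available:", has_avx2(cpuinfo.read_text()))
```

A missing avx2 flag does not prove that AVX2 is the instruction that triggers the crash; it only shows that the machine falls into the group of older CPUs described above.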

rcurrie commented 2 months ago

Makes sense - will try building from source and report back - thanks for the quick reply!

rcurrie commented 1 month ago

@eddelbuettel Building TileDB-SOMA from source worked! I'll add the order to the issue @johnkerl opened.