brianhie / scanorama

Panoramic stitching of single cell data
http://scanorama.csail.mit.edu
MIT License

Illegal instruction (core dumped) on HPC cluster #73

Closed syouligan closed 3 years ago

syouligan commented 4 years ago

Hi Brian

Thanks for your work. I am running into an 'Illegal instruction' error when running scanorama on a >100000 cell dataset on an HPC. I downgraded annoy to 1.11.5 and have tried playing with both sketch and batch_size to no avail. The same command runs fine on my local Mac using a 5% subset of the total dataset to be run on the HPC. Any info you might have would be really appreciated.

Thanks again

>>> import os
>>> import scanpy as sc
>>> import anndata as ad
>>> import scanorama

>>> scrm = scanorama.integrate_scanpy(adatas, knn=20)
Found 5907 genes among all datasets
Illegal instruction (core dumped)

HPC info - CentOS release 6.10 (Final)

Python info - Python 3.6.7, annoy 1.11.5, scanorama 1.6

brianhie commented 4 years ago

This is so strange -- it's most likely an annoy problem (it's living up to its name). Can you try a range of annoy versions and see if any of them work?

syouligan commented 4 years ago

Yeah, I'll take a look. It does seem that scanorama 1.6 has a dependency on annoy >= 1.11.5, correct?

brianhie commented 4 years ago

@syouligan that's a dependency that worked for me on my Ubuntu Linux box. Unfortunately it seems like perhaps different versions of annoy might fail on different machine architectures...

brianhie commented 3 years ago

@syouligan I hope this issue has been resolved, and please let me know if you were able to successfully run annoy and, if so, how. I'll close the issue since it doesn't appear to be an issue with Scanorama, but please feel free to reopen if it is.

ivirshup commented 3 years ago

Hey, I think we're seeing this happen occasionally over on the scanpy CI builds. It's happened a few times over the past couple days, but only on the python 3.6 builds. It has always worked when we restart the job.

Here's a link to the build logs for a failed build (though I think azure deletes these kind of quickly)

Here's the error reporting from that log:

```
2021-02-03T05:35:38.5458406Z scanpy/tests/external/test_scanorama_integrate.py Fatal Python error: Illegal instruction
2021-02-03T05:35:38.5459129Z
2021-02-03T05:35:38.5462592Z Thread 0x00007ff3519d9700 (most recent call first):
2021-02-03T05:35:38.5511855Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 295 in wait
2021-02-03T05:35:38.5519475Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/queue.py", line 164 in get
2021-02-03T05:35:38.5520244Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/multiprocessing/pool.py", line 463 in _handle_results
2021-02-03T05:35:38.5520966Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 864 in run
2021-02-03T05:35:38.5531630Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 916 in _bootstrap_inner
2021-02-03T05:35:38.5532576Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 884 in _bootstrap
2021-02-03T05:35:38.5533071Z
2021-02-03T05:35:38.5533632Z Thread 0x00007ff360df2700 (most recent call first):
2021-02-03T05:35:38.5534348Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 295 in wait
2021-02-03T05:35:38.5544722Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/queue.py", line 164 in get
2021-02-03T05:35:38.5545714Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/multiprocessing/pool.py", line 415 in _handle_tasks
2021-02-03T05:35:38.5546879Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 864 in run
2021-02-03T05:35:38.5547649Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 916 in _bootstrap_inner
2021-02-03T05:35:38.5558874Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 884 in _bootstrap
2021-02-03T05:35:38.5559459Z
2021-02-03T05:35:38.5560048Z Thread 0x00007ff3615f3700 (most recent call first):
2021-02-03T05:35:38.5561074Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/multiprocessing/pool.py", line 406 in _handle_workers
2021-02-03T05:35:38.5561789Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 864 in run
2021-02-03T05:35:38.5573293Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 916 in _bootstrap_inner
2021-02-03T05:35:38.5574140Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 884 in _bootstrap
2021-02-03T05:35:38.5574792Z
2021-02-03T05:35:38.5575323Z Thread 0x00007ff36927a700 (most recent call first):
2021-02-03T05:35:38.5576011Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 295 in wait
2021-02-03T05:35:38.5576732Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/queue.py", line 164 in get
2021-02-03T05:35:38.5587956Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/multiprocessing/pool.py", line 108 in worker
2021-02-03T05:35:38.5589293Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 864 in run
2021-02-03T05:35:38.5589844Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 916 in _bootstrap_inner
2021-02-03T05:35:38.5600985Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 884 in _bootstrap
2021-02-03T05:35:38.5601528Z
2021-02-03T05:35:38.5602057Z Thread 0x00007ff369a7b700 (most recent call first):
2021-02-03T05:35:38.5602718Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 295 in wait
2021-02-03T05:35:38.5603399Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/queue.py", line 164 in get
2021-02-03T05:35:38.5613523Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/multiprocessing/pool.py", line 108 in worker
2021-02-03T05:35:38.5614748Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 864 in run
2021-02-03T05:35:38.5615548Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 916 in _bootstrap_inner
2021-02-03T05:35:38.5616348Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 884 in _bootstrap
2021-02-03T05:35:38.5616849Z
2021-02-03T05:35:38.5617387Z Thread 0x00007ff38753f700 (most recent call first):
2021-02-03T05:35:38.5628235Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 299 in wait
2021-02-03T05:35:38.5629140Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 551 in wait
2021-02-03T05:35:38.5630702Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/tqdm/_monitor.py", line 60 in run
2021-02-03T05:35:38.5641159Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 916 in _bootstrap_inner
2021-02-03T05:35:38.5642097Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/threading.py", line 884 in _bootstrap
2021-02-03T05:35:38.5642575Z
2021-02-03T05:35:38.5643162Z Current thread 0x00007ff430183680 (most recent call first):
2021-02-03T05:35:38.5644334Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/scanorama/scanorama.py", line 522 in nn_approx
2021-02-03T05:35:38.5654845Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/scanorama/scanorama.py", line 590 in fill_table
2021-02-03T05:35:38.5656237Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/scanorama/scanorama.py", line 632 in find_alignments_table
2021-02-03T05:35:38.5657498Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/scanorama/scanorama.py", line 668 in find_alignments
2021-02-03T05:35:38.5669867Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/scanorama/scanorama.py", line 794 in assemble
2021-02-03T05:35:38.5671238Z File "/home/vsts/work/1/s/scanpy/external/pp/_scanorama_integrate.py", line 128 in scanorama_integrate
2021-02-03T05:35:38.5672086Z File "/home/vsts/work/1/s/scanpy/tests/external/test_scanorama_integrate.py", line 16 in test_scanorama_integrate
2021-02-03T05:35:38.5684357Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
2021-02-03T05:35:38.5686031Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
2021-02-03T05:35:38.5698110Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
2021-02-03T05:35:38.5699445Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
2021-02-03T05:35:38.5700768Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
2021-02-03T05:35:38.5712997Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/python.py", line 1641 in runtest
2021-02-03T05:35:38.5714294Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
2021-02-03T05:35:38.5715516Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
2021-02-03T05:35:38.5725051Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
2021-02-03T05:35:38.5726319Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
2021-02-03T05:35:38.5727506Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
2021-02-03T05:35:38.5776591Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/runner.py", line 255 in
2021-02-03T05:35:38.5785160Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/runner.py", line 311 in from_call
2021-02-03T05:35:38.5786937Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
2021-02-03T05:35:38.5795394Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/runner.py", line 215 in call_and_report
2021-02-03T05:35:38.5796840Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/runner.py", line 126 in runtestprotocol
2021-02-03T05:35:38.5797977Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
2021-02-03T05:35:38.5805864Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
2021-02-03T05:35:38.5807568Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
2021-02-03T05:35:38.5808972Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
2021-02-03T05:35:38.5816495Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
2021-02-03T05:35:38.5818245Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
2021-02-03T05:35:38.5825772Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
2021-02-03T05:35:38.5827031Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
2021-02-03T05:35:38.5828245Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
2021-02-03T05:35:38.5836746Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
2021-02-03T05:35:38.5838017Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/main.py", line 323 in _main
2021-02-03T05:35:38.5839345Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/main.py", line 269 in wrap_session
2021-02-03T05:35:38.5849790Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
2021-02-03T05:35:38.5851414Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
2021-02-03T05:35:38.5852879Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
2021-02-03T05:35:38.5864153Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
2021-02-03T05:35:38.5865621Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
2021-02-03T05:35:38.5866852Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/config/__init__.py", line 163 in main
2021-02-03T05:35:38.5878209Z File "/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/_pytest/config/__init__.py", line 185 in console_main
2021-02-03T05:35:38.5879131Z File "/opt/hostedtoolcache/Python/3.6.12/x64/bin/pytest", line 8 in
2021-02-03T05:38:07.8309062Z /home/vsts/work/_temp/c1e24372-4126-4b27-b7c8-991137375783.sh: line 1: 2971 Illegal instruction (core dumped) pytest --color=yes --ignore=scanpy/tests/_images --nunit-xml="nunit/test-results.xml"
2021-02-03T05:38:07.8400889Z ##[error]Bash exited with code '132'.
2021-02-03T05:38:07.8672087Z ##[section]Finishing: PyTest
```

And here is a gist of the full log if you want to check installation versions.
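
A side note on reading that log: the final `Bash exited with code '132'` is consistent with the `Illegal instruction` message. Shells conventionally report a process killed by signal N as exit status 128 + N, and 132 - 128 = 4, which is SIGILL on Linux. A minimal sketch of that arithmetic follows; the helper name `signal_from_exit_status` is made up for illustration and is not part of any of the tools involved.

```python
import signal


def signal_from_exit_status(status: int) -> str:
    """Map a shell exit status above 128 to the name of the terminating signal."""
    if status <= 128:
        raise ValueError("exit statuses up to 128 do not encode a signal")
    # Shell convention: status = 128 + signal number.
    return signal.Signals(status - 128).name


if __name__ == "__main__":
    # 132 is the exit code reported at the end of the CI log above.
    print(signal_from_exit_status(132))
```

The signal numbering is platform-specific, so the name returned reflects the machine the helper runs on.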

brianhie commented 3 years ago

@ivirshup thanks for letting me know. This call stack error message is great -- usually a segfault/illegal instruction just terminates Python without a helpful error message. It looks like it's failing in a call to the annoy tree data structure: https://github.com/brianhie/scanorama/blob/763e581d1efa61f60a0b41a50b51c38cdd008269/scanorama/scanorama.py#L522

Scanorama is written in pure Python so any invalid memory accesses or problems at the architecture level are most likely a problem in an underlying piece of code, in this case annoy. I think it's been a problem in the past as well. If you're noticing a flapping test, then maybe annoy gets confused about the architecture configuration on different calls. It looks like annoy has problems on different architectures:

Because the error occurs in a library call, it may be hard for me to debug without first consulting the annoy code base, which I am not familiar with. What I can do is use a different library to do approximate nearest neighbors, but this will most likely change the output of Scanorama, which I am inclined to try to avoid for now.

Have you noticed problems with annoy in the past? If the flapping test is too much of an issue, I'd also be fine with removing the Scanorama dependency within Scanpy as well and just keeping them separate repos.

ivirshup commented 3 years ago

> What I can do is use a different library to do approximate nearest neighbors, but this will most likely change the output of Scanorama, which I am inclined to try to avoid for now.

This was a potential change I was gonna suggest, but it could change results. That said, IIRC, you're not using the specifics of annoy much, just that it does approximate graph construction and search? There is a sklearn transformer API for this (package with implementation, though I recall reading about this going into sklearn itself) which could make it easier to be flexible about the backend for this if this dependency gives you trouble.
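
To illustrate what being flexible about the backend could look like, here is a minimal sketch in plain Python. It is neither the sklearn transformer API nor Scanorama's code: the names `exact_backend`, `annoy_backend`, `BACKENDS`, and `nearest_neighbors` are hypothetical, and the Euclidean metric is chosen only for illustration. The annoy backend is imported lazily, so the exact fallback works even where annoy is missing or broken.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
# A backend maps (reference points, query points, k) to, for each query,
# the indices of its k nearest reference points.
Backend = Callable[[Sequence[Vector], Sequence[Vector], int], List[List[int]]]


def exact_backend(ref: Sequence[Vector], query: Sequence[Vector], k: int) -> List[List[int]]:
    """Brute-force exact search: slow, but no compiled dependency."""
    result = []
    for q in query:
        order = sorted(range(len(ref)), key=lambda i: math.dist(q, ref[i]))
        result.append(order[:k])
    return result


def annoy_backend(ref: Sequence[Vector], query: Sequence[Vector], k: int) -> List[List[int]]:
    """Approximate search with annoy (assumes annoy is installed and working)."""
    from annoy import AnnoyIndex  # lazy import: only needed for this backend

    index = AnnoyIndex(len(ref[0]), "euclidean")
    for i, v in enumerate(ref):
        index.add_item(i, list(v))
    index.build(10)  # 10 trees, an arbitrary illustrative value
    return [index.get_nns_by_vector(list(q), k) for q in query]


# Hypothetical registry: the caller picks a backend by name.
BACKENDS: Dict[str, Backend] = {"exact": exact_backend, "annoy": annoy_backend}


def nearest_neighbors(ref, query, k, backend="exact"):
    """Dispatch to the chosen backend; every backend returns the same shape."""
    return BACKENDS[backend](ref, query, k)


if __name__ == "__main__":
    ref = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    print(nearest_neighbors(ref, [(0.9, 0.0)], k=2))
```

Because every backend returns the same structure, the calling code would not need to change when the backend does, though approximate and exact backends can of course return different neighbors.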

> If the flapping test is too much of an issue, I'd also be fine with removing the Scanorama dependency within Scanpy as well and just keeping them separate repos.

This just started a few days ago, and may go away soon. It's just annoying to have a PR build pass, then have the badge go red cause it fails on master 😞.

If it keeps happening, I'll try a few things (one is dropping python 3.6 support a la numpy). But that it's crashing the process is a bad problem, since we can't just xfail the test.
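
One way around the xfail limitation would be to run the crash-prone call in a child process, so that a SIGILL shows up as an ordinary result in the parent instead of killing the test runner. A minimal sketch, assuming a POSIX system; `run_isolated` and the self-signalling snippet are hypothetical stand-ins, not scanpy or pytest code.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> str:
    """Run `code` in a fresh interpreter and classify how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        return "killed by " + signal.Signals(-proc.returncode).name
    return f"exited with status {proc.returncode}"


# Stand-in for a crash-prone library call: the child sends SIGILL to itself.
CRASHING_SNIPPET = "import os, signal; os.kill(os.getpid(), signal.SIGILL)"

if __name__ == "__main__":
    print(run_isolated("print('fine')"))
    print(run_isolated(CRASHING_SNIPPET))
```

A test could then assert on the returned string, or skip when the child was killed by a signal, at the cost of starting an extra interpreter per call.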

brianhie commented 3 years ago

Great! So on my end, maybe I can just test a new approx NN backend, then just say in the README or something that annoy can be used to reproduce previous results but we recommend the less annoying backend that is more robust to different computer architectures.

It would be nice to have Python 3.6 support but do keep me posted on if the tests keep failing or if they get better, or worse!

brianhie commented 3 years ago

@ivirshup one quick fix to make the test pass, in case you need a very quick patch, is to just pass in approx=False as a parameter to scanorama_integrate() in the test: https://github.com/theislab/scanpy/blob/master/scanpy/tests/external/test_scanorama_integrate.py#L16. Scanorama will then do nearest neighbors search with sklearn, which is still pretty fast.
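
For reference, a hedged sketch of what that call might look like through scanpy's wrapper. It assumes `adata` already has a PCA embedding in `adata.obsm['X_pca']` and a batch column in `adata.obs`; as far as I recall, the wrapper also expects the cells of each batch to be contiguous. The helper name `integrate_exact` and the default column name `'batch'` are illustrative assumptions.

```python
def integrate_exact(adata, key="batch"):
    """Run Scanorama through scanpy with exact (non-annoy) neighbor search."""
    # Imported lazily so that defining this helper does not require scanpy.
    import scanpy.external as sce

    # approx=False asks Scanorama to skip the annoy-based approximate search.
    sce.pp.scanorama_integrate(adata, key, approx=False)
    # By default the wrapper stores the corrected embedding under this key.
    return adata.obsm["X_scanorama"]
```

The same `approx=False` argument is what the test would pass directly to `scanorama_integrate()`.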

ivirshup commented 3 years ago

Thanks for the suggestion @brianhie! I've added that.