IsoNet-cryoET / IsoNet

Self-supervised learning for isotropic cryoET reconstruction
https://www.nature.com/articles/s41467-022-33957-8
MIT License

Illegal Instruction (core dumped) #20

Open wjnicol opened 2 years ago

wjnicol commented 2 years ago

Hello,

I installed IsoNet without problems and can run all the preparation steps, either with the GUI or from the command line.

When I try to start the refining step through the GUI, nothing happens. When I try through the command line, I get an "Illegal Instruction (core dumped)" error (screenshot attached: Screenshot from 2021-11-03 13-10-16). From googling the error, it seems to be a CPU issue.

NVIDIA GeForce GTX 1080 running with NVIDIA drivers 470.63.01
Intel Xeon CPU E5-2687W 3.10 GHz x 16
Ubuntu 20.04
Python 3.8.10
cuDNN v8.2.4 for CUDA 11.4
GCC 9.3.0
CUDA 11.4
TensorFlow 2.4.0
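For readers hitting the same error: one possible cause of "Illegal instruction" (not confirmed as the cause in this thread) is a binary compiled for CPU instructions the processor lacks. Official TensorFlow wheels have required AVX since version 1.6. The sketch below reads the flags Linux reports in /proc/cpuinfo; the function names and the list of flags are illustrative choices, not part of IsoNet.

```python
from pathlib import Path

# Flags worth checking. AVX is the one official TensorFlow wheels need;
# the others are listed for context only (an assumption of this sketch).
FLAGS_OF_INTEREST = ("sse4_1", "sse4_2", "avx", "avx2", "fma", "avx512f")


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def report(flags: set) -> dict:
    """Map each flag of interest to True/False depending on CPU support."""
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    for name, supported in report(parse_cpu_flags(text)).items():
        print(f"{name:8s} {'yes' if supported else 'NO'}")
```

If `avx` shows as supported, the crash more likely comes from somewhere else, such as mismatched library versions.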

Thank you for your help,

Best,

William J Nicolas

procyontao commented 2 years ago

Hi,

I do not have an exact solution to the core dump problem, but could you make your versions match those shown on the TensorFlow website? https://www.tensorflow.org/install/source#gpu

I recommend trying a Python virtual environment, such as Anaconda.
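As a rough illustration of what "matching versions" means, the sketch below compares a setup against three rows of the tested-configuration table on that page. The rows are transcribed from memory and should be verified against the page itself; the function names are hypothetical. A difference from the tested row does not prove a setup will fail, only that it is untested.

```python
# Tested GPU build configurations, transcribed from memory of
# https://www.tensorflow.org/install/source#gpu (verify before relying on it).
TESTED = {
    "2.4": {"python": ("3.6", "3.8"), "cuda": "11.0", "cudnn": "8.0"},
    "2.5": {"python": ("3.6", "3.9"), "cuda": "11.2", "cudnn": "8.1"},
    "2.6": {"python": ("3.6", "3.9"), "cuda": "11.2", "cudnn": "8.1"},
}


def _minor(version: str) -> tuple:
    """'3.8.10' -> (3, 8); only major and minor parts are compared."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def mismatches(tf_version: str, python: str, cuda: str, cudnn: str) -> list:
    """List the components that differ from the tested row for tf_version."""
    row = TESTED.get("{}.{}".format(*_minor(tf_version)))
    if row is None:
        return [f"no tested configuration recorded here for TensorFlow {tf_version}"]
    problems = []
    low, high = (_minor(v) for v in row["python"])
    if not low <= _minor(python) <= high:
        problems.append(f"Python {python} (tested: {row['python'][0]}-{row['python'][1]})")
    if _minor(cuda) != _minor(row["cuda"]):
        problems.append(f"CUDA {cuda} (tested: {row['cuda']})")
    if _minor(cudnn) != _minor(row["cudnn"]):
        problems.append(f"cuDNN {cudnn} (tested: {row['cudnn']})")
    return problems


if __name__ == "__main__":
    # Versions from the original report in this thread.
    for problem in mismatches("2.4.0", "3.8.10", "11.4", "v8.2.4"):
        print("differs from tested configuration:", problem)
```

With the versions from the original report, this flags both CUDA 11.4 and cuDNN 8.2.4 as outside the row tested for TensorFlow 2.4.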

wjnicol commented 2 years ago

Hello,

What exactly do you mean by creating a python virtual environment? Similar to how EMAN2 is installed?

I will look into the versions, but I cannot find a combination that fits my specs.

wjnicol commented 2 years ago

So I installed the most recent TensorFlow instead, 2.6.0, and I have made progress in the sense that I now get a bunch of error messages:

11-05 14:34:20, INFO
Isonet starts refining

11-05 14:34:21, ERROR
Traceback (most recent call last):
  File "/home/wjnicol/Repo/IsoNet/bin/refine.py", line 25, in run
    run_whole(args)
  File "/home/wjnicol/Repo/IsoNet/bin/refine.py", line 106, in run_whole
    from IsoNet.training.predict import predict
  File "/home/wjnicol/Repo/IsoNet/training/predict.py", line 4, in <module>
    from tensorflow.keras.models import load_model
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/api/_v2/keras/__init__.py", line 10, in <module>
    from keras import __version__
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/__init__.py", line 25, in <module>
    from keras import models
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/models.py", line 20, in <module>
    from keras import metrics as metrics_module
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/metrics.py", line 26, in <module>
    from keras import activations
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/activations.py", line 20, in <module>
    from keras.layers import advanced_activations
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/layers/__init__.py", line 23, in <module>
    from keras.engine.input_layer import Input
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/engine/input_layer.py", line 21, in <module>
    from keras.engine import base_layer
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/engine/base_layer.py", line 43, in <module>
    from keras.mixed_precision import loss_scale_optimizer
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/mixed_precision/loss_scale_optimizer.py", line 18, in <module>
    from keras import optimizers
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/optimizers.py", line 26, in <module>
    from keras.optimizer_v2 import adadelta as adadelta_v2
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/optimizer_v2/adadelta.py", line 22, in <module>
    from keras.optimizer_v2 import optimizer_v2
  File "/home/wjnicol/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 36, in <module>
    keras_optimizers_gauge = tf.__internal__.monitoring.BoolGauge(
  File "/home/wjnicol/.local/lib/python3.8/site-packages/tensorflow/python/eager/monitoring.py", line 360, in __init__
    super(BoolGauge, self).__init__('BoolGauge', _bool_gauge_methods,
  File "/home/wjnicol/.local/lib/python3.8/site-packages/tensorflow/python/eager/monitoring.py", line 135, in __init__
    self._metric = self._metric_methods[self._label_length].create(*args)
tensorflow.python.framework.errors_impl.AlreadyExistsError: Another metric with the same name already exists.

wjnicol commented 2 years ago

I think I did not install TensorFlow properly. I followed the instructions you provided (pip install tensorflow-gpu==2.6.0), but the page you linked for checking compatibility describes an installation with many more steps. Should I do a full installation of TensorFlow, or is the command you provided enough?
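A commonly reported trigger for the AlreadyExistsError above (not confirmed in this thread) is a mix of packages in one site-packages directory: both `tensorflow` and `tensorflow-gpu` installed at different versions, or a standalone `keras` whose minor version differs from TensorFlow's. The sketch below lists what is installed and flags such mixes. It assumes the convention, used from TensorFlow 2.6 onward, that `keras` and `tensorflow` share a major.minor version; the function names are hypothetical.

```python
from importlib import metadata

PACKAGES = ("tensorflow", "tensorflow-gpu", "keras")


def installed_versions(names=PACKAGES) -> dict:
    """Return {package: version or None} for the current Python environment."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def _major_minor(version: str) -> str:
    return ".".join(version.split(".")[:2])


def find_conflicts(versions: dict) -> list:
    """Flag duplicate TensorFlow distributions and keras/TensorFlow mismatches."""
    problems = []
    tf_versions = sorted(
        {versions.get(n) for n in ("tensorflow", "tensorflow-gpu")} - {None}
    )
    if len(tf_versions) > 1:
        problems.append(f"two TensorFlow versions installed: {tf_versions}")
    keras = versions.get("keras")
    for tf in tf_versions:
        if keras and _major_minor(keras) != _major_minor(tf):
            problems.append(f"keras {keras} does not match TensorFlow {tf}")
    return problems


if __name__ == "__main__":
    found = installed_versions()
    print(found)
    for problem in find_conflicts(found):
        print("possible conflict:", problem)
```

Running it inside the environment that raises the error shows whether leftover packages in ~/.local are involved. A fresh virtual environment avoids the question entirely.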

Thanks,

procyontao commented 2 years ago

Hi,

I am sorry that you have to deal with these problems. We have encountered many problems when versions do not match those shown on the website.

You can either download and install the packages from https://developer.nvidia.com/cuda-toolkit and https://developer.nvidia.com/cudnn,

or download and use Anaconda: https://www.anaconda.com/

Here are commands for my recent installation:

conda create --name tf2.5
conda activate tf2.5
conda install python=3.6
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1
pip install tensorflow==2.5
pip install fire mrcfile tqdm scipy scikit-image
export HDF5_USE_FILE_LOCKING=FALSE
export PATH=/home/lytao/software/IsoNet/bin:$PATH
export PYTHONPATH=/home/lytao/software:$PYTHONPATH
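Before launching IsoNet, it can help to confirm that the shell actually picked up these settings. The sketch below checks the environment variables the commands above are meant to set. `CONDA_DEFAULT_ENV` is the variable conda sets on activation; the function name and the default paths (copied from the commands above) are illustrative and should be replaced with your own install location.

```python
import os


def check_isonet_env(
    environ,
    expected_env="tf2.5",
    isonet_bin="/home/lytao/software/IsoNet/bin",  # example path from the post
    isonet_parent="/home/lytao/software",          # example path from the post
) -> dict:
    """Report whether each setting from the installation commands is in effect."""
    path_entries = environ.get("PATH", "").split(os.pathsep)
    pythonpath_entries = environ.get("PYTHONPATH", "").split(os.pathsep)
    return {
        "conda_env_active": environ.get("CONDA_DEFAULT_ENV") == expected_env,
        "hdf5_locking_disabled": environ.get("HDF5_USE_FILE_LOCKING") == "FALSE",
        "isonet_bin_on_path": isonet_bin in path_entries,
        "isonet_on_pythonpath": isonet_parent in pythonpath_entries,
    }


if __name__ == "__main__":
    for check, ok in check_isonet_env(os.environ).items():
        print(f"{check:24s} {'ok' if ok else 'MISSING'}")
```

The exports only last for the current shell session, so a check like this is most useful after opening a new terminal.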

Hope that would help.

wjnicol commented 2 years ago

OK, I am trying this right now. After this, should I launch isonet.py gui from the tf2.5 environment?

wjnicol commented 2 years ago

OK, so this works (I didn't do the last two exports because I had already done that earlier). I did, however, need to run pip install PyQt5 after your commands. From there, isonet.py gui works fine and refining works! However, it seems to use all 16 cores at 100%, and then it suddenly crashes my computer, which reboots. By crashing I mean a sudden black screen, and then it boots again. It's really weird. I tried it twice.

wjnicol commented 2 years ago

Additional information: I am trying this on 3 bin4 tomograms, ~1k each...

procyontao commented 2 years ago

Thank you for reporting this. There is a parameter that specifies how many CPUs to use in the preprocessing step.

procyontao commented 2 years ago

I suggest you start with the tutorial dataset to observe the behavior of the program.

wjnicol commented 2 years ago

Even when I use 8 threads, with the sample data or my own data, it does the same thing: the computer turns off. My CPU has 8 dual-threaded cores. Am I asking for too much even when I say 8 CPUs? I will try with one. Do you know of a log file in Linux that reports crashes and hardware issues? I'm wondering if your software is just too demanding for my computer.
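On the question of how many CPUs to ask for: a machine with 8 dual-threaded cores reports 16 logical CPUs, so 8 workers already occupy every physical core. A conservative heuristic (an assumption of this sketch, not an IsoNet recommendation) is to use the physical core count minus one, leaving a core for the desktop and the GPU driver.

```python
import os


def suggested_workers(logical_cpus: int, threads_per_core: int = 2,
                      reserve_cores: int = 1) -> int:
    """Physical cores minus a reserve, never less than one worker."""
    physical = max(1, logical_cpus // threads_per_core)
    return max(1, physical - reserve_cores)


if __name__ == "__main__":
    logical = os.cpu_count() or 1  # os.cpu_count() counts logical CPUs
    print(f"{logical} logical CPUs -> try {suggested_workers(logical)} workers")
```

For the 16 logical CPUs described here, this suggests 7 workers. It lowers the load but does not by itself explain a machine that powers off.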

wjnicol commented 2 years ago

I think it's making my system crash:

wjnicol@caliban:~$ last -x | head | tac
wjnicol :1 :1 Fri Nov 5 15:55 - crash (00:07)
reboot system boot 5.11.0-37-generi Fri Nov 5 16:02 still running
runlevel (to lvl 5) 5.11.0-37-generi Fri Nov 5 16:03 - 16:25 (00:21)
wjnicol :1 :1 Fri Nov 5 16:03 - crash (00:21)
reboot system boot 5.11.0-37-generi Fri Nov 5 16:24 still running
runlevel (to lvl 5) 5.11.0-37-generi Fri Nov 5 16:25 - 16:46 (00:21)
wjnicol :1 :1 Fri Nov 5 16:25 - crash (00:20)
reboot system boot 5.11.0-37-generi Fri Nov 5 16:46 still running
runlevel (to lvl 5) 5.11.0-37-generi Fri Nov 5 16:46 still running
wjnicol :1 :1 Fri Nov 5 16:46 still logged in
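In `last` output, a session ending in "crash" means no logout record was written before the next boot, which is what an abrupt power-off looks like. The sketch below pulls those sessions out of `last -x` text so they can be counted or matched against run times. The regular expression is written for the layout shown above and may need adjusting for other `last` versions; the function name is hypothetical.

```python
import re

# Matches lines such as: "wjnicol :1 :1 Fri Nov 5 15:55 - crash (00:07)"
CRASH_RE = re.compile(r"^(?P<user>\S+)\s.*-\s+crash\s+\((?P<duration>[\d+:]+)\)$")


def crashed_sessions(last_output: str) -> list:
    """Return (user, session duration) for every session that ended in a crash."""
    sessions = []
    for line in last_output.splitlines():
        match = CRASH_RE.match(line.strip())
        if match:
            sessions.append((match["user"], match["duration"]))
    return sessions


if __name__ == "__main__":
    sample = """\
wjnicol :1 :1 Fri Nov 5 15:55 - crash (00:07)
reboot system boot 5.11.0-37-generi Fri Nov 5 16:02 still running
wjnicol :1 :1 Fri Nov 5 16:03 - crash (00:21)
wjnicol :1 :1 Fri Nov 5 16:46 still logged in
"""
    for user, duration in crashed_sessions(sample):
        print(f"{user}: session lasted {duration} before the crash")
```

The durations give a rough idea of how long the machine ran under load before each reboot.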

wjnicol commented 2 years ago

Sorry for bombarding you with messages, but I will be away from my workstation for 2 weeks and am trying to give you as much info as possible.

From this page, https://unix.stackexchange.com/questions/9819/how-to-find-out-from-the-logs-what-caused-system-shutdown , I found a way to get logs on why my computer shuts down:

wjnicol@caliban:~$ grep -iv ': starting\|kernel: .*: Power Button\|watching system buttons\|Stopped Cleaning Up\|Started Crash recovery kernel' \
/var/log/messages /var/log/syslog /var/log/apcupsd* \
| grep -iw 'recover[a-z]*\|power[a-z]*\|shut[a-z ]*down\|rsyslogd\|ups'

grep: /var/log/messages: No such file or directory
/var/log/syslog:Nov 5 15:54:51 caliban apparmor.systemd[1012]: Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
/var/log/syslog:Nov 5 15:54:51 caliban systemd[1]: Finished Update UTMP about System Boot/Shutdown.
/var/log/syslog:Nov 5 15:54:51 caliban systemd[1]: Finished Restore /etc/resolv.conf if the system crashed before the ppp link was shut down.
/var/log/syslog:Nov 5 15:54:51 caliban rsyslogd: imuxsock: Acquired UNIX socket '/run/systemd/journal/syslog' (fd 3) from systemd. [v8.2001.0]
/var/log/syslog:Nov 5 15:54:51 caliban rsyslogd: rsyslogd's groupid changed to 110
/var/log/syslog:Nov 5 15:54:51 caliban rsyslogd: rsyslogd's userid changed to 104
/var/log/syslog:Nov 5 15:54:51 caliban rsyslogd: [origin software="rsyslogd" swVersion="8.2001.0" x-pid="1063" x-info="https://www.rsyslog.com"] start
/var/log/syslog:Nov 5 15:54:51 caliban kernel: [ 0.585685] pci 0000:05:00.1: D0 power state depends on 0000:05:00.0
/var/log/syslog:Nov 5 15:54:51 caliban kernel: [ 8.840032] EXT4-fs (nvme0n1): recovery complete
/var/log/syslog:Nov 5 15:54:51 caliban kernel: [ 10.038314] EXT4-fs (sdc): recovery complete
/var/log/syslog:Nov 5 15:54:51 caliban kernel: [ 11.837374] EXT4-fs (sdb1): recovery complete
/var/log/syslog:Nov 5 15:54:51 caliban dbus-daemon[1043]: dbus[1043]: Unknown group "power" in message bus configuration file
/var/log/syslog:Nov 5 15:54:51 caliban thermald[1075]: Need Linux PowerCap sysfs
/var/log/syslog:Nov 5 15:54:51 caliban NetworkManager[1044]: [1636152891.6834] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 10-dns-resolved.conf, 20-connectivity-ubuntu.conf, no-mac-addr-change.conf) (run: 10-globally-managed-devices.conf) (etc: default-wifi-powersave-on.conf)
/var/log/syslog:Nov 5 15:54:51 caliban systemd[1]: Started Unattended Upgrades Shutdown.
/var/log/syslog:Nov 5 15:54:55 caliban systemd[1]: Started Daemon for power management.
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) config/udev: Adding input device Power Button (/dev/input/event1)
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (**) Power Button: Applying InputClass "libinput keyboard catchall"
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) Using input driver 'libinput' for 'Power Button'
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (**) Power Button: always reports core events
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event1 - Power Button: is tagged by udev as: Keyboard
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event1 - Power Button: device is a keyboard
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event1 - Power Button: device removed
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) XINPUT: Adding extended input device "Power Button" (type: KEYBOARD, id 6)
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event1 - Power Button: is tagged by udev as: Keyboard
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event1 - Power Button: device is a keyboard
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) config/udev: Adding input device Power Button (/dev/input/event0)
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (**) Power Button: Applying InputClass "libinput keyboard catchall"
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) Using input driver 'libinput' for 'Power Button'
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (**) Power Button: always reports core events
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event0 - Power Button: is tagged by udev as: Keyboard
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event0 - Power Button: device is a keyboard
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event0 - Power Button: device removed
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) XINPUT: Adding extended input device "Power Button" (type: KEYBOARD, id 7)
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event0 - Power Button: is tagged by udev as: Keyboard
/var/log/syslog:Nov 5 15:54:56 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event0 - Power Button: device is a keyboard
/var/log/syslog:Nov 5 15:55:07 caliban kernel: [ 27.621060] systemd-journald[411]: File /var/log/journal/6af7e9060f66425b8aafcb55c60d336b/user-2011.journal corrupted or uncleanly shut down, renaming and replacing.
/var/log/syslog:Nov 5 15:55:07 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event1 - Power Button: device removed
/var/log/syslog:Nov 5 15:55:07 caliban /usr/lib/gdm3/gdm-x-session[1382]: (II) event0 - Power Button: device removed
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) config/udev: Adding input device Power Button (/dev/input/event1)
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (**) Power Button: Applying InputClass "libinput keyboard catchall"
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) Using input driver 'libinput' for 'Power Button'
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (**) Power Button: always reports core events
grep: /var/log/apcupsd*: No such file or directory
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event1 - Power Button: is tagged by udev as: Keyboard
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event1 - Power Button: device is a keyboard
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event1 - Power Button: device removed
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) XINPUT: Adding extended input device "Power Button" (type: KEYBOARD, id 6)
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event1 - Power Button: is tagged by udev as: Keyboard
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event1 - Power Button: device is a keyboard
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) config/udev: Adding input device Power Button (/dev/input/event0)
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (**) Power Button: Applying InputClass "libinput keyboard catchall"
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) Using input driver 'libinput' for 'Power Button'
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (**) Power Button: always reports core events
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event0 - Power Button: is tagged by udev as: Keyboard
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event0 - Power Button: device is a keyboard
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event0 - Power Button: device removed
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) XINPUT: Adding extended input device "Power Button" (type: KEYBOARD, id 7)
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event0 - Power Button: is tagged by udev as: Keyboard
/var/log/syslog:Nov 5 15:55:08 caliban /usr/lib/gdm3/gdm-x-session[1859]: (II) event0 - Power Button: device is a keyboard
/var/log/syslog:Nov 5 15:55:08 caliban systemd[1759]: gnome-session-pre.target: Requested dependency OnFailure=gnome-session-shutdown.target ignored (target units cannot fail).
/var/log/syslog:Nov 5 15:55:08 caliban systemd[1759]: gnome-session-initialized.target: Requested dependency OnFailure=gnome-session-shutdown.target ignored (target units cannot fail).
/var/log/syslog:Nov 5 15:55:08 caliban systemd[1759]: gnome-session-failed.target: Requested dependency OnFailure=gnome-session-shutdown.target ignored (target units cannot fail).
/var/log/syslog:Nov 5 15:55:10 caliban systemd[1759]: Started GNOME Power management handling.
/var/log/syslog:Nov 5 15:55:10 caliban systemd[1759]: Reached target GNOME Power management handling.

procyontao commented 2 years ago

At least for the tutorial dataset, we often use 20 CPUs and four 1080 Ti GPUs, and we have not observed any such error or crash. I think you can test with a much smaller dataset, e.g. 20 subtomos.

I do not know how to interpret those logs. I will let you know when I have some idea.

If you can, please let me know the commands you use to run IsoNet. If you are using the GUI, please click "print command".