NSLS-II / Bug-Reports

Unified issue-tracker for bugs in the data acquisition, management, and analysis software at NSLS-II
BSD 3-Clause "New" or "Revised" License

Disk I/O error from BSUI #172

Closed cmazzoli closed 7 years ago

cmazzoli commented 7 years ago
[xf23id1@xf23id-ws3 ~/Users/2017_06_Comin]$ bsui
discarding /opt/conda/bin from PATH
prepending /opt/conda_envs/collection-17Q2.0/bin to PATH
[TerminalIPythonApp] ERROR | Failed to open SQLite history /epics/xf/23id/ipython/profile_collection/history.sqlite (disk I/O error).
[TerminalIPythonApp] ERROR | History file was moved to /epics/xf/23id/ipython/profile_collection/history-corrupt.sqlite and a new file created.
[TerminalIPythonApp] ERROR | Failed to open SQLite history /epics/xf/23id/ipython/profile_collection/history.sqlite (disk I/O error).
[TerminalIPythonApp] ERROR | History file was moved to /epics/xf/23id/ipython/profile_collection/history-corrupt.sqlite and a new file created.
[TerminalIPythonApp] ERROR | Failed to open SQLite history /epics/xf/23id/ipython/profile_collection/history.sqlite (disk I/O error).
[TerminalIPythonApp] ERROR | Failed to load history too many times, history will not be saved.
Activating auto-logging. Current session state plus future input saved.
Filename       : /epics/xf/23id/ophyd/logs/xf23id1_log.py
Mode           : append
Output logging : False
Raw input log  : False
Timestamping   : False
State          : active
Python 3.6.1 |Continuum Analytics, Inc.| (default, Mar 22 2017, 19:54:23) 
Type "copyright", "credits" or "license" for more information.

IPython 5.3.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

IPython profile: collection
Using matplotlib backend: Qt5Agg
/epics/xf/23id/ipython/profile_collection/startup/00-startup.py:21: UserWarning: stop using raw fs
  register_builtin_handlers(db.fs)
Storing metadata history in a new file at /home/xf23id1/.config/bluesky/bluesky_history.db.
[TerminalIPythonApp] WARNING | Unknown error in handling startup files:
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
/opt/conda_envs/collection-17Q2.0/lib/python3.6/site-packages/IPython/core/shellapp.py in _exec_file(self, fname, shell_futures)
    326                                                  self.shell.user_ns,
    327                                                  shell_futures=shell_futures,
--> 328                                                  raise_exceptions=True)
    329         finally:
    330             sys.argv = save_argv

/opt/conda_envs/collection-17Q2.0/lib/python3.6/site-packages/IPython/core/interactiveshell.py in safe_execfile(self, fname, *where, **kw)
   2479                 py3compat.execfile(
   2480                     fname, glob, loc,
-> 2481                     self.compile if kw['shell_futures'] else None)
   2482             except SystemExit as status:
   2483                 # If the call was made with 0 or None exit status (sys.exit(0)

/opt/conda_envs/collection-17Q2.0/lib/python3.6/site-packages/IPython/utils/py3compat.py in execfile(fname, glob, loc, compiler)
    184         with open(fname, 'rb') as f:
    185             compiler = compiler or compile
--> 186             exec(compiler(f.read(), fname, 'exec'), glob, loc)
    187 
    188     # Refactor print statements in doctests.

/epics/xf/23id/ipython/profile_collection/startup/00-startup.py in <module>()
     30 # Subscribe metadatastore to documents.
     31 # If this is removed, data is not saved to metadatastore.
---> 32 from bluesky.global_state import gs
     33 gs.RE.subscribe_lossless('all', db.insert)
     34 

/opt/conda_envs/collection-17Q2.0/lib/python3.6/site-packages/bluesky/global_state.py in <module>()
     14 
     15 
---> 16 class RunEngineTraitType(TraitType):
     17 
     18     info_text = 'a RunEngine instance'

/opt/conda_envs/collection-17Q2.0/lib/python3.6/site-packages/bluesky/global_state.py in RunEngineTraitType()
     17 
     18     info_text = 'a RunEngine instance'
---> 19     default_value = RunEngine(get_history())
     20 
     21     def validate(self, obj, value):

/opt/conda_envs/collection-17Q2.0/lib/python3.6/site-packages/bluesky/utils.py in get_history()
    670             os.makedirs(os.path.dirname(path), exist_ok=True)
    671             print("Storing metadata history in a new file at %s." % path)
--> 672             return historydict.HistoryDict(path)
    673         except IOError as exc:
    674             print(exc)

/opt/conda_envs/collection-17Q2.0/lib/python3.6/site-packages/historydict.py in __init__(self, fname)
     61             logger.debug("Created a fresh table in %s,",
     62                          fname)
---> 63             self._create_tables()
     64         else:
     65             logger.debug("Found an existing table in %s", fname)

/opt/conda_envs/collection-17Q2.0/lib/python3.6/site-packages/historydict.py in _create_tables(self)
    201         Create the required tables for data storage
    202         """
--> 203         self._conn.execute(CREATION_QUERY)
    204         self._put(self.RESERVED_KEY_KEY, [])
    205 

OperationalError: disk I/O error

In [1]: 
cmazzoli commented 7 years ago

After Dan and Matt's analysis showed that it was a SQLite problem, Matt solved it by unmounting and remounting the NFS share. For us, it was necessary to reboot the workstation. Thanks to you all.

cowanml commented 7 years ago

Debugged with Claudio and Dan at the beamline...

Eventually we narrowed it down to SQLite hanging on database inserts into the RunEngine history databases on nfshome:/home/xf23id-ws3 (the same operations worked fine on both adjacent workstations).
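
For reference, the failure can be reproduced outside bsui with a probe along these lines (a sketch only; the scratch path is hypothetical):

import sqlite3

conn = sqlite3.connect('/home/xf23id1/sqlite-probe.db')  # hypothetical scratch file on the suspect mount
try:
    conn.execute('CREATE TABLE IF NOT EXISTS probe (x INTEGER)')
    conn.execute('INSERT INTO probe VALUES (1)')
    conn.commit()  # hangs or raises "disk I/O error" when NFS file locking is broken
    print('write OK')
except sqlite3.OperationalError as exc:
    print('probe failed:', exc)
finally:
    conn.close()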

My theory is that the NFS client somehow got into a wonky state where it wasn't able to do file locking. Odd behavior in some other apps, and when trying to close apps and log out, sort of supports the theory.

Logged out of xf23id1, logged in as root (on a text console), killed all remaining processes using nfshome, restarted the nfs-common and autofs services, and all seems well... If we see this error again, a reboot is sadly the best fix. Andi said they've run into this before and a reboot fixed it.

Also have lots of this in /var/log/messages:

...
Jul 5 14:51:10 xf23id-ws3 kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff880eef4ad820!
Jul 5 14:51:10 xf23id-ws3 kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff880f1a303e20!
Jul 5 14:52:16 xf23id-ws3 kernel: NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff880f53775a20!
...

And digging through the logs I see it happened a few times on June 17.

I'll do a bit more research to try to find out if this is a known issue (with a known fix).

For those finding this through NSLS-II Trac issue #1260, note also the matching GitHub issue (https://github.com/nsls-ii/bug-reports/issues/172).

I think we could close the GitHub issue; this seems clearly to be more of a sysadmin issue.

-matt

cmazzoli commented 7 years ago

Apparently I lost all the information stored in the history... like scan numbers and so on. Could somebody help with this please?

cowanml commented 7 years ago

You could try restoring the backup .sqlite files? Looking at the file sizes will probably tell you whether any have useful contents.
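
A rough sketch of that kind of check; the two paths are just the ones that appear in the log above, and any other backup files would need to be added by hand:

import os
import sqlite3

candidates = [
    '/epics/xf/23id/ipython/profile_collection/history-corrupt.sqlite',
    '/home/xf23id1/.config/bluesky/bluesky_history.db',
]
for path in candidates:
    if not os.path.exists(path):
        continue
    print(path, os.path.getsize(path), 'bytes')
    conn = sqlite3.connect(path)
    try:
        # List the tables; an empty or unreadable file is probably not worth restoring.
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        print('  tables:', names)
    except sqlite3.OperationalError as exc:
        print('  unreadable:', exc)
    finally:
        conn.close()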


cmazzoli commented 7 years ago

Thanks. Hmm... I re-entered by hand the last scan_id we think we did on the beamline, but I was wondering whether anything else is missing that I haven't spotted yet. Any way of checking?
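
For reference, re-entering the scan number by hand amounts to something like the following sketch, assuming the usual gs global state imported by the startup scripts (1234 is only a placeholder value):

from bluesky.global_state import gs

gs.RE.md['scan_id'] = 1234  # replace 1234 with the last trusted scan number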

tacaswell commented 7 years ago

The transient scan_id is the only thing the RunEngine automatically puts in / updates. Everything else is user metadata and, IIRC, is set by the startup scripts.

One way to check is to look at what is in

db[-1].start

(or maybe a bit further back) to check for the scan_id and for the other meta-data.
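
A minimal sketch of that check, assuming the db databroker object created by the startup scripts and looking back over the last five runs (an arbitrary depth):

for i in range(1, 6):
    start = db[-i].start  # start document of the i-th most recent run
    print(start.get('scan_id'), sorted(start.keys()))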

cowanml commented 7 years ago

The current setup of the 'collection' IPython profile for 'xf23id1' depends on two different NFS servers (nfshome and nfsapp), bouncing between them three times via symlinks, with two autofs mount points involved for good measure!

I love the Tom and Jerry episodes where Tom builds Rube Goldberg mousetraps, but... https://www.google.com/search?q=tom+and+jerry+rube+goldberg+mousetrap&tbm=vid

If involving nfsapp is just a leftover from briefly having local homedirs, I strongly suggest simplifying as much as possible, something like:

[xf23id1@xf23id-ws3 ~]$ rm .ipython/profile_collection
[xf23id1@xf23id-ws3 ~]$ mv xf23id1_profiles/{.git,profile_collection} .ipython/

and cleaning up all the vestigial cruft.

-matt


[xf23id1@xf23id-ws3 ~]$ pwd
/home/xf23id1

[xf23id1@xf23id-ws3 ~]$ df /home
Filesystem 1K-blocks Used Available Use% Mounted on
auto.home          0    0         0    - /home

[xf23id1@xf23id-ws3 ~]$ df /home/xf23id1
Filesystem                  1K-blocks        Used  Available Use% Mounted on
10.0.137.54:/home/xf23id1 19314530304 15737658376 3576871928  82% /home/xf23id1

[xf23id1@xf23id-ws3 ~]$ host 10.0.137.54
54.137.0.10.in-addr.arpa domain name pointer nfshome.cs.nsls2.local.

[xf23id1@xf23id-ws3 ~]$ ls -ald .ipython
drwxr-xr-x 6 xf23id1 121 Oct 6 2015 .ipython
[xf23id1@xf23id-ws3 ~]$ ls -ald .ipython/*collection
lrwxrwxrwx 1 xf23id1 41 Sep 24 2015 .ipython/profile_collection -> /epics/xf/23id/ipython/profile_collection

[xf23id1@xf23id-ws3 ~]$ df /epics/
Filesystem 1K-blocks     Used Available Use% Mounted on
/dev/md2   384233532 20272936 344419556   6% /

[xf23id1@xf23id-ws3 ~]$ ls -ald /epics
drwxr-xr-x 3 root 4.0K Jan 22 2015 /epics
[xf23id1@xf23id-ws3 ~]$ ls -ald /epics/xf
drwxr-xr-x 2 root 4.0K Jan 22 2015 /epics/xf
[xf23id1@xf23id-ws3 ~]$ ls -ald /epics/xf/23id
lrwxrwxrwx 1 root 14 Jan 22 2015 /epics/xf/23id -> /direct/xf23id

[xf23id1@xf23id-ws3 ~]$ df /direct/
Filesystem  1K-blocks Used Available Use% Mounted on
auto.direct         0    0         0    - /direct

[xf23id1@xf23id-ws3 ~]$ ls -ald /direct/
drwxr-xr-x 3 root 0 Jul 5 15:18 /direct/
[xf23id1@xf23id-ws3 ~]$ ls -ald /direct/xf23id
drwxr-xr-x 9 root 120 Aug 8 2016 /direct/xf23id

[xf23id1@xf23id-ws3 ~]$ df /direct/xf23id
Filesystem            1K-blocks       Used  Available Use% Mounted on
10.0.137.52:/xf/23id 8781121536 7753373024 1027748512  89% /direct/xf23id

[xf23id1@xf23id-ws3 ~]$ host 10.0.137.52
52.137.0.10.in-addr.arpa domain name pointer nfsapp.cs.nsls2.local.

[xf23id1@xf23id-ws3 ~]$ ls -ald /direct/xf23id/ipython
drwxrwsrwx 29 xf23id1 4.0K Jul 5 14:53 /direct/xf23id/ipython
[xf23id1@xf23id-ws3 ~]$ ls -ald /direct/xf23id/ipython/profile_collection
lrwxrwxrwx 1 xf23id1 50 Mar 5 2016 /direct/xf23id/ipython/profile_collection -> /home/xf23id1/xf23id1_profiles/profile_collection/

[xf23id1@xf23id-ws3 ~]$ ls -ald /home/xf23id1
drwxr-xr-x 97 xf23id1 28K Jul 6 10:21 /home/xf23id1
[xf23id1@xf23id-ws3 ~]$ ls -ald /home/xf23id1/xf23id1_profiles
drwxrwxr-x 4 xf23id1 42 Mar 5 2016 /home/xf23id1/xf23id1_profiles
[xf23id1@xf23id-ws3 ~]$ ls -ald /home/xf23id1/xf23id1_profiles/profile_collection

almost expected some more symlinks here!

[xf23id1@xf23id-ws3 ~]$ find /home/xf23id1/xf23id1_profiles/profile_collection -xdev -type l
[xf23id1@xf23id-ws3 ~]$

phew!