LiberTEM / LiberTEM

Open pixelated STEM framework
https://libertem.github.io/LiberTEM/
GNU General Public License v3.0

File type "auto" not working on remote executor #727

Closed uellue closed 4 years ago

uellue commented 4 years ago

Setup: Jupyter notebook on Windows, connecting to remote dask cluster on Linux.

After fixing remote opening of BLO files (PR pending), the following error remains when type "auto" is specified:

# Works fine with fix applied
blo_ds = ctx.load("BLO", path='/cachedata/users/weber/data/Glasgow/10 um 110.blo')

# Doesn't work
auto_ds = ctx.load("auto", path='/cachedata/users/weber/data/Glasgow/10 um 110.blo')

distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95\xfd\x05\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support\x94\x8c\x12unpickle_exception\x94\x93\x94(\x8c\x08builtins\x94\x8c\x07OSError\x94\x93\x94\x8c.Unable to open file (file signature not found)\x94\x85\x94Nh\x00\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\tf_globals\x94}\x94(\x8c\x08__name__\x94\x8c\x12distributed.worker\x94\x8c\x08__file__\x94\x8c]/cachedata/users/weber/libertem-uellue-venv/lib/python3.6/site-packages/distributed/worker.py\x94u\x8c\x06f_code\x94h\n\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94h\x14\x8c\x07co_name\x94\x8c\x0eapply_function\x94ububM\xe2\x0ch\n\x8c\tTraceback\x94\x93\x94)\x81\x94}\x94(\x8c\x08tb_frame\x94h\x0c)\x81\x94}\x94(h\x0f}\x94(\x8c\x08__name__\x94\x8c\x18libertem.io.dataset.hdf5\x94\x8c\x08__file__\x94\x8cJc:\\users\\weber\\documents\\libertem\\libertem\\src\\libertem\\io\\dataset\\hdf5.py\x94uh\x15h\x17)\x81\x94}\x94(h\x1a\x8cJc:\\users\\weber\\documents\\libertem\\libertem\\src\\libertem\\io\\dataset\\hdf5.py\x94h\x1b\x8c\n_do_detect\x94ubub\x8c\ttb_lineno\x94K\x98\x8c\x07tb_next\x94h\x1e)\x81\x94}\x94(h!h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11\x8c\x0eh5py._hl.files\x94h\x13\x8cY/cachedata/users/weber/libertem-uellue-venv/lib/python3.6/site-packages/h5py/_hl/files.py\x94uh\x15h\x17)\x81\x94}\x94(h\x1ah5h\x1b\x8c\x08__init__\x94ububh-M\x8a\x01h.h\x1e)\x81\x94}\x94(h!h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11h4h\x13h5uh\x15h\x17)\x81\x94}\x94(h\x1ah5h\x1b\x8c\x08make_fid\x94ububh-K\xaah.h\x1e)\x81\x94}\x94(h!h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11\x8c\rh5py._objects\x94h\x13\x8cu/cachedata/users/weber/libertem-uellue-venv/lib/python3.6/site-packages/h5py/_objects.cpython-36m-x86_64-linux-gnu.so\x94uh\x15h\x17)\x81\x94}\x94(h\x1a\x8c\x11h5py/_objects.pyx\x94h\x1b\x8c\x1fh5py._objects.with_phil.wrapper\x94ububh-K6h.h\x1e)\x81\x94}\x94(h!h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11hFh\x13hGuh\x15h\x
17)\x81\x94}\x94(h\x1a\x8c\x11h5py/_objects.pyx\x94h\x1b\x8c\x1fh5py._objects.with_phil.wrapper\x94ububh-K7h.h\x1e)\x81\x94}\x94(h!h\x0c)\x81\x94}\x94(h\x0f}\x94(h\x11\x8c\x08h5py.h5f\x94h\x13\x8cp/cachedata/users/weber/libertem-uellue-venv/lib/python3.6/site-packages/h5py/h5f.cpython-36m-x86_64-linux-gnu.so\x94uh\x15h\x17)\x81\x94}\x94(h\x1a\x8c\x0ch5py/h5f.pyx\x94h\x1b\x8c\rh5py.h5f.open\x94ububh-KUubububububub\x87\x94R\x94t\x94R\x94.'
Traceback (most recent call last):
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\pickle.py", line 59, in loads
    return pickle.loads(x)
AttributeError: Can't get attribute 'unpickle_exception' on <module 'tblib.pickling_support' from 'c:\\users\\weber\\.conda\\envs\\libertem36\\lib\\site-packages\\tblib\\pickling_support.py'>
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\core.py", line 124, in loads
    value = _deserialize(head, fs, deserializers=deserializers)
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\serialize.py", line 268, in deserialize
    return loads(header, frames)
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\serialize.py", line 62, in pickle_loads
    return pickle.loads(b"".join(frames))
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\pickle.py", line 59, in loads
    return pickle.loads(x)
AttributeError: Can't get attribute 'unpickle_exception' on <module 'tblib.pickling_support' from 'c:\\users\\weber\\.conda\\envs\\libertem36\\lib\\site-packages\\tblib\\pickling_support.py'>
distributed.utils - ERROR - Can't get attribute 'unpickle_exception' on <module 'tblib.pickling_support' from 'c:\\users\\weber\\.conda\\envs\\libertem36\\lib\\site-packages\\tblib\\pickling_support.py'>
Traceback (most recent call last):
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\utils.py", line 663, in log_errors
    yield
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\client.py", line 1150, in _handle_report
    msgs = await self.scheduler_comm.comm.read()
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\comm\tcp.py", line 208, in read
    frames, deserialize=self.deserialize, deserializers=deserializers
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\comm\utils.py", line 65, in from_frames
    res = _from_frames()
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\comm\utils.py", line 51, in _from_frames
    frames, deserialize=deserialize, deserializers=deserializers
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\core.py", line 124, in loads
    value = _deserialize(head, fs, deserializers=deserializers)
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\serialize.py", line 268, in deserialize
    return loads(header, frames)
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\serialize.py", line 62, in pickle_loads
    return pickle.loads(b"".join(frames))
  File "c:\users\weber\.conda\envs\libertem36\lib\site-packages\distributed\protocol\pickle.py", line 59, in loads
    return pickle.loads(x)
AttributeError: Can't get attribute 'unpickle_exception' on <module 'tblib.pickling_support' from 'c:\\users\\weber\\.conda\\envs\\libertem36\\lib\\site-packages\\tblib\\pickling_support.py'>

The message in the dask worker shell:

distributed.worker - WARNING -  Compute Failed
Function:  _do_detect
args:      ()
kwargs:    {}
Exception: OSError('Unable to open file (file signature not found)',)

uellue commented 4 years ago

As a side note, I found it hard to follow which code runs where when datasets are autodetected, with frequent jumps between code that runs on the executor and code that runs on the control node. Refs #518

uellue commented 4 years ago

The issue appears in the GUI as well. The detection jumps to "RAW" for all file types. Specifying the correct parameters (type etc.) explicitly works for EMPAD and BLO with the fixes from #728 applied.

sk1p commented 4 years ago

Can you have a look at tblib.pickling_support and see if either a) you have a different version installed on moellenstedt than on your local PC, or b) the module is different for Windows/Linux platforms?

uellue commented 4 years ago

I have the same version 1.6.0 of tblib on both systems. They both have the tblib.pickling_support.unpickle_exception attribute. Is there a way to get more diagnostics? I couldn't find out how.
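One way to get more diagnostics is to run the same small check on the client and on every worker and compare the results. This is a minimal sketch using only the standard library; `describe_attribute` is a hypothetical helper, not part of LiberTEM, tblib or distributed:

```python
import importlib
import platform
import sys


def describe_attribute(module_name, attr_name):
    """Report whether `module_name.attr_name` is available in this process."""
    info = {
        "platform": platform.system(),
        "python": sys.version.split()[0],
        "module_file": None,
        "has_attribute": False,
    }
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module is not installed here; leave the defaults in place.
        return info
    info["module_file"] = getattr(module, "__file__", None)
    info["has_attribute"] = hasattr(module, attr_name)
    return info


if __name__ == "__main__":
    # Local check for the attribute named in the traceback.
    print(describe_attribute("tblib.pickling_support", "unpickle_exception"))
```

Assuming a `distributed.Client` instance named `client`, something like `client.run(describe_attribute, "tblib.pickling_support", "unpickle_exception")` should return one such dictionary per worker, which can then be compared against the local result.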

The core of the issue seems to be that somewhere somehow a function that is run with run_function() on the executor triggers an uncaught OSError exception, which is then tripping up tblib, right? Perhaps the issue is that the OSError is somehow platform-dependent and can't be reconstructed properly?

sk1p commented 4 years ago

> Perhaps the issue is that the OSError is somehow platform-dependent and can't be reconstructed properly?

Hmm, possibly. I guess this could be fixed by using more explicit messaging, instead of relying on exception serialization.
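A rough sketch of what "more explicit messaging" could look like: the function that runs on the executor never lets an exception escape and instead returns a plain dictionary that serializes the same way on every platform. `detect_safely` and the dictionary keys are made up for illustration, not existing LiberTEM API:

```python
def detect_safely(detect, path):
    """Call `detect(path)` and report the outcome as plain built-in types."""
    try:
        params = detect(path)
    except Exception as exc:
        # Only strings cross the wire, so nothing platform-specific
        # has to be reconstructed on the receiving side.
        return {
            "ok": False,
            "error_type": type(exc).__name__,
            "message": str(exc),
        }
    return {"ok": True, "params": params}


def failing_detect(path):
    # Stand-in for a format detector that cannot open the file.
    raise OSError("Unable to open file (file signature not found)")


if __name__ == "__main__":
    print(detect_safely(failing_detect, "10 um 110.blo"))
```

The control node would then inspect `ok` and decide whether to try the next file type, without ever unpickling a remote exception.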

> As a side note, I found it hard to follow which code runs where when datasets are autodetected, with frequent jumps between code that runs on the executor and code that runs on the control node. Refs #518

Totally agree. I think this will become much cleaner when/if we decide to implement a different RPC mechanism. Most likely that will force us to implement a much cleaner RPC layer anyway (related to #199).

uellue commented 4 years ago

OSError is indeed platform-dependent: https://docs.python.org/3/library/exceptions.html#OSError

What about including a platform-independent wrapper exception for OSError and making sure we catch, re-package and reraise any OSError in functions that run on the executor? Can probably be done with a decorator?
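A minimal sketch of such a decorator, assuming a hypothetical `RemoteOSError` wrapper class (neither name exists in LiberTEM):

```python
import functools


class RemoteOSError(Exception):
    """Hypothetical platform-independent stand-in for OSError."""


def repackage_oserror(func):
    """Re-raise any OSError from `func` as a RemoteOSError."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except OSError as exc:
            # Keep only the class name and message as text.
            # `from None` suppresses the chained OSError in the
            # displayed traceback.
            raise RemoteOSError(f"{type(exc).__name__}: {exc}") from None
    return wrapper


@repackage_oserror
def open_missing_file():
    # Illustrative path that is assumed not to exist.
    with open("/nonexistent/path/for/illustration.bin", "rb") as f:
        return f.read()


if __name__ == "__main__":
    try:
        open_missing_file()
    except RemoteOSError as exc:
        print("caught:", exc)
```

The wrapper exception carries a single string argument, so it does not depend on platform-specific fields such as `winerror`.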

sk1p commented 4 years ago

Maybe a DataSetDetectFail exception subclassing DataSetException?

Instead of using a decorator, we could also have a wrapper method in the DataSet base class, which calls the underlying implementation and converts any exception into a DataSetDetectFail. That way we don't have to sprinkle decorators all over the place :grinning:
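The base-class variant could look roughly like this. The classes below are local stand-ins so that the example runs on its own; `safe_detect_params` is a hypothetical name, and the `detect_params(path)` signature is simplified for illustration:

```python
class DataSetException(Exception):
    """Local stand-in for the library's exception of the same name."""


class DataSetDetectFail(DataSetException):
    """The subclass proposed above; hypothetical in this sketch."""


class DataSet:
    @classmethod
    def detect_params(cls, path):
        """Format-specific detection; subclasses override this."""
        raise NotImplementedError

    @classmethod
    def safe_detect_params(cls, path):
        """Call detect_params and convert any failure to DataSetDetectFail."""
        try:
            return cls.detect_params(path)
        except Exception as exc:
            # Carry the original error over as text only.
            raise DataSetDetectFail(
                f"{cls.__name__}: {type(exc).__name__}: {exc}"
            ) from None


class FailingDataSet(DataSet):
    @classmethod
    def detect_params(cls, path):
        # Stand-in for a reader that does not recognize the file.
        raise OSError("Unable to open file (file signature not found)")


if __name__ == "__main__":
    try:
        FailingDataSet.safe_detect_params("10 um 110.blo")
    except DataSetDetectFail as exc:
        print("detection failed:", exc)
```

Callers only ever use the wrapper, so each dataset implementation can raise whatever it likes inside `detect_params`.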

uellue commented 4 years ago

Actually, it is sufficient to define functions that are part of a module instead of lambda or nested functions for all platform-dependent code that should run on a remote executor. I've documented that as a tip in #734.
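The difference between the two kinds of function can be illustrated with the standard `pickle` module: a function importable from a module is serialized as a reference (module plus name), while a nested function or lambda cannot be looked up that way. This is an illustration only; distributed serializes functions with cloudpickle, which can ship nested functions by value, and the exact failure mechanism is not spelled out in this thread. `can_pickle` and `make_nested` are hypothetical helpers:

```python
import pickle
from os.path import basename  # any function importable from a module


def make_nested():
    """Return a nested function, similar to a helper defined inline."""
    def nested(path):
        return basename(path)
    return nested


def can_pickle(func):
    """Return True if the standard pickle module can serialize `func`."""
    try:
        pickle.dumps(func)
    except (pickle.PicklingError, AttributeError):
        # The exception type for local objects differs between
        # Python versions, so both are caught.
        return False
    return True


if __name__ == "__main__":
    print("module-level:", can_pickle(basename))
    print("nested:", can_pickle(make_nested()))
    print("lambda:", can_pickle(lambda path: path))
```

A function pickled by reference is re-imported from the receiving side's own installation, so platform-specific details are resolved where the code actually runs.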