emorice / galp

Incremental distributed python runner

Logserver crash #106

Open emorice opened 2 weeks ago

emorice commented 2 weeks ago
 Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/pool/__main__.py", line 14, in <module>
    main(vars(_parser.parse_args()))
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/_pool.py", line 333, in main
    forkserver(sock_server, sock_logserver, signal_read_fd, config)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/_pool.py", line 289, in forkserver
    leave = callback()
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/logserver.py", line 140, in <lambda>
    lambda filed=filed, tee_file=tee_file: on_stream_msg(
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/logserver.py", line 112, in on_stream_msg
    tee_file.write(item)
ValueError: I/O operation on closed file
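
For reference, a minimal defensive sketch of what a fix could look like. This reconstructs on_stream_msg from the frames above rather than from the actual galp/logserver.py, so names and return semantics are assumptions; the point is only to skip the tee write when the target file is already closed instead of letting the exception escape the forkserver loop.

    import os

    def on_stream_msg(filed: int, tee_file) -> bool:
        """Hypothetical reconstruction of the stream callback (not the real galp code)."""
        item = os.read(filed, 4096)
        if not item:
            return True  # end of stream; assumed to tell the caller to drop this fd
        # The tee target can be closed from under us (e.g. by the test harness);
        # dropping the tee output is better than crashing the pool.
        if not getattr(tee_file, 'closed', False):
            tee_file.write(item)
        return False
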
emorice commented 2 weeks ago

We just had another one, somewhat similar:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/pool/__main__.py", line 14, in <module>
    main(vars(_parser.parse_args()))
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/_pool.py", line 329, in main
    forkserver(sock_server, sock_logserver, signal_read_fd, config)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/_pool.py", line 285, in forkserver
    leave = callback()
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/logserver.py", line 140, in <lambda>
    lambda filed=filed, tee_file=tee_file: on_stream_msg(
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/galp/logserver.py", line 109, in on_stream_msg
    item = os.read(filed, 4096)
OSError: [Errno 9] Bad file descriptor
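
Same area, different failure mode: here the read end itself is already invalid. If this is a race between child teardown and the log polling loop, one option is to treat EBADF as "this descriptor is gone" and let the caller deregister it. Standalone sketch, not the actual galp code:

    import errno
    import os

    def read_stream_chunk(filed: int) -> bytes | None:
        """Read one chunk from a child pipe, treating a stale fd like end-of-stream."""
        try:
            return os.read(filed, 4096)
        except OSError as exc:
            if exc.errno == errno.EBADF:
                # The fd was closed elsewhere (child exit, test harness teardown);
                # report it as gone so the caller stops watching it.
                return None
            raise
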
emorice commented 2 weeks ago

I got another puzzling fd-linked issue:

[...]
galp/protocol.py:95: in on_message
      return load_routed(msg).then(
  galp/result.py:23: in then
      return function(self.value)
  galp/protocol.py:96: in <lambda>
      lambda routed: app_handler(
  galp/protocol.py:115: in _on_message
      return app_handler(sessions.reply_from(None), msg)
  galp/_client.py:62: in on_message
      self._script.done(msg.request, msg.value)
  galp/asyn.py:360: in done
      _prim.progress(result.status)
  galp/asyn.py:295: in progress
      callback(status)
  galp/commands.py:319: in <lambda>
      lambda status: _progress_submit(task_def, status, options)
  galp/commands.py:309: in _progress_submit
      options.printer.update_task_output(task_def, status)
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

  self = <galp.printer.PassTroughPrinter object at 0x7f89d0a5ce00>
  _task_def = CoreTaskDef(name=10c6c1d, scatter=None, step='tests.steps::identity', args=[TaskInput(op=$sub, name=325df7b)], kwargs={}, vtags=[], resources=Resources(cpus=1, vm='', cpu_list=()))
  status = b'1234\n'

      def update_task_output(self, _task_def: gtt.CoreTaskDef, status: bytes):
  >       os.write(sys.stdout.fileno(), status)
  E       io.UnsupportedOperation: fileno

  galp/printer.py:39: UnsupportedOperation

The common thread between all of these seems to be file-based communication with the outside world failing at random, but possibly only under GitHub CI or pytest, both of which may do some magic with the current process. So it may not be a real problem for us, but rather a test environment quirk.
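
The fileno failure at least is easy to reproduce outside galp: pytest's output capture replaces sys.stdout with a file-like object that has no underlying descriptor, so anything that insists on a raw fd breaks. A standalone illustration, plus the kind of capture-tolerant fallback the printer could use (write_status is a hypothetical helper, not galp's API):

    import contextlib
    import io
    import os
    import sys

    # Reproduce the error: a StringIO standing in for stdout, as pytest's capture
    # does, has no file descriptor, so fileno() raises io.UnsupportedOperation.
    with contextlib.redirect_stdout(io.StringIO()):
        try:
            sys.stdout.fileno()
        except io.UnsupportedOperation:
            pass  # same error as in galp/printer.py:39

    def write_status(status: bytes) -> None:
        """Sketch: write raw bytes to stdout, falling back when there is no fd."""
        try:
            os.write(sys.stdout.fileno(), status)
        except (io.UnsupportedOperation, ValueError, OSError):
            # No usable descriptor (pytest capture) or stream already closed:
            # fall back to the text layer instead of crashing the client.
            sys.stdout.write(status.decode(errors='replace'))
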