coecms / cms-conda-singularity

A repository for the squashfs'd hh5 conda installations.
Apache License 2.0

ARE issue: init_fs_encoding: failed to get the Python codec #40

Open · rbeucher opened this issue 1 year ago

rbeucher commented 1 year ago

Hi @dsroberts

We have been encountering this issue over the last few weeks:

Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized

Here is the full log output. It looks like a problem with the file system. Any idea?

u=rwx,g=rwx,o=rx
Initialising new SSH agent...
succeeded
Script starting...
Waiting for Jupyter Notebook server to open port 52454...
TIMING - Starting wait at: Tue Aug  8 15:45:31 AEST 2023
TIMING - Starting main script at: Tue Aug  8 15:45:31 AEST 2023
WARNING: Symbolic version 'conda/access-med' already defined

Loading conda/access-med-0.3
  Loading requirement: singularity
/g/data/xp65/public/apps/med_conda_scripts/access-med-0.3.d/bin/jupyter
/g/data/xp65/public/apps/med_conda_scripts/access-med-0.3.d/bin/jupyter
TIMING - Starting jupyter at: Tue Aug  8 15:45:31 AEST 2023
+ ssh-agent -- jupyter lab --config=/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/config.py --FileContentsManager.root_dir=/home/563/rb5533/.jupyter-root
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = ':/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/lib/python3'
  program name = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  sys.base_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.base_exec_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.platlibdir = 'lib'
  sys.executable = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  sys.prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.exec_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.path = [
    '',
    '/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/lib/python3',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python310.zip',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python3.10',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python3.10/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1016, in get_code
  File "<frozen importlib._bootstrap_external>", line 1074, in get_data
OSError: [Errno 5] Input/output error
+ die 'ERROR: Jupyter Lab exited with non-zero code (1)'
+ MSG='ERROR: Jupyter Lab exited with non-zero code (1)'
+ '[' '!' -z 'ERROR: Jupyter Lab exited with non-zero code (1)' ']'
+ echo '<div class="error-msg">ERROR: Jupyter Lab exited with non-zero code (1)</div>'
+ kill -9 1269893
-/bin/bash: line 1: 1269893 Killed                  /local/spool/pbs/mom_priv/jobs/92187277.gadi-pbs.SC
++ date
+ echo 'TIMING - Finished jupyter at: Tue Aug  8 15:45:35 AEST 2023'
TIMING - Finished jupyter at: Tue Aug  8 15:45:35 AEST 2023

======================================================================================
                  Resource Usage on 2023-08-08 15:45:49:
   Job Id:             92187277.gadi-pbs
   Project:            iq82
   Exit Status:        137 (Linux Signal 9 SIGKILL Kill, unblockable)
   Service Units:      0.00
   NCPUs Requested:    2                      NCPUs Used: 2               
                                           CPU Time Used: 00:00:01        
   Memory Requested:   9.0GB                 Memory Used: 105.45MB        
   Walltime requested: 05:00:00            Walltime Used: 00:00:07        
   JobFS requested:    100.0MB                JobFS used: 0B              
======================================================================================

Thanks!

dsroberts commented 1 year ago

Known issue I'm afraid. The root cause is this:

OSError: [Errno 5] Input/output error

There is something broken about how the Linux kernel handles the squashfs loopback device. As far as I can tell, it's some kind of unhandled cache miss. I've been trying on and off to find a workaround, but nothing has helped yet. The shared loop device setting in Singularity reduced the occurrence of this error, but it still shows up from time to time, as you've seen. I used to have a semi-consistent reproducer, but NCI's change to the shared loopback setting reduced the OSError rate to about 1 in 1000 Python launches. As far as I can tell, it only happens during Python imports. I once managed to get into a Jupyter session that was failing to import some module for this reason, and I did manage to clear it, but I'm not sure exactly what I did that fixed it. That was a few months ago and I've not managed to get an interactive session with this problem since, in spite of using the squashfs envs pretty much every day. How often do you see this?
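
A crude way to gauge the failure rate is to launch Python in a loop inside the environment and count the import failures. This is a sketch only, not the exact reproducer; the environment module is the one from the log above and the imported package is a placeholder:

# Count OSError/import failures over many launches inside the squashfs env.
module load conda/access-med-0.3   # any squashfs-backed env will do
fails=0
for i in $(seq 1 1000); do
    python -c "import xarray" 2>/dev/null || fails=$((fails+1))
done
echo "import failures: ${fails}/1000"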

rbeucher commented 1 year ago

Thanks.

I only use the environment via PBS and have never encountered that problem. It's a different story with ARE. @dougiesquire and @headmetal have had regular issues.

@dougiesquire and @headmetal any comments?

headmetal commented 1 year ago

Over the last month or so, I'd say I'm getting this error (or variants of it that contain OSError: [Errno 5] Input/output error) in roughly 1 in 5 ARE Jupyter sessions, using the same conda environment as in @rbeucher's post above.

Admittedly, I haven't been running these sessions much over the last week or so, so I haven't had any recently, but I will track it over the next few weeks and report back.

rbeucher commented 1 year ago

Andy mentioned the issue too.

dsroberts commented 1 year ago

Interesting. That's way more prevalent than in the hh5 squashfs envs. Can I join xp65 and do some tests with your environment?

rbeucher commented 1 year ago

Please 🙏 😁

dougiesquire commented 1 year ago

I've had this happen multiple times today so it should be fairly easy to reproduce. Even if it doesn't occur on start-up, the error will sometimes occur partway through a JupyterLab session when a cell is run.

dsroberts commented 1 year ago

@dougiesquire I've noticed the prevalence of this issue depends on the workflow. I think (but am not certain) that the more modules get imported, especially in parallel, e.g. during dask cluster startup, the more likely it is to show up; see the sketch below. Would I be able to have a copy of whatever you were running yesterday when this kept happening? It doesn't need to be a minimal reproducer; in fact, the more complicated the better. Can you also let me know which version of conda/access-med and what size ARE instance you were using?
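
To illustrate the pattern: something like the following, which fires off several interpreters importing heavy packages at once much like dask workers spinning up, seems to be the kind of load most likely to trigger it. A sketch only; the module list and process count are arbitrary:

# Start several python processes importing heavy packages in parallel,
# roughly mimicking dask worker startup inside the squashfs env.
for i in $(seq 1 8); do
    python -c "import numpy, pandas, xarray, dask" &
done
wait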

dougiesquire commented 1 year ago

Sure - I was getting it fairly often when running this notebook: https://github.com/ACCESS-NRI/NRI-Workshop2023-MED/blob/main/Intake_tutorial_p1.ipynb

I was using conda/are with an X-Large instance on normalbw.

dsroberts commented 1 year ago

OK, I've made some progress on this. I managed to set up a stress test that would fail with this error about 80% of the time. dmesg kept complaining about the fragment cache, so I figured I'd disable fragmentation in the squashfs, simplify its internal structure as much as possible, and see what happens. That dropped the failure rate to about 20%, but the failures became SIGBUS errors where data could not be read. I don't know if that is better or not, but I've updated the arguments to mksquashfs in https://github.com/coecms/cms-conda-singularity/commit/a17471948bd531461330f70d5e4b2c2f1a39641f and I'll see what effect that has over time. I'm also wondering whether there are mount options I can change that may have an effect. Perhaps switching from an overlay to a bind mount might help? I'm just flipping switches at this point; I really don't have any clue about the underlying mechanism that's causing this.
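
For anyone following along, the change is along these lines. A sketch only, with placeholder paths; the exact arguments are the ones in the linked commit:

# Illustrative mksquashfs invocation; paths are placeholders.
# -no-fragments stops small files being packed into shared fragment blocks,
# which is what dmesg was complaining about; -no-duplicates disables block
# sharing between identical files, keeping the internal layout simpler at
# the cost of a larger image.
mksquashfs "${ENV_DIR}" "${SQUASHFS_FILE}" -comp zstd -no-fragments -no-duplicates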

headmetal commented 1 year ago

Thanks Dale - 20% sounds a lot better than 80% already