rbeucher opened this issue 1 year ago
Known issue I'm afraid. The root cause is this:
OSError: [Errno 5] Input/output error
There is something broken about how the Linux kernel handles the squashfs loopback device. As far as I can tell, it's some kind of unhandled cache miss. I've been trying on and off to figure out a workaround, but nothing has helped yet. The shared loop device setting in Singularity reduced the occurrence of this error, but it still shows up from time to time, as you've seen. I used to have a semi-consistent reproducer, but when NCI changed the shared loopback setting, the OSError rate dropped to about 1 in 1000 python launches. Insofar as I can tell, it only happens during python imports. I once managed to get into a Jupyter session that was failing to import a module for this reason. I did manage to clear it, but I'm not sure exactly what I did that fixed it. That was a few months ago, and I've not managed to get an interactive session with this problem since, in spite of using the squashfs envs pretty much every day. How often do you see this?
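For what it's worth, a crude way to estimate the failure rate is a loop like this (a sketch only: the module names are stdlib stand-ins, not the actual packages from the squashfs env):

```shell
# Hypothetical reproducer sketch: launch many short-lived python
# processes and count how often an import dies (e.g. with OSError).
# json/math/sqlite3 stand in for the real environment's imports.
fails=0
total=20
for i in $(seq 1 "$total"); do
    python3 -c "import json, math, sqlite3" 2>/dev/null || fails=$((fails+1))
done
echo "failures: $fails/$total"
```

On a healthy filesystem this should report zero failures; on an affected squashfs mount you'd expect the occasional non-zero count.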
Thanks.
I only use the environment via PBS and have never encountered that problem. It's a different story with ARE. @dougiesquire and @headmetal have had regular issues.
@dougiesquire and @headmetal any comments?
Over the last month or so, I'd say I'm getting this error (or variants of it that contain `OSError: [Errno 5] Input/output error`) in around 1 in 5 ARE Jupyter sessions, using the same conda environment as in @rbeucher's post above.
Admittedly, I haven't been running these sessions much over the last week or so, so I haven't had any recently, but I will track it over the next few weeks and report back.
Andy mentioned the issue too.
Interesting. That's way more prevalent than in the hh5 squashfs envs. Can I join xp65 and do some tests with your environment?
Please 🙏 😁
I've had this happen multiple times today so it should be fairly easy to reproduce. Even if it doesn't occur on start-up, the error will sometimes occur partway through a JupyterLab session when a cell is run.
@dougiesquire I've noticed the prevalence of this issue depends on the workflow. I think (but am not certain) that the more modules get imported, especially in parallel, e.g. during dask cluster startup, the more likely it is to show up. Would I be able to have a copy of whatever you were running yesterday when this kept happening? It doesn't need to be a minimal reproducer; in fact, the more complicated the better. Can you also let me know which version of `conda/access-med` and what size ARE instance you were using?
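To illustrate the kind of parallel-import stress I have in mind, here's a rough sketch (module names and worker counts are illustrative; the real dask worker startup imports far heavier packages from the squashfs env):

```python
# Sketch: mimic many workers starting at once by running python imports
# in parallel subprocesses. Stdlib modules stand in for scientific ones.
import subprocess
from concurrent.futures import ThreadPoolExecutor

MODULES = "json, math, sqlite3"  # illustrative stand-ins

def launch(_):
    # Each call is one short-lived interpreter doing its imports,
    # the pattern that seems to trigger the squashfs I/O errors.
    return subprocess.run(
        ["python3", "-c", f"import {MODULES}"],
        capture_output=True, text=True,
    ).returncode

with ThreadPoolExecutor(max_workers=8) as pool:
    codes = list(pool.map(launch, range(16)))

print("failures:", sum(c != 0 for c in codes), "of", len(codes))
```

Running something like this in a loop inside an affected ARE session might make the failure show up more reliably than a single notebook run.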
Sure - I was getting it fairly often when running this notebook: https://github.com/ACCESS-NRI/NRI-Workshop2023-MED/blob/main/Intake_tutorial_p1.ipynb. I was using `conda/are` with an X-Large instance on `normalbw`.
OK, I've made some progress on this. I managed to set up a stress test that would fail with this error about 80% of the time. `dmesg` kept complaining about the fragment cache, so I figured let's disable fragmentation in the squashfs, simplify its internal structure as much as possible, and see what happens. That dropped the failure rate to about 20%, but the failures became SIGBUS errors where data could not be read. I don't know if this is better or not, but I've updated the arguments to `mksquashfs` in https://github.com/coecms/cms-conda-singularity/commit/a17471948bd531461330f70d5e4b2c2f1a39641f and I'll see what effect that has over time. I'm wondering if there are mount options I could change that may have an effect as well. Perhaps I can switch from an overlay to a bind mount and that might help? I'm just flipping switches at this point; I really don't have any clue about the underlying mechanism that's causing this.
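For anyone following along, the general shape of the change is a command fragment like the one below (illustrative only; the exact arguments and paths are in the linked commit, and `/path/to/conda/env` is a placeholder):

```shell
# Illustrative mksquashfs invocation, not the exact commit arguments.
# -no-fragments disables the fragment table/cache that dmesg was
# complaining about; -noappend rebuilds the image from scratch.
mksquashfs /path/to/conda/env env.sqsh -no-fragments -noappend
```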
Thanks Dale - 20% sounds a lot better than 80% already
Hi @dsroberts
We have been encountering that issue in the last few weeks:
Here is the full log output: Looks like a problem with the file system. Any idea?
Thanks!