🐞 Describe the Bug

The Fast-LLM training process intermittently fails during Triton kernel compilation with an `OSError: [Errno 116] Stale file handle`. The failure occurs when Triton reads back compiled-kernel artifacts from its cache during compilation (`triton/compiler/compiler.py`). The problem is not consistently reproducible, but it has occurred twice in one day.
🔄 Steps to Reproduce
The issue does not appear to be tied to specific training steps, commands, or configurations. It occurs sporadically during initialization, before training begins. The Fast-LLM Docker image used was `ghcr.io/servicenow/fast-llm:sha-f4053af`.
Here's the relevant log excerpt:
2024-11-15 23:34:53,145 [Rank 02] Traceback (most recent call last):
  File "/app/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/app/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/app/fast_llm/engine/training/config.py", line 373, in runnable
    trainer.run()
  File "/app/fast_llm/engine/training/trainer.py", line 141, in run
    self._run_training()
  File "/app/fast_llm/engine/training/trainer.py", line 144, in _run_training
    self._prepare_training_state()
  File "/app/fast_llm/engine/training/trainer.py", line 389, in _prepare_training_state
    self._multi_stage.initialize_weights()
  File "/app/fast_llm/engine/multi_stage/fast_llm_model.py", line 94, in initialize_weights
    self._finalize_load(reset_optimizer=True)
  File "/app/fast_llm/engine/multi_stage/fast_llm_model.py", line 98, in _finalize_load
    triton_fill(self._state_shard[1:], 0.0)
  File "/app/fast_llm/functional/triton/pointwise.py", line 75, in triton_fill
    triton_fill_kernel[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 180, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 401, in run
    self.cache[device][key] = compile(
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 283, in compile
    return CompiledKernel(src, metadata_group, hash)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 314, in __init__
    self.asm = {
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 315, in <dictcomp>
    file.suffix[1:]: file.read_bytes() if file.suffix[1:] == driver.active.binary_ext else file.read_text()
  File "/usr/lib/python3.10/pathlib.py", line 1135, in read_text
    return f.read()
OSError: [Errno 116] Stale file handle
🎯 Expected Behavior
The Triton kernel compilation should complete successfully, and the training process should proceed without errors.
📜 Environment Information
Since the issue occurs during initialization and seems unrelated to hardware or system configuration, this section has been omitted for brevity.
📝 Additional Context
Potential Cause: The error (`[Errno 116] Stale file handle`) points to an underlying file system issue: errno 116 is `ESTALE`, which on Linux is raised almost exclusively by network file systems when a file is replaced or unlinked on one host while another still holds a handle to it. Possible triggers:

- Shared file system synchronization issues (e.g., the Triton kernel cache living on NFS or a similar networked mount).
- Temporary file cleanup conflicts while multiple ranks compile and cache kernels in parallel (see the workaround sketch below).
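
If the Triton cache does sit on a shared mount, one possible mitigation is to redirect it to node-local storage before any kernels compile. This is only a sketch, not a verified fix: it relies on the `TRITON_CACHE_DIR` environment variable honored by Triton's file cache manager, and the `/tmp` path and per-rank suffix are placeholder choices, not anything Fast-LLM prescribes.

```python
import os

# Workaround sketch (assumption, not a verified fix): keep Triton's compile
# cache on node-local storage so kernel artifacts are never read through a
# shared (e.g. NFS) mount that can return ESTALE. The path is a placeholder
# for any local, writable directory; a per-rank suffix avoids concurrent
# cleanup of one shared cache directory by multiple ranks.
local_cache = f"/tmp/triton_cache_rank_{os.environ.get('RANK', '0')}"
os.makedirs(local_cache, exist_ok=True)
os.environ["TRITON_CACHE_DIR"] = local_cache

# ... then start Fast-LLM training as usual in this process.
```

Exporting the variable in the container entrypoint, before the training process starts, should have the same effect.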