Intermittent Triton Kernel Compilation Failure in Fast-LLM Due to Stale File Handle (Errno 116) #45

tscholak opened this issue 6 days ago

🐞 Describe the Bug

The Fast-LLM training process intermittently fails during Triton kernel compilation with OSError: [Errno 116] Stale file handle. The error appears to be raised while Triton reads back intermediate files produced during compilation (triton/compiler/compiler.py). The problem is not consistently reproducible, but it has occurred twice in a single day.
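
For reference, errno 116 is ESTALE on Linux. It is typically raised when a file on a network filesystem (e.g. NFS) is replaced or deleted by another process or host while a handle to it is still open, so a Triton cache directory shared between ranks or nodes would be a plausible trigger. Purely as an illustration (this helper is hypothetical and not part of Fast-LLM or Triton), a defensive retry around the failing call could look like:

```python
import errno
import time


def retry_on_stale_handle(fn, retries=3, delay=1.0):
    """Retry fn() when it fails with ESTALE (errno 116); re-raise any other error."""
    for attempt in range(retries):
        try:
            return fn()
        except OSError as e:
            # errno.ESTALE == 116: the underlying (typically NFS) file handle went stale.
            if e.errno != errno.ESTALE or attempt == retries - 1:
                raise
            time.sleep(delay)
```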

🔄 Steps to Reproduce

The issue does not appear to be tied to specific training steps, commands, or configurations. It occurs sporadically during initialization, before training begins. The Fast-LLM Docker image used was ghcr.io/servicenow/fast-llm:sha-f4053af.

Here's the relevant log excerpt:

2024-11-15 23:34:53,145 [Rank 02] Traceback (most recent call last):
  File "/app/fast_llm/tools/cli.py", line 29, in fast_llm
    Runnable.parse_and_run(unparsed)
  File "/app/fast_llm/engine/config_utils/runnable.py", line 36, in parse_and_run
    runnable()
  File "/app/fast_llm/engine/training/config.py", line 373, in runnable
    trainer.run()
  File "/app/fast_llm/engine/training/trainer.py", line 141, in run
    self._run_training()
  File "/app/fast_llm/engine/training/trainer.py", line 144, in _run_training
    self._prepare_training_state()
  File "/app/fast_llm/engine/training/trainer.py", line 389, in _prepare_training_state
    self._multi_stage.initialize_weights()
  File "/app/fast_llm/engine/multi_stage/fast_llm_model.py", line 94, in initialize_weights
    self._finalize_load(reset_optimizer=True)
  File "/app/fast_llm/engine/multi_stage/fast_llm_model.py", line 98, in _finalize_load
    triton_fill(self._state_shard[1:], 0.0)
  File "/app/fast_llm/functional/triton/pointwise.py", line 75, in triton_fill
    triton_fill_kernel[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 180, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 401, in run
    self.cache[device][key] = compile(
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 283, in compile
    return CompiledKernel(src, metadata_group, hash)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 314, in __init__
    self.asm = {
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 315, in <dictcomp>
    file.suffix[1:]: file.read_bytes() if file.suffix[1:] == driver.active.binary_ext else file.read_text()
  File "/usr/lib/python3.10/pathlib.py", line 1135, in read_text
    return f.read()
OSError: [Errno 116] Stale file handle
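
For context on the failing call site: triton_fill in fast_llm/functional/triton/pointwise.py launches a JIT-compiled pointwise kernel, and the first launch for a given configuration triggers the compilation whose cached artifacts are being read when the error hits. A minimal sketch of such a fill kernel (illustrative only, not the actual Fast-LLM implementation) is:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def triton_fill_kernel(ptr, value, numel, BLOCK_SIZE: tl.constexpr):
    # Each program instance fills one BLOCK_SIZE-wide slice of the flattened tensor.
    offsets = tl.program_id(axis=0) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    tl.store(ptr + offsets, value, mask=offsets < numel)


def triton_fill(tensor: torch.Tensor, value: float, block_size: int = 1024) -> torch.Tensor:
    numel = tensor.numel()
    # The first launch JIT-compiles the kernel; that is where the stale-handle error surfaces.
    grid = (triton.cdiv(numel, block_size),)
    triton_fill_kernel[grid](tensor, value, numel, BLOCK_SIZE=block_size)
    return tensor
```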

🎯 Expected Behavior

The Triton kernel compilation should complete successfully, and the training process should proceed without errors.

📜 Environment Information

Since the issue occurs during initialization and seems unrelated to hardware or system configuration, this section has been omitted for brevity.

📝 Additional Context
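
If the Triton compilation cache does turn out to live on a shared filesystem, one possible mitigation (assuming Triton's TRITON_CACHE_DIR environment variable, which selects the cache directory) would be to keep the cache on node-local storage so that no other node can invalidate its file handles:

```python
# Hypothetical mitigation: place the Triton JIT cache on node-local storage.
# Must be set before the first kernel is compiled in the process.
import os

os.environ.setdefault("TRITON_CACHE_DIR", "/tmp/triton_cache")
```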