djgroen / FabSim3

Python 3 version of FabSim
BSD 3-Clause "New" or "Revised" License
25 stars 13 forks source link

machines.py: KeyError in user_config[machine_name] #305

Open Thinkpiet opened 6 days ago

Thinkpiet commented 6 days ago

Hi,

with a fresh FabSim3 install, if I execute fabsim hsuper stat then I get this error:

Traceback (most recent call last):
  File "/home/piet/repos/FabSim3/fabsim/bin/fabsim", line 46, in <module>
    sys.exit(fabsim_main.main())
             ^^^^^^^^^^^^^^^^^^
  File "/home/piet/repos/FabSim3/fabsim/base/fabsim_main.py", line 156, in main
    load_machine(env.host)
  File "<@beartype(fabsim.deploy.machines.load_machine) at 0x7ad61790b920>", line 32, in load_machine
  File "/home/piet/repos/FabSim3/fabsim/deploy/machines.py", line 139, in load_machine
    env.modules.update(user_config[machine_name].get("modules", {}))
                       ~~~~~~~~~~~^^^^^^^^^^^^^^
KeyError: 'hsuper'

This makes sense because I did not configure the machine in my fabsim/deploy/machines_user.yml. However, since the machine is listed in fabsim/deploy/machines.yml and also is in the output of fabsim -l machines, I think that this exception should be caught and that there should be a reasonable error message, instead of FabSim just crashing here. For example, for fabsim doesnotexist stat I get this instead

ValueError: The requested remote machine "doesnotexist" did not listed in the machines.yml file, so it can not be used as a target remote host.

The available remote machines are : 
dict_keys(['localhost', 'cartesius', 'prometheus', 'cirrus', 'supermuc2', 'ohm', 'Kathleen', 'myriad', 'bluejoule', 'bluewonder2', 'oppenheimer', 'qcg', 'eagle_vecma', 'eagle_hidalgo', 'eagle_seavea', 'supermuc_vecma', 'training_hidalgo', 'archer2', 'marenostrum4', 'piz_daint', 'monsoonfab', 'hsuper'])

which is a lot better. Thus, I would expect a similar error message here, that asks me to update fabsim/deploy/machines_user.yml.

Best, Piet

Platform details (optional)

mzrghorbani commented 5 days ago

Hi team,

I’ve updated the load_machine function to handle cases where a machine is defined in machines.yml but lacks configuration in machines_user.yml. Below is the modified function with comments explaining each part of the code.

def load_machine(machine_name: str) -> None:
    """
    Load the machine-specific configurations.
    Completes additional paths and interpolates them, via
    `complete_environment`.
    """
    # Check if the machine exists in config before proceeding
    if machine_name not in config:
        raise ValueError(
            f'The requested remote machine "{machine_name}" is not listed in `machines.yml`, '
            "so it cannot be used as a target remote host.\n"
            f"Available remote machines are: {list(config.keys())}"
        )

    # If the machine exists in `config`, proceed with loading
    if "import" in config[machine_name]:
        # Config for this machine is based on another
        env.update(config[config[machine_name]["import"]])

        if config[machine_name]["import"] in user_config:
            env.update(user_config[config[machine_name]["import"]])

    env.update(config[machine_name])

    # Check if the machine exists in `user_config`
    if machine_name in user_config:
        env.update(user_config[machine_name])
    else:
        # Raise a descriptive error if it's missing in `machines_user.yml`
        raise ValueError(
            f'The machine "{machine_name}" is listed in `machines.yml` but is not configured in `machines_user.yml`. '
            "Please add it there to use it as a target remote host."
        )

    env.machine_name = machine_name

    # Construct modules environment: update, not replace when overrides are done.
    env.modules = config["default"]["modules"]
    if "import" in config[machine_name]:
        env.modules.update(config[config[machine_name]["import"]].modules)
    env.modules.update(config[machine_name].get("modules", {}))

    if "import" in config[machine_name]:
        if config[machine_name]["import"] in user_config:
            env.modules.update(
                user_config[config[machine_name]["import"]].modules
            )

    # Add modules from user_config if available
    if machine_name in user_config:
        env.modules.update(user_config[machine_name].get("modules", {}))

    complete_environment()

Summary of Changes Machine Existence Check in config: Before loading configurations, the function checks if the specified machine (machine_name) exists in config. If not, it raises a ValueError and lists available machines.

Error Message for Missing user_config Entry: If the machine exists in machines.yml but is missing in machines_user.yml, the function raises a ValueError instructing the user to add the machine configuration to machines_user.yml.

Modular Environment Construction: The function updates the env.modules settings without replacing overrides, ensuring any imported configurations are properly included.

I’ve tested these changes locally, and they effectively prevent crashes from missing machine configurations.

Would appreciate any feedback or additional testing!