MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License

[next branch] Issue using the Slurm executor #731

Closed. hadim closed this issue 1 year ago

hadim commented 1 year ago

The 'local' executor works perfectly, and I am now trying to set up the SLURM one.

The config is quite standard:

# How and where to detect the QM software.
environments:
  use_manager_environment: true
  conda: []
  apptainer: []

executors:
  slurm:
    type: slurm

    # Common to all executors.
    queue_tags: ["demo_hadrien"]
    # worker_init: ["micromamba activate openfractal"]
    worker_init: []
    scratch_directory: null
    bind_address: 127.0.0.1
    cores_per_worker: 16
    memory_per_worker: 16 # GB
    extra_executor_options: {}

    # Specific options for the slurm executor.
    walltime: 100:00:00
    exclusive: false
    partition: null
    account: null
    workers_per_node: 1
    max_nodes: 1
    scheduler_options: []

When there is no job to execute, qcfractal-compute-manager regularly logs the same error every few seconds:

[2023-06-24 09:00:22 EDT]     INFO: parsl.dataflow.dflow: Parsl version: 2023.06.19
[2023-06-24 09:00:22 EDT]     INFO: parsl.dataflow.dflow: Run id is: 4a8b2ec6-eea4-471f-9698-b6bb25035c1c
[2023-06-24 09:00:23 EDT]     INFO: parsl.dataflow.memoization: App caching initialized
[2023-06-24 09:00:23 EDT]     INFO: parsl.executors.status_handling: Scaling out by 1 blocks
[2023-06-24 09:00:23 EDT]     INFO: parsl.executors.status_handling: Allocated block ID 0
[2023-06-24 09:00:23 EDT]     INFO: ComputeManager: Compute Manager successfully started.
[2023-06-24 09:00:23 EDT]     INFO: ComputeManager: Task Stats: Total finished=0, Failed=0, Success=0, Rejected=0
[2023-06-24 09:00:23 EDT]     INFO: ComputeManager: Worker Stats (est.): Core Hours Used=0.00
[2023-06-24 09:00:23 EDT]     INFO: ComputeManager: Executor slurm has 0 active tasks and 3 open slots
[2023-06-24 09:00:23 EDT]     INFO: ComputeManager: Acquired 0 new tasks.
[2023-06-24 09:00:24 EDT]    ERROR: parsl.executors.status_handling: Setting bad state due to exception
Exception: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:00:29 EDT]    ERROR: parsl.executors.status_handling: Setting bad state due to exception
Exception: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:00:34 EDT]    ERROR: parsl.executors.status_handling: Setting bad state due to exception
Exception: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:00:39 EDT]    ERROR: parsl.executors.status_handling: Setting bad state due to exception
Exception: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:00:44 EDT]    ERROR: parsl.executors.status_handling: Setting bad state due to exception
Exception: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:00:49 EDT]    ERROR: parsl.executors.status_handling: Setting bad state due to exception
Exception: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:00:53 EDT]     INFO: ComputeManager: Task Stats: Total finished=0, Failed=0, Success=0, Rejected=0
[2023-06-24 09:00:53 EDT]     INFO: ComputeManager: Worker Stats (est.): Core Hours Used=0.00
[2023-06-24 09:00:53 EDT]     INFO: ComputeManager: Executor slurm has 0 active tasks and 3 open slots
[2023-06-24 09:00:53 EDT]     INFO: ComputeManager: Acquired 0 new tasks.
[2023-06-24 09:00:54 EDT]    ERROR: parsl.executors.status_handling: Setting bad state due to exception
Exception: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

Now if the manager has tasks to execute, I still see the same errors, but also new ones:

[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.dflow: Parsl version: 2023.06.19
[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.dflow: Run id is: 357dcb6b-120b-4711-9bd6-9d235148f602
[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.memoization: App caching initialized
[2023-06-24 09:01:44 EDT]     INFO: parsl.executors.status_handling: Scaling out by 1 blocks
[2023-06-24 09:01:44 EDT]     INFO: parsl.executors.status_handling: Allocated block ID 0
[2023-06-24 09:01:44 EDT]     INFO: ComputeManager: Compute Manager successfully started.
[2023-06-24 09:01:44 EDT]     INFO: ComputeManager: Task Stats: Total finished=0, Failed=0, Success=0, Rejected=0
[2023-06-24 09:01:44 EDT]     INFO: ComputeManager: Worker Stats (est.): Core Hours Used=0.00
[2023-06-24 09:01:44 EDT]     INFO: ComputeManager: Executor slurm has 0 active tasks and 3 open slots
[2023-06-24 09:01:44 EDT]     INFO: ComputeManager: Acquired 3 new tasks.
[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.dflow: Task 0 submitted for App wrapper, not waiting on any dependency
[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.dflow: Parsl task 0 try 0 launched on executor slurm with executor id 1
[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.dflow: Task 1 submitted for App wrapper, not waiting on any dependency
[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.dflow: Parsl task 1 try 0 launched on executor slurm with executor id 2
[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.dflow: Task 2 submitted for App wrapper, not waiting on any dependency
[2023-06-24 09:01:44 EDT]     INFO: parsl.dataflow.dflow: Parsl task 2 try 0 launched on executor slurm with executor id 3
[2023-06-24 09:01:45 EDT]    ERROR: parsl.executors.status_handling: Setting bad state due to exception
Exception: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:01:45 EDT]    ERROR: parsl.dataflow.dflow: Task 0 failed after 0 retry attempts
Traceback (most recent call last):
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/site-packages/parsl/dataflow/dflow.py", line 300, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/site-packages/parsl/dataflow/dflow.py", line 564, in _unwrap_remote_exception_wrapper
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
parsl.executors.errors.BadStateException: Executor slurm failed due to: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:01:45 EDT]    ERROR: parsl.dataflow.dflow: Task 1 failed after 0 retry attempts
Traceback (most recent call last):
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/site-packages/parsl/dataflow/dflow.py", line 300, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/site-packages/parsl/dataflow/dflow.py", line 564, in _unwrap_remote_exception_wrapper
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
parsl.executors.errors.BadStateException: Executor slurm failed due to: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:01:45 EDT]    ERROR: parsl.dataflow.dflow: Task 2 failed after 0 retry attempts
Traceback (most recent call last):
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/site-packages/parsl/dataflow/dflow.py", line 300, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/site-packages/parsl/dataflow/dflow.py", line 564, in _unwrap_remote_exception_wrapper
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/hadrien/local/micromamba/envs/openfractal/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
parsl.executors.errors.BadStateException: Executor slurm failed due to: 1. Failed to start block 0: not enough values to unpack (expected 3, got 1)

[2023-06-24 09:01:45 EDT]     INFO: parsl.executors.status_handling: Scaling out by 1 blocks
[2023-06-24 09:01:45 EDT]     INFO: parsl.executors.status_handling: Allocated block ID 1
^C[2023-06-24 09:01:46 EDT]     INFO: root: Received signal SIGINT, shutting down
[2023-06-24 09:01:46 EDT]     INFO: root: Received signal SIGINT, shutting down

I tried digging into the QCFractal source code and also into the Parsl source code, but so far I couldn't solve it.

If you have an idea, let me know.

hadim commented 1 year ago

Sorry for the noise, but maybe this will actually be useful to others.

The error was a silly one caused by YAML parsing of the walltime value: I had written walltime: 100:00:00 instead of walltime: "100:00:00".
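For reference, the relevant slice of the executor config with the fix applied (same keys as the config above; only the walltime line changes):

executors:
  slurm:
    type: slurm
    # Quoted so YAML keeps the HH:MM:SS string instead of parsing it
    # as a sexagesimal integer.
    walltime: "100:00:00"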

bennybp commented 1 year ago

Not really silly. I had the same issue. So since 2/2 people who have used the slurm executor have done the same thing, I need to fix that :)

The Python yaml package turns colon-separated times into an integer number of seconds (YAML 1.1 sexagesimal notation), so I should be able to convert it back in the pydantic model.
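For illustration, here is what PyYAML does with the unquoted value, plus a sketch of the kind of "before" validator I have in mind (the SlurmExecutorConfig class, its default, and the pydantic v2 style below are just an example, not the actual QCFractal model):

import yaml  # PyYAML
from pydantic import BaseModel, field_validator

# PyYAML follows YAML 1.1, so an unquoted HH:MM:SS value is resolved
# as a sexagesimal integer (here, a number of seconds).
print(yaml.safe_load("walltime: 100:00:00"))    # {'walltime': 360000}
print(yaml.safe_load('walltime: "100:00:00"'))  # {'walltime': '100:00:00'}


class SlurmExecutorConfig(BaseModel):
    """Illustrative config model, not the real QCFractal class."""

    walltime: str = "72:00:00"

    @field_validator("walltime", mode="before")
    @classmethod
    def _seconds_to_walltime(cls, v):
        # If YAML already collapsed the value to an integer number of
        # seconds, convert it back to Slurm's HH:MM:SS form.
        if isinstance(v, int):
            hours, rem = divmod(v, 3600)
            minutes, seconds = divmod(rem, 60)
            return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
        return v


print(SlurmExecutorConfig(walltime=360000).walltime)  # 100:00:00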

hadim commented 1 year ago

Good to hear I am not alone xD

bennybp commented 1 year ago

Addressed in https://github.com/MolSSI/QCFractal/commit/45ef30ffca29567fdd32e42a2f7356cc9001141c