AgnostiqHQ / covalent-slurm-plugin

Executor plugin interfacing Covalent with Slurm
https://covalent.xyz
Apache License 2.0

v0.18.0 appears to be broken: no `sbatch` of jobs #92

Closed Andrew-S-Rosen closed 7 months ago

Andrew-S-Rosen commented 7 months ago

Environment

What is happening?

I can get the covalent-slurm-plugin to work fine with 0.16.0 but not with 0.18.0; I suspect the refactoring effort introduced a bug. Running any minimal example with the covalent-slurm-plugin yields an error akin to

```
scp: /global/homes/r/rosen/quacc/7f078879-ee9e-45a4-960b-98839dfdb1b8/node_0/stdout-7f078879-ee9e-45a4-960b-98839dfdb1b8-0.log: No such file or directory
```

with the following log

```
Exception in ASGI application

Traceback (most recent call last):
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 404, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/home/rosen/software/miniconda/envs/covalent/lib/python3.10/site-packages/covalent_ui/api/v1/routes/end_points/electron_routes.py", line 216, in get_electron_file
    response, python_object = handler.read_from_serialized(result["results_filename"])
TypeError: cannot unpack non-iterable NoneType object
```
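The final `TypeError` is consistent with the `scp` failure above: if the job is never submitted, no results file ever exists to copy back, and a read that returns nothing cannot be unpacked into two values. The stand-in below is an assumption for illustration, not the plugin's or Covalent UI's actual code; it only shows how a `None` return produces this exact message.

```python
def read_from_serialized(path):
    """Hypothetical stand-in for the handler: returns None when the
    results file was never produced (e.g. sbatch never ran)."""
    return None


message = ""
try:
    # Mirrors the failing line in electron_routes.py: two-value unpacking
    # of a return value that is None.
    response, python_object = read_from_serialized("results.pkl")
except TypeError as exc:
    message = str(exc)

print(message)  # cannot unpack non-iterable NoneType object
```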

How can we reproduce the issue?

Use covalent-slurm-plugin==0.18.0 and submit any minimal example like that in the README.

This was my setup for what it's worth:


```python
import covalent as ct  # import was omitted in the original snippet

n_nodes = 1
n_cores_per_node = 1

executor = ct.executor.SlurmExecutor(
    username="rosen",
    address="perlmutter-p1.nersc.gov",
    ssh_key_file="/home/rosen/.ssh/nersc",
    cert_file="/home/rosen/.ssh/nersc-cert.pub",
    conda_env="quacc",
    options={
        "nodes": f"{n_nodes}",
        "qos": "debug",
        "constraint": "cpu",
        "account": "matgen",
        "job-name": "quacc",
        "time": "00:10:00",
    },
    remote_workdir="/pscratch/sd/r/rosen/quacc",
    create_unique_workdir=True,
    use_srun=False,
    cleanup=False,
)
```

What should happen?

The job should be submitted to the queue; in reality, it never is.

Any suggestions?

No response

Andrew-S-Rosen commented 7 months ago

Closing this in favor of a more complete reproduction of the issue.