galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Remote tool evaluation does not work in pulsar #16744

Open sanjaysrikakulam opened 11 months ago

sanjaysrikakulam commented 11 months ago

Describe the bug

At EU, we are testing deferred datasets in combination with Pulsar (including embedded Pulsar). This does not seem to work, and the jobs fail.

I am running a test Galaxy (EU replica) instance with TPV, and I have also connected it with a Pulsar instance for testing.

I tested the following combinations on three destinations: local (via HTCondor), Pulsar, and embedded Pulsar.


**Local (local job destination with HTCondor; runner: CondorJobRunner)**

| Job id | Deferred data | Metadata strategy | Tool eval | Job successfully ran |
| --- | --- | --- | --- | --- |
| 35 | No | Extended | Remote | Yes |
| 36 | Yes | Extended | Remote | Yes |

**Pulsar destination (runner: pulsar)**

| Job id | Deferred data | Metadata strategy | Tool eval | Job successfully ran |
| --- | --- | --- | --- | --- |
| 31 | No | Directory | Remote | Yes |
| 34 | No | Extended | Local | Yes |
| 37 | No | Extended | Remote | No |
| 32 | Yes | Directory | Remote | No |
| 38 | Yes | Extended | Remote | No |

**Embedded Pulsar destination (runner: PulsarEmbeddedJobRunner)**

| Job id | Deferred data | Metadata strategy | Tool evaluation strategy | Job successfully ran |
| --- | --- | --- | --- | --- |
| 42 | Yes | Extended | Remote | No |
| 43 | No | Extended | Remote | No |
| 44 | Yes | Directory | Local | No |
| 45 | No | Directory | Local | Yes |
| 46 | No | Extended | Local | Yes |
| 47 | No | Extended | Remote | No |

Galaxy Version and/or server at which you observed the bug

Galaxy Version: 23.1

To Reproduce

  1. Assuming the following:
    1. Galaxy 23.1
    2. TPV, and a job scheduler is configured and connected to Galaxy
    3. The Pulsar instance is created, configured, and connected to the Galaxy instance
    4. Non-deferred dataset jobs are getting scheduled in local, pulsar, and embedded pulsar destinations and successfully finishing.
  2. Add metadata_strategy: extended and tool_evaluation_strategy: remote to the params of the respective destinations (local, pulsar, or embedded_pulsar) in TPV.
  3. Reload Galaxy handlers.

Expected behavior

The deferred dataset is fetched on the Pulsar instance, and the job runs successfully.

Tracebacks

  1. Traceback of job id: 32 (Deferred data: Yes, Metadata strategy: directory, Tool eval: remote, Pulsar: Yes)
/pulsar_data/staging/ps01/32/inputs/dataset_ee076508-67fe-4314-be18-111ea2a7bd5f.dat path does not exist.
  2. Traceback of job id: 37 (Deferred data: No, Metadata strategy: extended, Tool eval: remote, Pulsar: Yes)
Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/jobs/runners/pulsar.py", line 694, in finish_job
    job_wrapper.finish(
  File "/opt/galaxy/server/lib/galaxy/jobs/__init__.py", line 1922, in finish
    import_model_store.perform_import(history=job.history, job=job)
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 377, in perform_import
    datasets_attrs = self.datasets_properties()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 1490, in datasets_properties
    datasets_attrs = load(open(datasets_attrs_file_name))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/twd01/main/000/37/metadata/outputs_populated/datasets_attrs.txt'

_Note: On the Pulsar instance, /pulsar_data/staging/ps01/37/metadata/job_stderr contained the following error, and the tool_stdout and tool_stderr files were not found: python: can't open file 'None/galaxy/tools/remote_tool_eval.py': [Errno 2] No such file or directory_

  3. Traceback of job id: 38 (Deferred data: Yes, Metadata strategy: extended, Tool eval: remote, Pulsar: Yes)
Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/jobs/runners/pulsar.py", line 694, in finish_job
    job_wrapper.finish(
  File "/opt/galaxy/server/lib/galaxy/jobs/__init__.py", line 1922, in finish
    import_model_store.perform_import(history=job.history, job=job)
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 377, in perform_import
    datasets_attrs = self.datasets_properties()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 1490, in datasets_properties
    datasets_attrs = load(open(datasets_attrs_file_name))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/twd01/main/000/38/metadata/outputs_populated/datasets_attrs.txt'

_Note: On the Pulsar instance, the input dataset file in the job working directory is empty, and the tool_stdout and tool_stderr files were not found. /pulsar_data/staging/ps01/38/metadata/job_stderr contained the following error: python: can't open file 'None/galaxy/tools/remote_tool_eval.py': [Errno 2] No such file or directory_

  4. Traceback of job id: 42 (Deferred data: Yes, Metadata strategy: extended, Tool eval: remote, Embedded Pulsar: Yes)
Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/jobs/runners/pulsar.py", line 694, in finish_job
    job_wrapper.finish(
  File "/opt/galaxy/server/lib/galaxy/jobs/__init__.py", line 1922, in finish
    import_model_store.perform_import(history=job.history, job=job)
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 377, in perform_import
    datasets_attrs = self.datasets_properties()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 1490, in datasets_properties
    datasets_attrs = load(open(datasets_attrs_file_name))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/twd01/main/000/42/metadata/outputs_populated/datasets_attrs.txt'

  5. Traceback of job id: 43 (Deferred data: No, Metadata strategy: extended, Tool eval: remote, Embedded Pulsar: Yes)

Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/jobs/runners/pulsar.py", line 694, in finish_job
    job_wrapper.finish(
  File "/opt/galaxy/server/lib/galaxy/jobs/__init__.py", line 1922, in finish
    import_model_store.perform_import(history=job.history, job=job)
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 377, in perform_import
    datasets_attrs = self.datasets_properties()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 1490, in datasets_properties
    datasets_attrs = load(open(datasets_attrs_file_name))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/twd01/main/000/43/metadata/outputs_populated/datasets_attrs.txt'

_Note: Found the following traceback in the job_stderr file_


Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/metadata/set_metadata.py", line 152, in get_metadata_params
    with open(metadata_params_path) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/twd01/pulsar_staging/43/working/../metadata/params.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/tools/remote_tool_eval.py", line 131, in <module>
    main(TMPDIR, WORKING_DIRECTORY, IMPORT_STORE_DIRECTORY)
  File "/opt/galaxy/server/lib/galaxy/tools/remote_tool_eval.py", line 75, in main
    metadata_params = get_metadata_params(WORKING_DIRECTORY)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/galaxy/server/lib/galaxy/metadata/set_metadata.py", line 155, in get_metadata_params
    raise Exception(f"Failed to find metadata/params.json from cwd [{tool_job_working_directory}]")
Exception: Failed to find metadata/params.json from cwd [/data/twd01/pulsar_staging/43/working/..]
  6. Traceback of job id: 44 (Deferred data: Yes, Metadata strategy: directory, Tool eval: local, Embedded Pulsar: Yes)
/data/twd01/pulsar_staging/44/inputs/dataset_ee076508-67fe-4314-be18-111ea2a7bd5f.dat path does not exist.
  7. Traceback of job id: 47 (Deferred data: No, Metadata strategy: extended, Tool eval: remote, Embedded Pulsar: Yes)
Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/jobs/runners/pulsar.py", line 694, in finish_job
    job_wrapper.finish(
  File "/opt/galaxy/server/lib/galaxy/jobs/__init__.py", line 1922, in finish
    import_model_store.perform_import(history=job.history, job=job)
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 377, in perform_import
    datasets_attrs = self.datasets_properties()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/galaxy/server/lib/galaxy/model/store/__init__.py", line 1490, in datasets_properties
    datasets_attrs = load(open(datasets_attrs_file_name))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/twd01/main/000/47/metadata/outputs_populated/datasets_attrs.txt'

_Note: Found the following traceback in the job_stderr file_

Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/metadata/set_metadata.py", line 152, in get_metadata_params
    with open(metadata_params_path) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/twd01/pulsar_staging/47/working/../metadata/params.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/galaxy/server/lib/galaxy/tools/remote_tool_eval.py", line 131, in <module>
    main(TMPDIR, WORKING_DIRECTORY, IMPORT_STORE_DIRECTORY)
  File "/opt/galaxy/server/lib/galaxy/tools/remote_tool_eval.py", line 75, in main
    metadata_params = get_metadata_params(WORKING_DIRECTORY)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/galaxy/server/lib/galaxy/metadata/set_metadata.py", line 155, in get_metadata_params
    raise Exception(f"Failed to find metadata/params.json from cwd [{tool_job_working_directory}]")
Exception: Failed to find metadata/params.json from cwd [/data/twd01/pulsar_staging/47/working/..]
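
The `None/galaxy/tools/remote_tool_eval.py` path reported for jobs 37 and 38 looks like an unset value formatted straight into a command string. The sketch below only illustrates that general Python behavior. The function names and the `galaxy_lib` parameter are hypothetical and are not taken from the Galaxy or Pulsar code, so this is a guess at the kind of mistake involved, not a diagnosis.

```python
import posixpath


def build_remote_eval_command(galaxy_lib):
    """Hypothetical helper: interpolates the library root without checking it."""
    # If galaxy_lib is None, the f-string silently renders it as the text "None".
    return f"python {galaxy_lib}/galaxy/tools/remote_tool_eval.py"


def build_remote_eval_command_checked(galaxy_lib):
    """Hypothetical variant that fails early when the library root is unset."""
    if not galaxy_lib:
        raise ValueError("library root is not configured for this destination")
    script = posixpath.join(galaxy_lib, "galaxy", "tools", "remote_tool_eval.py")
    return f"python {script}"


# Reproduces the shape of the path seen in job_stderr.
print(build_remote_eval_command(None))
# The same helper with a configured root, using the path from the tracebacks.
print(build_remote_eval_command_checked("/opt/galaxy/server/lib"))
```

If something similar happens here, failing early as in the checked variant would surface the missing setting at submission time instead of as a missing datasets_attrs.txt when the job finishes.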

Additional info

  1. I compared the directory structure between local and deferred dataset jobs, and there isn't much difference.
  2. When tool_evaluation_strategy: remote is set, I do not see how the data is fetched. I cannot find any "download" code in the submission script or in tool_script.sh, and the job handler logs show nothing being downloaded for that job. So I do not understand how the data is fetched when a deferred dataset is used (I have not dug through the code yet).
  3. I also set metadata_strategy: extended and tool_evaluation_strategy: remote globally in galaxy.yml to check whether the behavior changes; it does not.
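
The directory comparison from item 1 can be made repeatable with a small check for the two files that the tracebacks report as missing. This is a minimal sketch: the helper name `missing_job_files` is hypothetical, and the relative paths are copied from the error messages above, not from any documented Galaxy layout.

```python
from pathlib import Path

# Relative paths taken from the FileNotFoundError messages in the tracebacks.
EXPECTED_FILES = (
    "metadata/params.json",
    "metadata/outputs_populated/datasets_attrs.txt",
)


def missing_job_files(job_directory, expected=EXPECTED_FILES):
    """Return the expected relative paths that are absent under job_directory."""
    root = Path(job_directory)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Job 43: the staging directory and the Galaxy job directory named in the tracebacks.
    for directory in ("/data/twd01/pulsar_staging/43", "/data/twd01/main/000/43"):
        print(directory, missing_job_files(directory))
```

Running it against both the staging directory and the Galaxy job directory of a failing job and a succeeding job shows at a glance which side never received or produced the files.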

Questions/Observations

  1. How does the deferred dataset get downloaded when tool_evaluation_strategy: remote is set? Shouldn't a code snippet for this exist in the job submission script or the tool script?
  2. Jobs fail when:
    1. tool_evaluation_strategy: remote is combined with Pulsar or embedded Pulsar, whether the dataset is local or deferred
    2. Deferred dataset in general with pulsar or embedded pulsar
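
Read against the result tables, the failures seem to fit a slightly narrower pattern: on the two Pulsar runners a job failed when the dataset was deferred, or when extended metadata was combined with remote tool evaluation (job 31, with directory metadata and remote evaluation, succeeded). The sketch below encodes the reported results and checks that hypothesis. The rule in `predicted_failure` is an assumption derived only from these 13 jobs, not a statement about how Galaxy behaves in general.

```python
from typing import NamedTuple


class JobResult(NamedTuple):
    job_id: int
    runner: str             # "condor", "pulsar", or "embedded_pulsar"
    deferred: bool
    metadata_strategy: str  # "directory" or "extended"
    tool_eval: str          # "local" or "remote"
    succeeded: bool


# Rows transcribed from the tables in the report.
RESULTS = [
    JobResult(35, "condor", False, "extended", "remote", True),
    JobResult(36, "condor", True, "extended", "remote", True),
    JobResult(31, "pulsar", False, "directory", "remote", True),
    JobResult(34, "pulsar", False, "extended", "local", True),
    JobResult(37, "pulsar", False, "extended", "remote", False),
    JobResult(32, "pulsar", True, "directory", "remote", False),
    JobResult(38, "pulsar", True, "extended", "remote", False),
    JobResult(42, "embedded_pulsar", True, "extended", "remote", False),
    JobResult(43, "embedded_pulsar", False, "extended", "remote", False),
    JobResult(44, "embedded_pulsar", True, "directory", "local", False),
    JobResult(45, "embedded_pulsar", False, "directory", "local", True),
    JobResult(46, "embedded_pulsar", False, "extended", "local", True),
    JobResult(47, "embedded_pulsar", False, "extended", "remote", False),
]


def predicted_failure(result):
    """Hypothesis only: failure pattern inferred from the reported jobs."""
    if result.runner == "condor":
        return False
    extended_remote = (
        result.metadata_strategy == "extended" and result.tool_eval == "remote"
    )
    return result.deferred or extended_remote


# Jobs whose outcome contradicts the hypothesis.
mismatches = [r.job_id for r in RESULTS if predicted_failure(r) == r.succeeded]
print(mismatches)  # → []
```

An empty list means every reported job is consistent with the hypothesis, which may help separate the deferred-data failures from the remote-evaluation ones.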

I'd be happy to share the test instance for testing this.

sanjaysrikakulam commented 11 months ago

Ping @bgruening

bgruening commented 11 months ago

Thanks @sanjaysrikakulam for writing all this down.

mvdbeek commented 11 months ago

We don't test tool_evaluation_strategy: remote with Pulsar. I don't think this is a problem with deferred data per se, but rather with remote tool evaluation. Can you turn that off for now while we work on a fix?

sanjaysrikakulam commented 11 months ago

Thank you! Sure! Anyway, it's only on a private test instance. I look forward to a fix.