LSSTDESC / desc-gen3-prod

Desc-prod wrapper for pipeline production using gen3_workflow.
BSD 3-Clause "New" or "Revised" License

Fix use of WorkQueue executor in g3wfpipe and beyond #2

Closed: dladams closed this issue 11 months ago

dladams commented 1 year ago

I saw errors earlier when using the WorkQueue executor from the g3wfpipe application. I would like to resolve those and ensure that executor is adequately supported.

dladams commented 1 year ago

I tried running a job with WorkQueue:

Despite the reported success, the short run time and the following log snippet indicate the job was not actually successful.

Submit dir: /pscratch/sd/d/dladams/descprod-jobs/job000437/submit/u/dladams/isr/20230621T161828Z
Run Name: u/dladams/isr/20230621T161828Z
2023/06/21 09:19:13.34 work_queue_python[1060860] notice: Could not create work_queue on port 9000.
Process WorkQueue-Submit-Process:
Traceback (most recent call last):
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/site-packages/work_queue.py", line 1973, in __init__
    raise Exception('Could not create work_queue on port %d' % port)
Exception: Could not create work_queue on port 9000

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/site-packages/parsl/process_loggers.py", line 27, in wrapped
    r = func(*args, **kwargs)
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/site-packages/parsl/executors/workqueue/executor.py", line 867, in _work_queue_submit_wait
    raise e
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/site-packages/parsl/executors/workqueue/executor.py", line 861, in _work_queue_submit_wait
    q = WorkQueue(port, debug_log=wq_debug_log)
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/site-packages/work_queue.py", line 1984, in __init__
    raise Exception('Unable to create internal Work Queue structure: %s' % e)
Exception: Unable to create internal Work Queue structure: Could not create work_queue on port 9000
2023-06-21 09:19:13: Check of quantum graph raised an exception.
2023-06-21 09:19:13: Quantum graph was created.
2023-06-21 09:19:13: 
2023-06-21 09:19:13: Starting workflow
Exception in thread WorkQueue-collector-thread:
Traceback (most recent call last):
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/site-packages/parsl/process_loggers.py", line 27, in wrapped
    r = func(*args, **kwargs)
  File "/opt/lsst/software/stack/conda/miniconda3-py38_4.9.2/envs/lsst-scipipe-6.0.0/lib/python3.10/site-packages/parsl/executors/workqueue/executor.py", line 781, in _collect_work_queue_results
    raise ExecutorError(self, "Workqueue Submit Process is not alive")
parsl.executors.errors.ExecutorError: Executor work_queue failed due to: Workqueue Submit Process is not alive
2023-06-21 09:19:24: Workflow task count: 87
2023-06-21 09:19:24: Finished 87 of 87 tasks.
2023-06-21 09:19:34: Workflow complete: 87/87 tasks.
2023-06-21 09:19:34: All steps completed.
dladams commented 1 year ago

Here is the yaml config for the above job:

includeConfigs:
  - ${GEN3_WORKFLOW_DIR}/python/desc/gen3_workflow/etc/bps_drp_baseline.yaml
  - ${GEN3_WORKFLOW_DIR}/examples/bps_DC2-3828-y1_resources.yaml

pipelineYaml: "${DRP_PIPE_DIR}/pipelines/_ingredients/LSSTCam-imSim/DRP.yaml#isr"

payload:
  inCollection: LSSTCam-imSim/defaults
  payloadName: isr
  butlerConfig: /global/cfs/cdirs/lsst/production/gen3/DC2/Run2.2i/repo
  dataQuery: "visit>0 and skymap='DC2' and visit=277"

# Default task properties.
requestCpus: 1
requestMemory: 1000

monitoring: MonitoringHub(hub_address=address_by_hostname(),
                   hub_port=None,
                   monitoring_debug=False,
                   resource_monitoring_interval=60)

parsl_config:
  retries: 1
  executor: WorkQueue
  provider: Local
  nodes_per_block: 1
  worker_options: "--memory=20000"
dladams commented 1 year ago

I ran a status job for the above:

So the incorrect success reported by the original job is likely something for me to fix rather than a problem in gen3_workflow or parsl.

dladams commented 1 year ago

I verified that port 9000 was busy on the (perlmutter login) node where the above job was run:

 login13> checkport 9000 x
Port 9000 is in use: tcp LISTEN 0 100 127.0.0.1:9000 0.0.0.0:*

This is the code for the checkport command:

#!/bin/bash

PORT=${1:-9000}
VERBOSE=$2
LINE=$(ss -tualn | grep ":$PORT " 2>/dev/null)
STAT=$?
if test -n "$VERBOSE"; then
  if test -n "$LINE"; then
    echo Port $PORT is in use: $LINE
  else
    echo Port $PORT is not in use
  fi
fi
exit $STAT
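A Python analogue of checkport, sketched here with only the standard library (not part of the package), decides whether a local port has a listener by attempting a TCP connect:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success, an errno value otherwise.
        return s.connect_ex((host, port)) == 0
```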

I had to log in and out four times to find a machine where the port was not busy. Once I did, I resubmitted the above job. It ran a little slower, spending about two minutes in processing, but the result was the same: all tasks failed.

The console log now looks fine. The parsl log shows this for each task:

1687367401.943255 2023-06-21 10:10:01 MainProcess-1664149 WorkQueue-collector-thread-139936724911872 parsl.dataflow.dflow:351 handle_exec_update ERROR: Task 8 failed after 1 retry attempts. Last exception was: TypeError: Cannot subclass special typing classes

My run directory is /pscratch/sd/d/dladams/descprod-jobs/job000439 and is (now) world readable.

benclifford commented 1 year ago

For the type error, this is the relevant parsl PR https://github.com/Parsl/parsl/pull/2678

Basic fix to try: check that the typing-extensions package mentioned at the end of that PR comment thread is at least version 4.6.0
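One way to perform that check, sketched with only the standard library (the 4.6.0 threshold is the one from the PR thread; the helper names are illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(v: str) -> tuple:
    # Keep the leading numeric components: "4.6.0" -> (4, 6, 0).
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def meets_minimum(installed: str, minimum: str = "4.6.0") -> bool:
    return parse_version(installed) >= parse_version(minimum)

try:
    print(meets_minimum(version("typing_extensions")))
except PackageNotFoundError:
    print("typing_extensions is not installed")
```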

benclifford commented 1 year ago

To make parsl+wq choose a port automatically, specify port number 0 as the work queue port - this was introduced in PR #2602.

so for example this WorkQueueExecutor(label='work_queue', port=9000, ... should become WorkQueueExecutor(label='work_queue', port=0, ...

I think (but I'm not absolutely sure) that in gen3_workflow this happens in python/desc/gen3_workflow/config/parsl_configs.py in the workqueue_config function: https://github.com/LSSTDESC/gen3_workflow/blob/70e5b925f74f2192d7df9e8bddc7c070113fd81b/python/desc/gen3_workflow/config/parsl_configs.py#L142

which looks like it can't be explicitly configured - but could be modified in the gen3_workflow source.
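Port 0 works because binding to port 0 asks the kernel for any free ephemeral port, which avoids the collision seen above. A minimal illustration of the mechanism with a plain socket (not parsl or WorkQueue code):

```python
import socket

def pick_free_port() -> int:
    """Let the OS choose a free ephemeral port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        # getsockname reports the port the kernel actually assigned.
        return s.getsockname()[1]

print(pick_free_port())
```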

dladams commented 1 year ago

@benclifford : I added a dependency on typing_extensions to my package and the typing errors have gone away but I get a crash when I check the futures:

2023-07-03 10:52:22: Starting workflow 
Traceback (most recent call last):
  File "/pscratch/sd/d/dladams/descprod-jobs/job000447/./local/desc-gen3-prod/bin/g3wfpipe-run.py", line 191, in <module>
    futures = [job.get_future() for job in pg.values() if not job.dependencies]
  File "/pscratch/sd/d/dladams/descprod-jobs/job000447/./local/desc-gen3-prod/bin/g3wfpipe-run.py", line 191, in <listcomp>
    futures = [job.get_future() for job in pg.values() if not job.dependencies]
  File "/pscratch/sd/d/dladams/descprod-jobs/job000447/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 354, in get_future
    if self.done:
  File "/pscratch/sd/d/dladams/descprod-jobs/job000447/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 305, in done
    elif self.status == _SUCCEEDED:
  File "/pscratch/sd/d/dladams/descprod-jobs/job000447/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 331, in status
    if self.have_outputs():
  File "/pscratch/sd/d/dladams/descprod-jobs/job000447/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 378, in have_outputs
    for node in self.qgraph_nodes:
  File "/pscratch/sd/d/dladams/descprod-jobs/job000447/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 391, in qgraph_nodes
    qgraph = self.parent_graph.qgraph
  File "/pscratch/sd/d/dladams/descprod-jobs/job000447/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 517, in qgraph
    self._qgraph = QuantumGraph.loadUri(qgraph_file, DimensionUniverse())
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/pipe_base/g807f6e7cd4+93168c461e/python/lsst/pipe/base/graph/graph.py", line 969, in loadUri
    qgraph = loader.load(universe, nodes, graphID)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/pipe_base/g807f6e7cd4+93168c461e/python/lsst/pipe/base/graph/_loadHelpers.py", line 220, in load
    return self.deserializer.constructGraph(nodeSet, _readBytes, universe)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/pipe_base/g807f6e7cd4+93168c461e/python/lsst/pipe/base/graph/_versionDeserializers.py", line 617, in constructGraph
    qnode = QuantumNode.from_simple(nodeDeserialized, loadedTaskDef, universe, recontitutedDimensions)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/pipe_base/g807f6e7cd4+93168c461e/python/lsst/pipe/base/graph/quantumNode.py", line 139, in from_simple
    quantum=Quantum.from_simple(
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/daf_butler/g98bc2251b0+353af21826/python/lsst/daf/butler/core/quantum.py", line 449, in from_simple
    rebuiltDatasetRef = _reconstructDatasetRef(
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/daf_butler/g98bc2251b0+353af21826/python/lsst/daf/butler/core/quantum.py", line 64, in _reconstructDatasetRef
    reconstructedDim = DimensionRecord.from_simple(tmpSerialized, universe=universe)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/daf_butler/g98bc2251b0+353af21826/python/lsst/daf/butler/core/dimensions/_records.py", line 378, in from_simple
    record_model = record_model_cls(**simple.record)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for SpecificSerializedDimensionRecordInstrument
visit_system
  field required (type=value_error.missing)
Terminated
Handling SIGTERM (e.g. from timeout)
dladams commented 1 year ago

On closer inspection I see some log errors:

Installing collected packages: typing-extensions, desc-gen3-prod
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
parsl 2023.5.29.dev0+desc.2023.6.02a requires types-paramiko, which is not installed.
parsl 2023.5.29.dev0+desc.2023.6.02a requires types-requests, which is not installed.
parsl 2023.5.29.dev0+desc.2023.6.02a requires types-six, which is not installed.
parsl 2023.5.29.dev0+desc.2023.6.02a requires typeguard<3,>=2.10, but you have typeguard 3.0.2 which is incompatible.
Successfully installed desc-gen3-prod-0.0.16.dev2 typing-extensions-4.7.1

Should I be installing those as well?

dladams commented 1 year ago

@benclifford I have not been able to get past the typing_extensions version problem. Are you able to run LSST jobs with the current parsl? The latest LSST weekly uses typing_extension 4.4.0.

Is there an older version of DESC parsl that I should use?

benclifford commented 1 year ago

I'm unclear whether that pydantic stack trace is directly related to typing extensions: it seems to be reporting that a particular field (visit_system?) is missing when constructing SpecificSerializedDimensionRecordInstrument.

That seems far away from the parsl code, even though it's happening in the parsl part of gen3_workflow.

(eg this call in the middle of the stack trace that you pasted)

File "/pscratch/sd/d/dladams/descprod-jobs/job000447/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 517, in qgraph
    self._qgraph = QuantumGraph.loadUri(qgraph_file, DimensionUniverse())

I thought that the parsl code in gen3_workflow was meant to go away and be replaced by ctrl_bps_parsl? (maybe Jim, who I can't tag here, can clarify that?)

I'd rather not go debugging gen3_workflow if it really is unsupported now, but otherwise I guess we need to debug what's happening with the graph file being loaded here: self._qgraph = QuantumGraph.loadUri(qgraph_file, DimensionUniverse())

dladams commented 1 year ago

Jumping back to the port issue, Jim sent me this:

I was in the process of implementing and testing this, but then I saw that you can set the port yourself from the config like this:
parsl_config:
  retries: 1
  monitoring: true
  executor: WorkQueue
  provider: Local
  port: 0
This seems better than changing the default in code, especially since a different version of parsl is needed for it to work.

I made this mod in 0.0.16.dev8. Note this includes attempts to resolve the typing_extensions problem.

dladams commented 1 year ago

Jobs with the above mod seem to run fine on a machine where port 9000 was busy. And WorkQueue is now working, though there are some build errors:

Successfully built desc-gen3-prod
Installing collected packages: typing-extensions, desc-gen3-prod
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
parsl 2023.5.29.dev0+desc.2023.6.02a requires types-paramiko, which is not installed.
parsl 2023.5.29.dev0+desc.2023.6.02a requires types-six, which is not installed.
parsl 2023.5.29.dev0+desc.2023.6.02a requires typeguard<3,>=2.10, but you have typeguard 4.0.0 which is incompatible.
Successfully installed desc-gen3-prod-0.0.16.dev8 typing-extensions-4.6.0

Here I explicitly ask for typing_extensions 4.6.0. I ran with apparent success using this config:

  "jobtype": "g3wfpipe",
  "config": "w2321-visit:277-pipe:isr-init-proc",
  "howfig": "shifter-wq:20-tmax:10m",

I set the version to 0.0.16.

benclifford commented 1 year ago

a quick review of those dependency errors:

These two are used if you want to do type checking as part of hacking on the parsl source code. You aren't doing that so it should not be a problem:

parsl 2023.5.29.dev0+desc.2023.6.02a requires types-paramiko, which is not installed.
parsl 2023.5.29.dev0+desc.2023.6.02a requires types-six, which is not installed

I'm fairly certain that parsl will be compatible with typeguard 4.0.0 - I'll try that out and change the requirements file if so:

parsl 2023.5.29.dev0+desc.2023.6.02a requires typeguard<3,>=2.10, but you have typeguard 4.0.0 which is incompatible.
dladams commented 1 year ago

@benclifford thanks for your comments here. Sorry I didn't see this before restarting the discussion on slack.

As reported there, I can now run successfully if I install parsl with dependencies (i.e. without the --no-deps flag) and don't explicitly install all the other packages. I will use this for my studies.

I did not succeed in installing parsl without dependencies and explicitly installing the other packages, including typeguard<3 and typing_extensions==4.6.3. Instead I find these versions (using the install from job 482):

 login24> g3wf-run-cvmfs w2321:tyx
g3wf-setup-cvmfs: Setting up LSST cvmfs distrib w_2023_21
g3wf-setup-cvmfs: Activating conda env /pscratch/sd/d/dladams/descprod-out/installs/g3wf-w2321-tyx
 cvmfs g3wf-w2321-tyx> pypkg_version typeguard
Package typeguard has version 4.0.0
 cvmfs g3wf-w2321-tyx> pypkg_version typing_extensions
Package typing_extensions has version 4.4.0

I will give up on this option for now and use parsl with dependencies.

dladams commented 1 year ago

The cvmfs setup/install script g3wf-setup-cvmfs is modified to install parsl with dependencies if no argument is supplied with the release tag or if the argument "parsl" is used, e.g. w2321 or w2321:parsl. A test job with the former appeared to run fine. Changes are in 0.0.17.
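The argument handling can be sketched as follows; parse_release_arg and the exact option spelling are illustrative guesses, not the script's actual code:

```python
def parse_release_arg(arg: str):
    """Split a release argument like 'w2321' or 'w2321:parsl'.

    Returns (tag, install_parsl_with_deps): parsl is installed with
    dependencies when nothing follows the release tag or when the
    option is 'parsl'. Illustrative sketch, not g3wf-setup-cvmfs itself.
    """
    tag, _, opt = arg.partition(":")
    return tag, opt in ("", "parsl")
```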

dladams commented 1 year ago

The above success was with w2321 (LSST weekly w_2023_21). When I try with w2328, my job crashes because desc_wfmon is missing. I don't know why this wasn't an issue for the previous release.

I added an install of desc_wfmon to the local dir in runapp_wfpipe. Previously this was only done for shifter. With this change, a test job for w2328 runs OK.

dladams commented 1 year ago

The above change is committed in version 0.0.18.

dladams commented 1 year ago

For 0.0.19, add an option butler to g3wfpipe that runs a simple test of the Butler.

dladams commented 1 year ago

I had a job fail with this error:

2023-08-01 12:16:34: Running workflow
2023-08-01 12:16:34: Total task count: 463
Traceback (most recent call last):
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000533/./local/desc-gen3-prod/bin/g3wfpipe-run.py", line 206, in <module>
    futures = [job.get_future() for job in pg.values() if not job.dependencies]
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000533/./local/desc-gen3-prod/bin/g3wfpipe-run.py", line 206, in <listcomp>
    futures = [job.get_future() for job in pg.values() if not job.dependencies]
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000533/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 354, in get_future
    if self.done: 
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000533/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 305, in done
    elif self.status == _SUCCEEDED:
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000533/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 331, in status
    if self.have_outputs():
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000533/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 378, in have_outputs
    for node in self.qgraph_nodes:
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000533/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 391, in qgraph_nodes
    qgraph = self.parent_graph.qgraph
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000533/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 517, in qgraph
    self._qgraph = QuantumGraph.loadUri(qgraph_file, DimensionUniverse())
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2023_21/conda/envs/lsst-scipipe-6.0.0-exact-ext/share/eups/Linux64/pipe_base/g807f6e7cd4+93168c461e/python/lsst/pipe/base/graph/graph.py", line 969, in loadUri
    qgraph = loader.load(universe, nodes, graphID)
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2023_21/conda/envs/lsst-scipipe-6.0.0-exact-ext/share/eups/Linux64/pipe_base/g807f6e7cd4+93168c461e/python/lsst/pipe/base/graph/_loadHelpers.py", line 220, in load
    return self.deserializer.constructGraph(nodeSet, _readBytes, universe)
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2023_21/conda/envs/lsst-scipipe-6.0.0-exact-ext/share/eups/Linux64/pipe_base/g807f6e7cd4+93168c461e/python/lsst/pipe/base/graph/_versionDeserializers.py", line 617, in constructGraph
    qnode = QuantumNode.from_simple(nodeDeserialized, loadedTaskDef, universe, recontitutedDimensions)
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2023_21/conda/envs/lsst-scipipe-6.0.0-exact-ext/share/eups/Linux64/pipe_base/g807f6e7cd4+93168c461e/python/lsst/pipe/base/graph/quantumNode.py", line 139, in from_simple
    quantum=Quantum.from_simple(
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2023_21/conda/envs/lsst-scipipe-6.0.0-exact-ext/share/eups/Linux64/daf_butler/g98bc2251b0+353af21826/python/lsst/daf/butler/core/quantum.py", line 449, in from_simple
    rebuiltDatasetRef = _reconstructDatasetRef(
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2023_21/conda/envs/lsst-scipipe-6.0.0-exact-ext/share/eups/Linux64/daf_butler/g98bc2251b0+353af21826/python/lsst/daf/butler/core/quantum.py", line 64, in _reconstructDatasetRef
    reconstructedDim = DimensionRecord.from_simple(tmpSerialized, universe=universe)
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2023_21/conda/envs/lsst-scipipe-6.0.0-exact-ext/share/eups/Linux64/daf_butler/g98bc2251b0+353af21826/python/lsst/daf/butler/core/dimensions/_records.py", line 378, in from_simple
    record_model = record_model_cls(**simple.record)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for SpecificSerializedDimensionRecordInstrument
visit_system 
  field required (type=value_error.missing)

In 0.0.20, I add a try block that catches this exception and retries up to 10 times before giving up. When I resubmitted the job, the exception was not raised and the job finished without incident.
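The retry logic can be sketched like this; load_qgraph stands in for the QuantumGraph.loadUri call, and the names are illustrative rather than the package's actual code:

```python
import time

def load_with_retries(load_qgraph, max_tries=10, delay=1.0):
    """Call load_qgraph(), retrying up to max_tries times on exception."""
    for itry in range(1, max_tries + 1):
        try:
            return load_qgraph()
        except Exception as exc:
            # Mirrors the "Try N raised exception: ..." log messages.
            print(f"Try {itry} raised exception: {exc}")
            if itry == max_tries:
                raise
            time.sleep(delay)
```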

dladams commented 1 year ago

I noticed that an exception was raised if the user asked for status in g3wfpipe before the pipeline had started running. I fixed this to instead return the message "Workflow has not started." The change is in 0.0.21.

dladams commented 1 year ago

For 0.0.22, add an application g3wfenv that installs and checks, or removes, g3wf environments. The check returns the parsl version.

dladams commented 1 year ago

For 0.0.23, add script that was inadvertently omitted from git in 0.0.22.

dladams commented 1 year ago

For 0.0.24, g3wfpipe now takes gen3_workflow from $HOME/descprod/repos/gen3_workflow if that location is readable.

dladams commented 1 year ago

For 0.0.25, update config templates so each job has a perfstat report in its error log.

dladams commented 1 year ago

For 0.0.26, add first pass at performance plots including monexp.ipynb copied from desc-wfmon. Navigate to the run directory in Jupyter and execute the notebook to generate performance plots.

dladams commented 1 year ago

For 0.0.27, add g3wfpipe howfig option bproc which splits the processing steps so that

  1. All steps before prod are run immediately
  2. Then a DESCprod job is created and started to run proc in batch
  3. Then a DESCprod job is created to run the steps after proc (not yet implemented)

I also stopped putting the pickle file location in the progress log; it made the status message too long.

dladams commented 1 year ago

For 0.0.28, remove the reinstallation of local packages in continuation jobs.

A bug in the PATH for the sysmon was introduced and then fixed in 0.0.29.

dladams commented 1 year ago

For 0.0.30:

dladams commented 1 year ago

For 0.0.31, add howfig option pmon:DT to set the parsl monitoring interval. This was added and many different values were tested successfully.

dladams commented 1 year ago

I noticed I was getting slow parsl launch (2-3 sec/task) for a w2330 request. On investigation, I found the parsl install had failed and I was picking up parsl from the LSST release. For 0.0.32:

dladams commented 1 year ago

For 0.0.33:

dladams commented 1 year ago

For 0.0.34, I want to understand which files are accessed in pipeline tasks. I couldn't get iotrace to work as a command prepend and so went to strace with a file filter. Right now, all tasks write to strace.log in the run dir. The trace is only run for user dladams.
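A sketch of how such a command prepend might be assembled (the exact flags the package uses are not shown in this thread; -f and -e trace=%file are standard strace options for following children and tracing file-related syscalls):

```python
import shlex

def strace_prepend(task_cmd: str, logfile: str = "strace.log") -> list:
    """Wrap a task command so it runs under strace with a file filter."""
    prepend = f"strace -f -e trace=%file -o {logfile}"
    return shlex.split(prepend) + shlex.split(task_cmd)

print(strace_prepend("python mytask.py"))
```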

dladams commented 1 year ago

For 0.0.35, add howfig option strace to run strace with each task. I also added option stime to produce an strace timing report.

dladams commented 1 year ago

For 0.0.36: The first part of a bproc job failed because sysmon.pid was not present. I modified g3wfproc-run to only kill the monitor if that file is present.

dladams commented 1 year ago

For 0.0.37.

Versions 08 and 09 give this error for each task:

2023-09-13 17:06:43: Try 8 raised exception: 1 validation error for SpecificSerializedDimensionRecordInstrument
visit_system
  field required (type=value_error.missing)

Versions 10 and 11 give this error in a parsl import:

Traceback (most recent call last):
  File "./local/desc-gen3-prod/bin/g3wfpipe-run.py", line 178, in <module>
    from desc.gen3_workflow import start_pipeline
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000910/gen3_workflow/python/desc/gen3_workflow/__init__.py", line 1, in <module>
    from .parsl_service import *
  File "/pscratch/sd/d/dladams/descprod-out/jobs/job000910/gen3_workflow/python/desc/gen3_workflow/parsl_service.py", line 13, in <module>
    import lsst.utils
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/utils/ga34f01d9d8+f0a1800e64/python/lsst/utils/__init__.py", line 14, in <module>
    from .deprecated import *
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-6.0.0/Linux64/utils/ga34f01d9d8+f0a1800e64/python/lsst/utils/deprecated.py", line 20, in <module>
    import deprecated.sphinx
ModuleNotFoundError: No module named 'deprecated'

The docker files are at docker/gen3workflow

dladams commented 1 year ago

For 0.0.38 I am adding a prescription based on the dockerfile Jim Chiang uses to run with w_2023_32: https://lsstc.slack.com/archives/C038ZBE4QJJ/p1692824749070059

I drop the Rubin packages and make equivalent installs in g3wf-install-parsl-w2332. I also fix one "=" --> "==" and drop the python spec for ndcctools (commented out).

I add dockerfile-12 which uses this script to build its env. But it does not work for w_2023_21.

dladams commented 1 year ago

For 0.0.39, start working with LSST w_2023_32:

dladams commented 11 months ago

Future development is described in #5.

dladams commented 11 months ago

For 0.0.40, added field nNNN to the strace howfig (e.g. strace:n10:file1) so there is a 1/NNN chance of running strace. I dropped time from the prepend; it was giving problems. I added the -f (follow children) flag to strace to see the FITS files.
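The 1/NNN sampling can be sketched as follows (an illustrative stand-in, not the package's code):

```python
import random

def should_trace(nsample: int, rng: random.Random) -> bool:
    # True with probability 1/nsample, e.g. roughly 10% of tasks
    # would run under strace for strace:n10.
    return rng.randrange(nsample) == 0

rng = random.Random(1234)
hits = sum(should_trace(10, rng) for _ in range(10000))
print(hits)
```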