LSSTDESC / desc-gen3-prod

Desc-prod wrapper for pipeline production using gen3_workflow.
BSD 3-Clause "New" or "Revised" License

Version 0.0.41 and beyond #5

Open dladams opened 11 months ago

dladams commented 11 months ago

Continuing development. For earlier changes see #2.

dladams commented 11 months ago

For version 0.0.41, add an option to rebuild an existing cvmfs SW cache.

dladams commented 11 months ago

Version 0.0.41 also adds a config option to change the output directory so we can switch output file systems. The procedure is to create a directory or a link to a directory:

/global/cfs/cdirs/lsst/production/gen3/DC2/Run2.2i/repo/u/USER-XXX

where USER is the user name, and then include the field usuf:XXX in the config to direct data output there. Also, the field pmsYYY is equivalent to usuf:pmsYYY and cfsYYY to usuf:cfsYYY. Note that XXX and YYY cannot include the config delimiters dash ('-') or colon (':'). YYY can be blank. If XXX is blank or all these fields are omitted, then the base directory is just USER (no dash).
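
For illustration, here is a minimal Python sketch of the naming rule (the function name and the example values are hypothetical):

# Sketch of the base-directory naming rule described above.
def base_dirname(user: str, xxx: str = "") -> str:
    """Return USER-XXX, or just USER when XXX is blank."""
    return f"{user}-{xxx}" if xxx else user

assert base_dirname("dladams", "test1") == "dladams-test1"
assert base_dirname("dladams") == "dladams"  # no dash when XXX is blank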

dladams commented 11 months ago

For version 0.0.42:

dladams commented 11 months ago

For 0.0.42 continued:

dladams commented 10 months ago

For 0.0.43, I am adding the script g3wf-run-task, which provides a wrapper for running tasks. Here are the current capabilities (dev1):

 login11>./g3wf-run-task --help
Usage: g3wf-run-task [OPT1 OPT2 ...] COM
  COM is the run command
  OPTi is any of
    --verbose: Enable verbose logging
      --norun: Do not run user or monitor commands
      -t TSAM: Set sampling period to TSAM seconds (0 to disable)
       --date: Monitor date and time.
         --ps: Monitor user process.
     --pstree: Monitor user process tree.
      --wtree: Monitor process tree for this script.
   --procstat: Monitor status for the run process (/proc/PID/stat).
     --cpunum: Monitor CPU number and affinity for the run process.
   --perfstat: Run task inside "perf stat -d".
All monitoring commands are run every TSAM seconds with that period
in seconds set to 10 or the preceding -t value.
All output lines for monitor flag --FFF begin with "FFF[ TTT]: "
where TTT is the time in seconds since the monitor was started.
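
The monitor flags all follow the same pattern. Here is a minimal Python sketch of that loop (an illustration only, not the actual script):

# Run a monitor command every tsam seconds and prefix each output
# line with "FFF[ TTT]: ", TTT being seconds since the monitor started.
import subprocess
import time

def monitor_loop(fff, com, tsam=10):
    t0 = time.time()
    while True:  # runs until the wrapper kills it
        out = subprocess.run(com, capture_output=True, text=True).stdout
        ttt = int(time.time() - t0)
        for line in out.splitlines():
            print(f"{fff}[ {ttt}]: {line}", flush=True)
        time.sleep(tsam)

# e.g. monitor_loop("date", ["date"]) corresponds to the --date monitor.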

Now using task prepend:

PREPEND="g3wf-run-task --perfstat"

to replicate old behavior.

With this, I see the messages from the new task wrapper go to the job log instead of the individual task logs. I modify the script to send its progress and debug messages to stderr (dev2) and also its monloop messages to stderr (dev3). For the latter, I also add cpunum monitoring.

I leave the script in runapp with cpunum and proctxt monitored at 5 sec. But these do not work in shifter:15 because bc is missing. I fix that in shifter:16 and move to w_2023_44.

dladams commented 9 months ago

For 0.0.44, working toward running jobs with multiple single-frame task types. First, some little fixes to the setup.sh script produced by runapp-g3wfpipe:

It appears I can run successfully using the pipeline "ccd", which runapp-g3wfpipe maps to "isr,characterizeImage,calibrate", except if I use WorkQueue with shifter. That problem is the subject of issue #7.

As part of trying to address that problem, I added something else?

dladams commented 9 months ago

For 0.0.45, I am trying to fix status reports for multi-type jobs that only report progress for the last type.

I made a change to fix this in g3wfpipe-run.py, removing the qualifier when selecting futures. I am not sure why it was done that way in the first place.

But then test job 1286 failed when a failed install of desc-gen3-prod left the job in the wrong directory. I changed runapp-g3wfpipe to use subprocess so that such a failure does not leave us in the wrong directory. Test job 1287 has the desired behavior.
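
A minimal sketch of the idea (command and directory names hypothetical): the install runs in a child process, so a failure there cannot change the parent script's working directory the way an in-process chdir can.

import subprocess

def run_install(command, workdir):
    # cwd applies only to the child; the caller's directory is untouched
    # even if the command fails partway through.
    result = subprocess.run(command, cwd=workdir)
    return result.returncode == 0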

dladams commented 9 months ago

For 0.0.46, investigate limiting the number of concurrent task sequences (e.g. isr-characterizeImage-calibrate) to force sequences to run to completion before starting new sequences.

Note that process groups are described in Gower et al. and documented, along with the rest of BPS, in the BPS quickstart.

The job with grouping has the expected change in config.yaml:

 login26> diff job001295/config.yaml job001296
25a26,32
> 
> clusterAlgorithm: lsst.ctrl.bps.quantum_clustering_funcs.dimension_clustering
> cluster:
>   ccd:
>     pipetasks: isr, characterizeImage, calibrate
>     dimensions: detector, visit
>     equalDimensions: visit:exposure
 

I tried to run test jobs 1297/1298 w/o clustering, but they failed:

2023-12-04 08:28:42: Finished 50 of 2985 tasks.
./local/desc-gen3-prod/bin/g3wfpipe-run: line 53: 1235164 Segmentation fault      $THISPTH.py $ARGS
g3wfpipe-run: Terminating sysmon.
g3wfpipe-run: Exiting with status 139.

Sysmon shows memory usage fairly flat and under 200 GB. I add stderr to the log in 0.0.46.dev7 to see if I get more info if this happens again.

dladams commented 9 months ago

I changed the processing in g3wfpipe-run.py to directly retrieve endpoint futures and monitor progress based on copies of ParslJob objects. But my busy jobs time out, apparently because the ParslJob objects do not correctly reflect the state of the corresponding task. This is discussed in LSSTDESC/gen3_workflow#59.

Following the discussion in that issue, I monitor progress using job names and check these using the dataframe in ParslGraph. When exec_done is reached, I then use ParslJob to check the log and see whether the job succeeded or failed. This is triggered by a flag which is hardwired on for now. If the check fails, a bad-log counter is incremented and the nonzero value is reported on the status line.
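
Schematically, the new monitoring logic looks something like this (the dataframe columns and the check_log helper are illustrative assumptions, not the actual gen3_workflow API):

def count_progress(df, check_log):
    # One row per task: job name and parsl status from the ParslGraph dataframe.
    ndone = nbadlog = 0
    for name, status in zip(df["job_name"], df["status"]):
        if status != "exec_done":
            continue
        if check_log(name):  # e.g. wraps the ParslJob log check
            ndone += 1
        else:
            nbadlog += 1     # reported on the status line when nonzero
    return ndone, nbadlog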

dladams commented 8 months ago

For 0.0.47, explore adding iotrace as an option. The iotrace I am using is avner-levy/iotrace on GitHub. It produces a summary file with one line for each file:

filename, bytes read, bytes written, number of opens, # reads, # writes
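
A minimal sketch of reading one such summary line (the exact delimiting and whitespace are assumptions):

def parse_iotrace_line(line):
    # Split from the right so a comma in the file name is tolerated.
    fname, bread, bwrit, nopen, nread, nwrit = line.rsplit(",", 5)
    return {"filename": fname.strip(),
            "bytes_read": int(bread), "bytes_written": int(bwrit),
            "opens": int(nopen), "reads": int(nread), "writes": int(nwrit)}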

I also added a section to run perf stat with the process ID. It is still being run as the prefix in the run command. Here is a comparison of the two for one task:

Here perf stat is called with the ID of the process that is forked
to execute the run command.

# started on Thu Dec 28 12:26:17 2023

 Performance counter stats for process id '151470':

         38,398.98 msec task-clock:u              #    0.834 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
         1,639,834      page-faults:u             #   42.705 K/sec                  
    91,768,680,967      cycles:u                  #    2.390 GHz                      (36.07%)
       185,319,256      stalled-cycles-frontend:u #    0.20% frontend cycles idle     (35.96%)
       684,704,315      stalled-cycles-backend:u  #    0.75% backend cycles idle      (36.09%)
   302,884,802,959      instructions:u            #    3.30  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (36.08%)
    63,498,038,601      branches:u                #    1.654 G/sec                    (35.71%)
       358,515,054      branch-misses:u           #    0.56% of all branches          (35.80%)

      46.050090565 seconds time elapsed

Here perf stat is the prefix in the run command.

         38,324.33 msec task-clock:u              #    0.853 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
         1,639,544      page-faults:u             #   42.781 K/sec
    92,152,328,018      cycles:u                  #    2.405 GHz                      (36.00%)
       193,446,804      stalled-cycles-frontend:u #    0.21% frontend cycles idle     (35.87%)
       673,915,003      stalled-cycles-backend:u  #    0.73% backend cycles idle      (35.79%)
   306,898,306,137      instructions:u            #    3.33  insn per cycle
                                                  #    0.00  stalled cycles per insn  (35.69%)
    64,264,430,874      branches:u                #    1.677 G/sec                    (35.43%)
       362,851,539      branch-misses:u           #    0.56% of all branches          (35.39%)
    73,691,713,364      L1-dcache-loads:u         #    1.923 G/sec                    (35.48%)
     1,810,377,586      L1-dcache-load-misses:u   #    2.46% of all L1-dcache accesses  (35.50%)
         LLC-loads:u
         LLC-load-misses:u

      44.942931902 seconds time elapsed

      29.389307000 seconds user
       9.352064000 seconds sys

dladams commented 8 months ago

0.0.47 continued.

Changes in g3wf-run-task:

dladams commented 8 months ago

0.0.47 continued:

I created gen3workflow dockerfile version 17 that starts from 16 and adds iotrace. I built an image for w_2023_44 and installed it on Perlmutter. I also added a script build-iotrace that does as the name suggests and is used in the dockerfile.

I modified _g3wf_runshifter to improve the help message and change the default version to 17.

I noticed the perfstat info is very wrong (20k instead of 300G instructions, etc.) for most of my tasks in batch, e.g. runs 1497 and 1500. This was not the case before the perf stat changes in this development series. I modified the task wrapper to use perfstat1 (perf stat command) instead of perfstat2 (perf stat -p PID) and started another job (1502) to see if that fixes the problem. I also put in a NERSC ticket: INC0214404.

Running perf stat as a command wrapper (run 1502) did resolve the preceding problem.
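
For reference, the two modes sketched in Python (illustration only; com stands in for the task command):

import signal
import subprocess

com = ["sleep", "5"]  # stand-in for the pipetask command

# perfstat1: perf stat as a prefix of the run command.
subprocess.run(["perf", "stat", "-d", *com])

# perfstat2: fork the command, then attach perf stat to its PID.
proc = subprocess.Popen(com)
perf = subprocess.Popen(["perf", "stat", "-d", "-p", str(proc.pid)])
proc.wait()
perf.send_signal(signal.SIGINT)  # perf prints its counters on SIGINT
perf.wait()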

I see most of the iotrace logs are empty except for the header. The output has this message:

strace: Process 2263952 attached
2263952 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2264454, si_uid=96594, si_status=0, si_utime=4641, si_stime=305} ---
new pid 2263952 process bash -c time /opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-7.0.1/Linux64/ctrl_mpexec/gef1ac55fff+9dabee4918/bin/pipetask --long-log --log-level=VERBOSE run-qbb /global/cfs/cdirs/lsst/production/gen3/DC2/Run2.2i/repo /pscratch/sd/d/dladams/descprod-out/jobs/job001503/submit/u/dladams/ccd/20240108T152445Z/u_dladams_ccd_20240108T152445Z.qgraph --qgraph-node-id fd9294f6-f9ba-4e14-a9d4-b119cca55103 Scanning process 2263952 table for files and sockets
Error: couldn't find file write details, file num  2
Error: couldn't find file write details, file num  2
Error: couldn't find file write details, file num  2
Error: couldn't find file write details, file num  2
2263952 +++ exited with 0 +++
Finished monitoring process

Maybe prepending time causes problems??

Modify runapp so that no monitoring or tracing is run by default. Also, the options will set only one trace.

Add a log message of the task output space used, the rate, and the free space. Also add a monitor in the main directory that writes the task space and free space to task-output-size.log. Both run for each iteration in the job loop, i.e. every 10 sec.
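
A minimal sketch of such a monitor (names illustrative; sizes from a directory walk, free space from shutil):

import os
import shutil
import time

def output_size_gib(path):
    # Total size in GiB of all files under path.
    nbyte = sum(os.path.getsize(os.path.join(dp, fn))
                for dp, _, fns in os.walk(path) for fn in fns)
    return nbyte / 2**30

def report_output(path, size0, time0):
    # Emit one line like the "Task output size" messages in the job log.
    size = output_size_gib(path)
    rate = (size - size0) / max(time.time() - time0, 1.0)
    avail = shutil.disk_usage(path).free / 2**30
    print(f"Task output size: {size:10.3f} GiB, rate: {rate:7.3f} GiB/sec,"
          f" available: {avail:.0f} GiB on {path}")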

dladams commented 8 months ago

By default the wrapper runs time only: no strace, iotrace, perfstat, or any of the monitors. All but the last can be added as config options. Adding that is a todo.

Version is set to 0.0.47.

dladams commented 8 months ago

For 0.0.48, modify runapp to create plot-title.txt. Version 17.16 of monexp.py takes its plot titles from that file if present.

dladams commented 8 months ago

For 0.0.49, fix tracking of failed tasks.

Job 1596 (stage 2 of g3wfpipe | w2344-visit:250:260-pipe:ccd-init-proc | shifter:17-wq:270-bproc:90-tmax:10m:89m) had final status "Done. Finished 0 of 2985 tasks.". There are many (55k) lines in the job log like these:

2024-01-16 11:19:06: WARNING: Unexpected task status: failed
2024-01-16 11:19:06: WARNING: Unexpected task status: dep_fail

and near the end of the log, I see:

2024-01-16 11:26:54: WARNING: Unexpected task status: dep_fail
2024-01-16 11:26:54: Finished 0 of 2985 tasks.
2024-01-16 11:26:55: Task output size:      0.001 GiB, rate:   0.000 GiB/sec, available: 8292476 GiB on /pscratch'
2024-01-16 11:26:55: Aborting job because state is not changing and no tasks are active.
2024-01-16 11:26:55: All steps completed.

I added 'failed' and 'dep_fail' as known task states and increment nfail for those.
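
Schematically, the counting now looks something like this (the full set of known states is an assumption beyond the two named above):

KNOWN_STATES = {"pending", "launched", "running", "exec_done",
                "failed", "dep_fail"}

def tally_states(statuses):
    ndone = nfail = 0
    for state in statuses:
        if state not in KNOWN_STATES:
            print(f"WARNING: Unexpected task status: {state}")
        elif state in ("failed", "dep_fail"):
            nfail += 1   # failed and dep_fail now count as failures
        elif state == "exec_done":
            ndone += 1
    return ndone, nfail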

Here is the log for one of the isr jobs:

--> executable follows <--
g3wf-run-task ${CTRL_MPEXEC_DIR}/bin/pipetask --long-log --log-level=VERBOSE run-qbb /global/cfs/cdirs/lsst/production/gen3/DC2/Run2.2i/repo /pscratch/sd/d/dladams/descprod-out/jobs/job001595/submit/u/dladams/ccd/20240116T174856Z/u_dladams_ccd_20240116T174856Z.qgraph --qgraph-node-id 069a16e2-fb01-4dd3-9635-eb270e879a11 && >&2 echo success || (>&2 echo failure; false)
--> end executable <--
/bin/bash: g3wf-run-task: command not found
failure
--> executable follows <--
g3wf-run-task ${CTRL_MPEXEC_DIR}/bin/pipetask --long-log --log-level=VERBOSE run-qbb /global/cfs/cdirs/lsst/production/gen3/DC2/Run2.2i/repo /pscratch/sd/d/dladams/descprod-out/jobs/job001595/submit/u/dladams/ccd/20240116T174856Z/u_dladams_ccd_20240116T174856Z.qgraph --qgraph-node-id 069a16e2-fb01-4dd3-9635-eb270e879a11 && >&2 echo success || (>&2 echo failure; false)
--> end executable <--
/bin/bash: g3wf-run-task: command not found
failure

I suspect the "command not found" happens because the WorkQueue worker does not have this on its path:

g3wfpipe-run: Location for g3wf-run-task: /pscratch/sd/d/dladams/descprod-out/jobs/job001596/local/desc-gen3-prod/bin/g3wf-run-task

dladams commented 8 months ago

To address the last, I added a task setup script task-setup.sh with the intention of sourcing it as part of the task prepend. That didn't work and turned out to be unnecessary: the problem was that the job setup script setup.sh used relative paths for the products installed in ./local, but the WorkQueue worker uses a dedicated directory for each task instead of running tasks from the job directory. I changed the setup to use full paths and can now run with WorkQueue.

The task wrapper will source task-setup.sh if it is found in the run directory, but no such file is currently generated.

dladams commented 8 months ago

Version 0.0.50 fixes notebooks.

The last version broke the copying of notebooks to the job dir.

I also created a tag for this version.

dladams commented 7 months ago

Version 0.0.51 brings in some PRs from Ben:

#9: updating ndcctools

#10: adding the option to use WorkQueue in coprocess mode

#11: adding the option to disable parsl monitoring

A couple small test jobs ran successfully with these mods.

dladams commented 7 months ago

For version 0.0.52, add filesystem monitor.

dladams commented 7 months ago

For 0.0.53

dladams commented 7 months ago

For 0.0.54, disable parsl resource monitoring by default, i.e. default is pmon:0.

dladams commented 7 months ago

For 0.0.55,

The number of concurrent task chains is limited in g3wfpipe-run.py by calling get_future on a maximum of maxact final uncompleted tasks.

The number of concurrently running starting tasks is limited by adding a dependency on a parsl task which completes when a counter in g3wfpipe-run.py exceeds a threshold.
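
A schematic sketch of the first mechanism (get_future is named above; the done attribute and the job list are illustrative assumptions):

def select_chains(final_jobs, maxact):
    # Request futures for at most maxact uncompleted endpoint tasks,
    # so no more than maxact task chains can be active at once.
    futures = []
    for job in final_jobs:
        if len(futures) >= maxact:
            break
        if not job.done:
            futures.append(job.get_future())
    return futures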

dladams commented 6 months ago

Now at 0.0.55.dev93.

There is now a TP howfig tp:NTHR:MAXCST:MAXACT where

Run 1903 shows this is not working as intended. With MAXACT=50, g3wfpipe reports

Finished 0 of 2985 tasks. 2785 pending. 117 launched. 83 running.

in its first few monitoring lines, and the recently updated monexp notebook shows 117 running isr tasks at the start of the job. The time stamps in the isr stderr logs agree with the latter value, i.e. 117 tasks start in the first few seconds.

So two problems: more chains start than MAXACT allows, and the reported running count disagrees with what is actually running.

I resolved the first problem by starting only the required number of chains and dropped the artificial prerequisite task, which was more complicated and not working.

I suspect the second problem is stale data; perhaps the monitoring DB is not being flushed or the ParslGraph cache is not up to date. I will not chase this down now. I created issue #12 to address this.

dladams commented 6 months ago

As shown in section 10 of the performance summary, the change in 0.0.55 did not speed up the processing. For 0.0.56, we try adding a delay between chain starts to desynchronize them. For more discussion and results, see the performance summary.