christopherwharrop / rocoto

Rocoto Workflow Management System

[bug] A serial metatask nested inside a parallel metatask is treated as parallel #109

Open WalterKolczynski-NOAA opened 1 month ago

WalterKolczynski-NOAA commented 1 month ago

What is wrong

When placing a serial metatask inside of a parallel metatask, all of the serial tasks will be queued at once when the dependency is met instead of running sequentially.

What should have happened

Each serial task should wait for the task before it in its sequence before being queued, while the sequences run independently in parallel. Only the first task in each sequence should start when the explicit dependencies are satisfied.

Schedulers impacted

Seen on both slurm and pbspro

Steps to reproduce

  1. Create a workflow with a serial metatask inside a parallel one. Here's a minimal testcase that can be modified:
    
    <?xml version="1.0"?>
    <!DOCTYPE workflow
    [
    <!ENTITY ACCOUNT "fv3-cpu">
    <!ENTITY QUEUE "batch">
    <!ENTITY PARTITION "hercules">
    <!ENTITY OUTDIR "/work2/noaa/stmp/wkolczyn/test">
    ]>
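    <!-- Attribute values below (scheduler, metatask modes, dependency) are reconstructed from the issue text and may differ from the original file. -->
    <workflow realtime="F" scheduler="slurm">
      <log>&OUTDIR;/rocoto.log</log>
      <cycledef>202103211200 202103231200 06:00:00</cycledef>

      <task name="test_pre">
        <command>echo "pre_task"; sleep 10</command>
        <jobname>test_pre</jobname>
        <account>&ACCOUNT;</account>
        <queue>&QUEUE;</queue>
        <partition>&PARTITION;</partition>
        <walltime>00:05:00</walltime>
        <nodes>1:ppn=1:tpp=1</nodes>
        <join>&OUTDIR;/test_pre.log</join>
      </task>

      <metatask mode="parallel">
        <var name="mem">00 01</var>
        <metatask mode="serial">
          <var name="seg">0 1</var>
          <task name="test_mem#mem#_seg#seg#">
            <command>echo "member: #mem# segment: #seg#"; sleep 180</command>
            <jobname>test_mem#mem#_seg#seg#</jobname>
            <account>&ACCOUNT;</account>
            <queue>&QUEUE;</queue>
            <partition>&PARTITION;</partition>
            <walltime>00:05:00</walltime>
            <nodes>1:ppn=1:tpp=1</nodes>
            <join>&OUTDIR;/test_mem#mem#_seg#seg#.log</join>
            <dependency>
              <taskdep task="test_pre"/>
            </dependency>
          </task>
        </metatask>
      </metatask>
    </workflow>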

  2. Run the workflow and observe that all of the tasks inside the metatask are queued at once after the first job completes.

Additional info

Encountered while trying to implement forecast segments for GEFS.

Switching the order of the metatasks (parallel inside of sequential) works correctly, but then none of the second serial tasks will run until all the first ones have completed.
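
As a rough sketch of that swapped nesting (an assumption about the layout, not the exact XML that was run): the `seg` metatask moves to the outside with `mode="serial"`, and the `mem` metatask moves inside with `mode="parallel"`. The task body is elided here and is assumed to stay the same as in the reproducer.

```xml
<!-- Sketch only: nesting swapped relative to the reproducer (serial outside, parallel inside). -->
<metatask mode="serial">
  <var name="seg">0 1</var>
  <metatask mode="parallel">
    <var name="mem">00 01</var>
    <task name="test_mem#mem#_seg#seg#">
      <!-- command, jobname, account, queue, partition, walltime, nodes, join,
           and dependency are assumed unchanged from the reproducer -->
    </task>
  </metatask>
</metatask>
```

With this layout the serial ordering applies to the inner metatask as a whole, so every `seg` 1 task waits for all of the `seg` 0 tasks. That matches the behavior described above, but it couples the members to each other instead of letting each member's sequence advance on its own.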

christopherwharrop-noaa commented 1 month ago

Thank you for the report @WalterKolczynski-NOAA. And thank you especially for the reproducer. That makes it a lot easier for me to drill down on the problem. I will investigate and see if I can figure out what is going on. The reported behavior is definitely incorrect. I can test this myself, but I'm assuming the behavior is the same when the outer metatask leaves the mode unspecified (the default is parallel).

WalterKolczynski-NOAA commented 1 month ago

Yes. I believe I tried both implicit and explicit parallel just in case.

WalterKolczynski-NOAA commented 1 month ago

Oops, I thought I had changed the queue back to batch before I submitted. If you use the debug queue, it will be harder to see the problem because there is a two-job at-a-time limit. Fixed now.