LLNL / ATS

ATS - the Automated Testing System - is an open-source, Python-based tool for automating the running of an application's tests across a broad range of high-performance computers.
BSD 3-Clause "New" or "Revised" License

Add ability to run a test on the same node as a previous test #103

Closed white238 closed 12 months ago

white238 commented 1 year ago

It would be very useful to add the ability to run a follow-up command on the same node if the test passed. For example, I have a test that outputs files whose answers I want to check. Unfortunately, if I do the following:

foo_test = test(executable="foo")  # outputs a file filled with answers
testif(foo_test, executable="foo_checker.py")  # checks the file output by foo for correctness

The foo_checker.py test runs on a completely different node from foo, and the file is not there yet. So my team (and others I have spoken to) have to add a sleep(60) to the start of our checkers.
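
For reference, the workaround checker looks roughly like this (a minimal sketch; the output file name is a placeholder):

# foo_checker.py -- current workaround: wait a fixed amount of time in the
# hope that foo's output has appeared before checking the answers in it.
import sys
import time

time.sleep(60)  # the sleep we would like to get rid of

with open("foo_output.txt") as f:  # placeholder name for the file foo writes
    answers = f.read()

# ... compare answers against the expected values and exit non-zero on a mismatch ...
sys.exit(0)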

This could be avoided if they ran on the same node. Something like:

test(executable="foo", checker="foo_checker.py")  # runs the executable, then runs the checker only if foo exits successfully

or:

foo_test = test(executable="foo")
testif(foo_test, executable="foo_checker.py", same_node=True)

Edit: another idea after talking to @dawson6:

# command_group may be a better name than wrapper
my_wrapper = wrapper(executable=foo)
my_wrapper.add(executable=checker)
test(my_wrapper)

dawson6 commented 1 year ago

That is a good idea @white238. We have a new person, @MishaZakharchanka, joining the ATS project; it may be something I can have him look at in a bit. These types of new features may be something we want to implement in 'flux' first (even if running flux under an salloc) so we don't have to figure it out separately for slurm, lsf, and flux.

dawson6 commented 1 year ago

This is an outstanding issue. It may be time to revisit it with the flux module to see if we can make progress here.

MishaZakharchanka commented 1 year ago

@white238 We are finally revisiting this issue and we have a promising direction with a flux option. This will mean that running on the same node will be limited to flux and slurm (or flux under slurm) at least for now.

So far we have thought of three possible implementations of this feature, but we are open to discussing other ideas.

  1. Option to run a test on node = X.
  2. Option to run a test on the same node as test X.
  3. Option to run an ATS group on the same node.

Let us know if you have a preference for any of these approaches or had a different idea for how this might work.
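
For concreteness, here is a rough sketch of what each option might look like in a test file (all of these keyword arguments are hypothetical; none of them exist in ATS today):

# Hypothetical syntax only -- illustrating the three options above.

# 1. Run a test on node X (relative to the allocation).
t1 = test(executable="foo", node=0)

# 2. Run a test on the same node as test X.
t2 = testif(t1, executable="foo_checker.py", same_node_as=t1)

# 3. Run an entire ATS group on one node, so every member shares it.
g = group(same_node=True)
g.add(test(executable="foo"))
g.add(test(executable="foo_checker.py"))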

white238 commented 1 year ago

I don't personally care which specific nodes they run on... the pattern is described above, but basically I need my checker (which in this example is a testif) to run on the same node as the previous test. Maybe a parameter called same_node_as_parent or something?

I don't know what an ATS group is.

jwhite242 commented 1 year ago

I think you can accomplish this without having to do convoluted things like running flux inside slurm, etc. See the --nodelist= option you can pass to srun.

I also believe our test suite uses test groups to deal with this, though I think we still have a few cases where this is a problem (I haven't been able to track down said rumor yet, though).
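
For example, something along these lines (an untested sketch; the node name and executables are placeholders) pins a test and its checker to one node inside an existing slurm allocation:

# Untested sketch: launch the test and its checker as two srun steps pinned to
# the same node by passing the same --nodelist to both.
import subprocess

node = "nodeA"  # placeholder; e.g. the first host expanded from $SLURM_JOB_NODELIST
common = ["srun", "-N1", "-n1", f"--nodelist={node}"]

subprocess.run(common + ["./foo"], check=True)                     # the test
subprocess.run(common + ["python", "foo_checker.py"], check=True)  # its checker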

dawson6 commented 1 year ago

   -w, --nodelist={<node_name_list>|<filename>}
          Request a specific list of hosts. The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements.
          The list may be specified as a comma-separated list of hosts, a range of hosts (host[1-5,7,...] for example), or a filename. The host list will be
          assumed to be a filename if it contains a "/" character. If you specify a minimum node or processor count larger than can be satisfied by the
          supplied host list, additional resources will be allocated on other nodes as needed. Rather than repeating a host name multiple times, an asterisk
          and a repetition count may be appended to a host name. For example, "host1,host1" and "host1*2" are equivalent. If the number of tasks is given and
          a list of requested nodes is also given, the number of nodes used from that list will be reduced to match the number of tasks if the number of
          nodes in the list is greater than the number of tasks. This option applies to job and step allocations.

dawson6 commented 1 year ago

So, Jeremy, do you have an example of what that looks like with slurm? Slurm has issues, so I didn't want to spend a lot of time forcing slurm to do the right thing. But if this works, we could look at it.

That is, we don't want to use absolute hardware node names, for instance; that is too burdensome. But if we come up with a syntax for the interface, we can start with flux, then see if ATS could use slurm directly as well. Due to manpower issues, if we get flux working, then just using 'flux under salloc' means less maintenance in the future.

dawson6 commented 1 year ago

Chris, if you can drop an example of what your ATS test lines look like for your 'parent' example, we can play with that. That is, we have examples like so:

ATS:t1=atswrapper(deck=SELF, level=10, np=10)

ATS:t11=testif(t1, executable=run_silodiff, level=10, nn=1, np=1, checker_test=False)

Here t11 depends on t1 finishing first, then runs a silodiff on output created from t1. But I am not sure what your syntax is, Chris.

For the above, we would have to demarcate both t1 and t11 somehow, as we need to put t1 on a specific node and then put t11 on that same node, rather than query after the fact which node t1 ran on. We don't want to assign tests to nodes in general, but rather allow flux or slurm to do the scheduling; for this use case, though, we make an exception.

dawson6 commented 1 year ago

For the simplest solution, which maps to flux usage, we could do something like the following, where we specify that both t1 and t11 run on node 0 (relative to the allocation of nodes). That does put the burden on the user to specify the node, but it could likely be used by both flux and slurm (via nodelist), if that works with slurm, with ATS mapping '0' to a specific hardware node in the allocation.

ATS:t1=atswrapper(deck=SELF, level=10, np=10, node_rank=0)

ATS:t11=testif(t1, executable=run_silodiff, level=10, nn=1, np=1, checker_test=False, node_rank=0)

Or we could do something more abstract:

ATS:t1=atswrapper(deck=SELF, level=10, np=10, same_node=A)

ATS:t11=testif(t1, executable=run_silodiff, level=10, nn=1, np=1, checker_test=False, same_node=A)

Here A is just a string used to mean 'I don't care which node, but put all tests with same_node=A on the same node.' ATS then assigns A to some node in the allocation as it deems fit.

Or we could say

ATS:t1=atswrapper(deck=SELF, level=10, np=10, parent_node)

ATS:t11=testif(t1, executable=run_silodiff, level=10, nn=1, np=1, checker_test=False, same_node_as_parent)

But that will be more complicated to implement, as ATS now has to put in logic to determine which job is the parent of a given job and what node it was assigned. It's doable, but it is more work and more room for bad logic.

Just some thoughts.
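
As a rough sketch of the same_node=A idea, the bookkeeping on the ATS side could be as simple as a map from label to an assigned node (hypothetical helper, not existing ATS code):

# Hypothetical sketch: the first test seen with a given same_node label claims
# a node, and every later test with that label reuses it.
_label_to_node = {}

def resolve_node(label, available_nodes):
    """Return the node assigned to this same_node label, claiming one if needed."""
    if label not in _label_to_node:
        # ATS picks any node it deems fit; here, simply the first available one.
        _label_to_node[label] = available_nodes[0]
    return _label_to_node[label]

# Both t1 and t11 declared with same_node=A end up on the same node.
nodes = ["node0", "node1", "node2"]
print(resolve_node("A", nodes))  # -> node0
print(resolve_node("A", nodes))  # -> node0 again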