flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Flux Emulator Revival Questions #6466

Open washwor1 opened 1 day ago

washwor1 commented 1 day ago

Hello, I am currently working on getting the Flux emulator (originally simulator) from @SteVwonder (#2561) up and running with the latest version of Flux core. I got the code working with the old Flux core version and am now working on merging the code into the newest version of core. While I am doing that, I figured I would ask a few questions. I discussed these issues with @grondo and @wihobbs at the Flux coffee hour on Nov 22, and they suggested opening an issue so @garlick and others could weigh in. Here are the questions:

  1. Looking at the code from #2561, are there any sections that look like they will need to be completely rewritten because of changes in Flux since the original development? From what I can tell, most things are fairly decoupled and only minor modifications should be needed. However, I am fairly new to Flux, so maybe I am missing something.

  2. The original code is missing the ability to handle jobs that are unsatisfiable (Line 259 in flux_simulator.py hangs). What would be the recommended method or tool for implementing this? At the coffee hour meeting, the suggestion was to either use a jobtap plugin or wait on cancel exceptions through RPC (or a combination of the two).

  3. Are there any other important features that would be useful to implement for users of the finished emulator? I currently want to expand the post-sim analysis a bit and make sure that job timeouts work properly.
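
For question 2, the "wait on exceptions" option could be sketched roughly as follows. This is only an illustration, not code from #2561: it assumes job eventlog entries are available as one JSON object per line with a `name` and an optional `context`, and that a fatal exception carries `severity` 0. The function name `classify_job` and the sample entries (including the note text) are made up for the example.

```python
import json

# Assumption: a job exception with severity 0 is fatal to the job.
FATAL_SEVERITY = 0


def classify_job(eventlog_lines):
    """Classify a job from its eventlog lines (hypothetical helper).

    Returns a (status, note) tuple where status is one of:
      "started": a start event was seen, so the emulator can proceed
      "failed":  a fatal exception was seen before any start event
      "pending": neither was seen yet, so keep waiting
    """
    for line in eventlog_lines:
        entry = json.loads(line)
        name = entry.get("name")
        if name == "start":
            return "started", None
        if name == "exception":
            context = entry.get("context", {})
            if context.get("severity") == FATAL_SEVERITY:
                return "failed", context.get("note", "")
    return "pending", None


# Made-up sample entries for an unsatisfiable job.
sample = [
    '{"timestamp": 1.0, "name": "submit", "context": {"userid": 1000}}',
    '{"timestamp": 1.1, "name": "exception", "context": '
    '{"type": "alloc", "severity": 0, "note": "unsatisfiable request"}}',
]
status, note = classify_job(sample)
```

With something like this, the emulator could stop waiting for a start event once a job is classified as "failed" instead of hanging.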

Thank you for the advice.

garlick commented 1 day ago

Great that you're trying to push this forward! The simulator has a lot of exciting possibilities for enabling scheduling research (the Slurm simulator seems to make regular appearances) and helping us understand and improve Flux's scheduler, write test cases, etc.

Have you gotten the old branch working and are you able to run simulations? Doing this and maybe taking a stab at a draft description of how it works currently might be a helpful starting point for reviewing the approach in the context of today's Flux.

It will probably be a bit annoying to forward port after 5 years of flux development, but I'm not sure anything substantial has really changed in the interfaces between the job manager, the exec system, and the scheduler. There will be lots of little changes though.

trws commented 1 day ago

One thing I recall from our discussions when this was getting started is that this code came after the work to port the simulator to the new exec system, so it may not be too bad. I imagine a lot of it is going to be things like handling unsatisfiable jobs or other states that didn't exist yet but need to be factored in.

As for 3, I'd say probably yes but don't worry about that yet. We'll have to see how the whole thing ties together to get an idea of what "just works" because of how it's implemented and what we'll want to be able to tweak.

trws commented 1 day ago

Looking it over, if the calls from job-manager can become a jobtap plugin (not sure, but it looks like it at first glance), that would definitely help. The main thing that might need some thought is how to define "busy" and "quiescent" callbacks in fluxion after everything went asynchronous. It's not quite as easy as it used to be (we used to just process everything in order, so when the callback ran, it was time), but now it will have to detect when the sched loop has no further work to do. That's actually possible, since it puts the event loop to sleep in that case until something comes in, but it's a bit more work. There are also some strange states we didn't really have before, like there being jobs that are satisfiable but not reservable.
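
To make the "busy"/"quiescent" idea concrete, here is a minimal, self-contained sketch of a simulated clock that only advances once every component has reported that it has no further work. It does not use any Flux or fluxion API; the `Emulator` class and its `busy`, `quiescent`, `schedule`, and `step` methods are hypothetical names chosen for illustration.

```python
import heapq


class Emulator:
    """Toy discrete-event clock gated on component quiescence."""

    def __init__(self):
        self.now = 0.0
        self._events = []   # min-heap of (time, seq, callback)
        self._seq = 0       # tie-breaker so callbacks are never compared
        self._busy = set()  # names of components that still have work

    def schedule(self, delay, callback):
        """Queue callback to run `delay` simulated seconds from now."""
        heapq.heappush(self._events, (self.now + delay, self._seq, callback))
        self._seq += 1

    def busy(self, name):
        """A component reports that it has started processing."""
        self._busy.add(name)

    def quiescent(self, name):
        """A component reports that it has no further work to do."""
        self._busy.discard(name)

    def step(self):
        """Jump to the next event only if all components are quiescent."""
        if self._busy or not self._events:
            return False
        self.now, _, callback = heapq.heappop(self._events)
        callback(self)
        return True


emu = Emulator()
log = []
emu.schedule(5.0, lambda e: log.append(("job-complete", e.now)))

emu.busy("sched")            # scheduler is still working through its queue
blocked = emu.step()         # clock must not advance yet
emu.quiescent("sched")       # sched loop went to sleep, nothing left to do
advanced = emu.step()        # now simulated time can jump to t=5.0
```

The hard part described above is not this bookkeeping but deciding when the scheduler should call the equivalent of `quiescent`, now that requests are handled asynchronously.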