GreenScheduler / cats

CATS: the Climate-Aware Task Scheduler :cat2: :tiger2: :leopard:
https://greenscheduler.github.io/cats/
MIT License

Refactor the code for more modularity + self-explanatory function names #26

Closed Llannelongue closed 1 year ago

Llannelongue commented 1 year ago

Looking at the code, I think it would be worth refactoring it to clean up a bit after the hack-day rush and keep each module separate: in particular it would make it easier to address communication between functions, as raised by #25, and to facilitate future expansion to new countries by replacing the API part for #22. It would also facilitate asynchronous tasks when some modules need to be bypassed.

An example of what can be confusing at the moment: tracking back where the optimal start time is computed. In __init__.py, the optimal time to run the job is provided by writecsv from the parsedata.py file (not immediately clear from the name). writecsv itself calls cat_converter from timeseries_conversion, which in turn calls the function get_lowest_carbon_intensity. The last function makes sense, but the steps in between would be really difficult to guess!

My suggestion is that each component becomes a separate class in a separate file; for now these are:

And each component is called directly from the main function, something like:

```python
def main(arguments=None):
    parser = parse_arguments()
    args = parser.parse_args(arguments)

    # ... some stuff about config file

    CI = APIcarbonIntensity_UK(args).getCI()

    optimal_time = runtimeOptimiser(args).findRuntime(CI)

    print(f"Best job start time: {optimal_time['start_time']}")

    if carbonFootprint:
        estim = greenAlgorithmsCalculator(...).get_footprint()
        print(f"carbon footprint: {estim}")
```
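For concreteness, here is a minimal runnable sketch of that shape, reusing the class and method names from the snippet above. The `CarbonIntensityEstimate` container, the constructor arguments and the CI values are all made up for illustration; this is not the actual cats API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CarbonIntensityEstimate:
    # Hypothetical container: forecast carbon intensity (gCO2/kWh) at a time.
    datetime: datetime
    value: float


class APIcarbonIntensity_UK:
    """Placeholder for the module wrapping the carbon intensity API."""

    def __init__(self, postcode):
        self.postcode = postcode

    def getCI(self):
        # The real class would query the API; here we return fake data
        # at 30-minute intervals.
        now = datetime(2023, 3, 1, 12, 0)
        fake_values = [250.0, 180.0, 120.0, 160.0]
        return [
            CarbonIntensityEstimate(now + timedelta(minutes=30 * i), v)
            for i, v in enumerate(fake_values)
        ]


class runtimeOptimiser:
    """Placeholder for the module choosing the greenest start time."""

    def findRuntime(self, CI):
        # Pick the forecast point with the lowest carbon intensity.
        best = min(CI, key=lambda e: e.value)
        return {"start_time": best.datetime, "carbon_intensity": best.value}


CI = APIcarbonIntensity_UK("OX1").getCI()
optimal_time = runtimeOptimiser().findRuntime(CI)
print(f"Best job start time: {optimal_time['start_time']}")
```

The point of the split is that `runtimeOptimiser` never knows where the CI data came from, so it can be fed from any source, not just the UK API.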

I'm keen to start working on a branch in this direction, but would like to hear people's thoughts on that as I've probably forgotten some aspects! :smiley:

andreww commented 1 year ago

I like it. I think a useful way to think about the design is probably to consider different consumers of the (3) classes / modules. How would this look if we wanted to use the same code to post-process some SLURM logs, for example? (I guess that is a good argument for separating the CI API stuff from the calculation of the optimum run time). What other example use cases could we plan for?

tlestang commented 1 year ago

Sounds good! I agree that a lot of the names are now quite misleading and there is a risk of us getting too comfortable with them! :) But obviously naming is hard and this was perfectly acceptable in the context of the hack day. I'm thinking that fine-grained changes like renamings can happen progressively as we add missing features and reshuffle the overall design a bit.

Llannelongue commented 1 year ago

Yes, completely. Actually, the names of individual functions are clear enough taken in isolation :) (writecsv does indeed write out the carbon intensities), but I think things got a bit muddled when integrating all the modules together at the last minute. Not a big deal overall.

andreww commented 1 year ago

Mostly a note so I don't forget. Our current output time format isn't accepted by at on my laptop (a mac). The fix is to change print(f"Best job start time: {starttime}") to print(f"{starttime:%m%d%H%M}") in __init__.py. I also needed a slightly different invocation than we have in the readme:

```shell
echo "I'm running" | at -t `python -m cats -d 10 --loc OX1`
```

Anyway, I should look at this a bit more and turn it into sensible ideas. Maybe we need multiple executable python files installed into bin, and tests for this stuff.
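For reference, at -t expects a [[CC]YY]MMDDhhmm timestamp, which strftime-style formatting produces directly. A quick check (the start time here is made up):

```python
from datetime import datetime

# Hypothetical optimal start time as returned by cats
starttime = datetime(2023, 3, 1, 14, 30)

# The fix above drops the year entirely: month, day, hour, minute only
print(f"{starttime:%m%d%H%M}")

# An unambiguous variant that keeps the year, also accepted by at -t
print(f"{starttime:%Y%m%d%H%M}")
```

Including the year (`%Y%m%d%H%M`) might be safer if a forecast window ever crosses New Year.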

tlestang commented 1 year ago

Hi - as discussed today during our meeting there are currently a few PRs open in this direction. I've drawn a little sketch of the high-level structure (I think) we are converging to and where these PRs stand with regard to it.

[sketch: the three modules and where the open PRs sit relative to them]

Three modules (square nodes). I've kept the current file and function names, although they will likely change for clarity.

The thing I'd like to do now is have get_tuple return a list of CarbonIntensityPointEstimate and find_time return a tuple of CarbonIntensityTotalEstimate. I think that'll complete the picture.
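A sketch of what those two containers might look like as dataclasses; the field names and the ordering trick are my guesses, not the merged implementation:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(order=True)
class CarbonIntensityPointEstimate:
    """Forecast carbon intensity (gCO2/kWh) at a single point in time."""

    value: float  # listed first so instances compare by intensity
    datetime: datetime


@dataclass
class CarbonIntensityTotalEstimate:
    """Integrated carbon intensity over a candidate job window."""

    start: datetime
    end: datetime
    value: float


points = [
    CarbonIntensityPointEstimate(250.0, datetime(2023, 3, 1, 12, 0)),
    CarbonIntensityPointEstimate(120.0, datetime(2023, 3, 1, 12, 30)),
]
# With order=True and value first, the greenest point is just min()
print(min(points).value)
```

Typed containers like these would also pin down the interface between the API module and the optimiser, which is the communication problem #25 raises.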

@Llannelongue if you're keen to help with the clean-up, there is plenty to do in terms of renaming, getting rid of obsolete functions and potentially reshuffling the location of some others. For instance, since #30 I don't think there is any reason to write the forecast data as a csv on disk, which would make most if not all of parsedata.py obsolete.

tlestang commented 1 year ago

> The thing I'd like to do now is implement the returned data of both get_tuple and find_time module as list of CarbonIntensityPointEstimate and tuple of CarbonIntensityTotalEstimate. I think that'll complete the picture.

#41 does the first bit.

tlestang commented 1 year ago

#31 #37 #41 #36 #41 #49 #50 #21