GreenScheduler / cats

CATS: the Climate-Aware Task Scheduler :cat2: :tiger2: :leopard:
https://greenscheduler.github.io/cats/
MIT License
50 stars 8 forks source link

Refactor code (and some other changes) #43

Closed Llannelongue closed 1 year ago

Llannelongue commented 1 year ago

This is following discussions on #26 and during the meeting that now that we are going toward a stable version, it would be worth refactoring the code to make it a bit more future-proof.

This is most likely not final at all, and it'd be great to hear everyone's thoughts to refine it. And also to check that it still works as expected (in particular if I have inadvertently deleted something important!). And we can open dedicated issues for anything arising for discussions here.

Here is a schema of the draft new structure:

cats_refactored drawio

The guiding principles in refactoring were to try to:

As a reminder, the main steps of CATS are:

  1. Read data/information from the command line and the config file
  2. Pull forecast data from a carbon intensity (CI) API
  3. Find the window of time (start time + end time) that would minimise average carbon intensity within the known forecast
  4. (optional) Estimate the total energy used and carbon footprint by starting or not at the best time

An overview of some changes beyond structure

I aimed at keeping the functionalities as they were as much as possible, just trying to enhance when I could. Here are some changed:

Carbon intensity forecast loaded for between 29/05/2023-20:30 UTC and 31/05/2023-20:30 UTC.

Best start time: 31/05/2023-12:00 UTC (in 39.1 hours) Expected end time: 31/05/2023-17:50 UTC Expected average carbon intensity: 70.03 gCO2e/kWh

Estimated carbon footprint of running job at best time: 10.68 gCO2e vs running it now: 13.26 gCO2e (2.58 gCO2e saved) vs running at worst time (31/05/2023-01:00 UTC): 17.54 gCO2e (6.87 gCO2e saved)

How this release addresses existing open issues

Open questions:

  1. Why some things that are not errors are being printed on stderr and others on stdout? As a temporary measure, all updates/messages are sent to stdout and only errors/warnings to stderr, but would be great to hear some thoughts on that.
  2. I suspect I may have broken the pipeline with the at linux scheduler with the different output, any thoughts on the best way to pass information on start time?

What is still needed

Not necessarily before merging, but here is a list non-exhaustive of what still needs to be done (the idea of this PR is to discuss this new structure while polishing the details).

Llannelongue commented 1 year ago

Thanks for these @tlestang

I don't see the benefit of introducing a new plain class for the API interface, compared to the currently implemented APIInterface namedtuple. It provides the same access to the request url and response parsing functions through dot notation and a lot less code.

More on this below after the second comment, but regarding class vs namedtuple, I don't mind as long as it's easy to switch between APIs (so we would need a master function to pick the right namedtuple in case it's not provided directly by the user). I can draft a new commit reversing to namedtuple.

More problematic to me is that the API-specific logic of forming the query URL and parsing the response is coupled back to the API interface code. Somebody wishing to interface cats with a new web API would have to modify the implementation of CI_API_interface itself, adding a new statememt if self.choice_CI_API == 'myapi.org' and hardcoding API-specific code in there. I feel like this is a step back from the current implementation (introduced in #37) in which interfacing against a different API is only a matter of providing two functions (possibly grouping them into a new APIInterface object) without having to touch the code of cats itself at any point.

This is an interesting point, and probably needs more thinking. However not sure about not having to touch the code, I see two different use cases here:

  1. We or other contributors will want to add other CI APIs (e.g. for other countries), and we ideally want to make them part of CATS so that these new APIs are available to the whole community. In this case, it would be good to have all the URL/parsing codes in the same place and api_inferface is a good place for that (it also makes it easier to add things by copy-pasting). And in terms of how much hassle it is to add it, it's equivalent now and with the new code (api_interface needs to be modified either way, and current code requires messing with __init__.py as well), but the existing code doesn't allow user to easily pick an API, this is what the new argument --api-carbonintensity introduces.

  2. Second use case is if users want to pass their own API wrapper directly to CATS without having to modify the code. And in this case I agree, it would be good to make it possible in an easier way. But how would that work in practice? It would be good to have an idea of how the user would do it if we want to implement it.

I opened issue #44 to have a space to discuss (2) further.

Lastly, do we really want to rename api_query.py and api_interface.py ?

The motivation behind that is to clarify that the code in there is related to pulling carbon intensities, in case more APIs for other parts of the code are included later. But at the moment, this is only future-proofing so I don't mind removing the CI_ in the names if it's an issue for now.

Llannelongue commented 1 year ago

Following up on that, commit 8c927c7 reintroduces namedtuple for the api interfaces.

tlestang commented 1 year ago

We or other contributors will want to add other CI APIs (e.g. for other countries), and we ideally want to make them part of CATS so that these new APIs are available to the whole community. In this case, it would be good to have all the URL/parsing codes in the same place and api_inferface is a good place for that (it also makes it easier to add things by copy-pasting).

Agreed 100%, and that's the motivation behind #37. But the separation between the API query code and the API specific bits wasn't done with a particular usage of cats in mind - its generally useful for developers adding new APIs because they don't have to touch the api query module. Turns out its also useful for users playing with cats as a library as they can actually bring they own functions. Sure, the right default API interface must be selected in the init script for now but that's about it. Could as well be specified at the command line.

andreww commented 1 year ago

I'm not sure exactly where we are with this one - does it make sense to discuss at the meeting on Wednesday. One possibility would be to try and chop the changes into smaller sets but I think some wider discussion around the architecture could be useful.

My two cents is that a useful approach to thinking about this is to consider other future uses of (maybe bits of) the code base and think about how they would work. So, for cats, I imagine a user who has some big python package providing a service that runs some regular processing, and it's the processing that they want to schedule (no need to integrate with a scheduler). Or, somebody wants to post-process job scheduler logs.

Some (partial) answers to the questions above:

Why some things that are not errors are being printed on stderr and others on stdout? As a temporary measure, all updates/messages are sent to stdout and only errors/warnings to stderr, but would be great to hear some thoughts on that.

The way the python currently interfaces with at is via a shell backtick expansion - stuff that comes out of stdout from cats goes into an argument to at and (from the point of view of at) everything else is ignored and goes to the terminal. We could sensibly use python's subprocess module to do this instead. This would give us more control but could limit what the user could do.

tlestang commented 1 year ago

I'm closing this for now as we discussed it won't be merged as-in. But let's keep the branch around for reference.