Open ltalirz opened 3 years ago
Thanks @ltalirz, I'll play around with this a bit next week.
@ltalirz I've thought a little bit about how / why the monkeypatching works. While I'm not 100% sure about it, I think it only works because whatever runs the calculation doesn't have a direct reference to the `submit_calculation` function before the monkeypatch is applied. That is, it either imports it after the monkeypatch is applied, or calls it as `execmanager.submit_calculation`.
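This late-binding behavior can be illustrated with a small self-contained sketch (toy module and function names, not the actual AiiDA code):

```python
# Toy illustration (not actual AiiDA code) of why the monkeypatch only
# reaches call sites that look the function up at call time.
import types

# Stand-in for the execmanager module
execmanager = types.ModuleType("execmanager")
execmanager.submit_calculation = lambda: "real submit"

# A caller that grabbed a direct reference BEFORE the patch:
direct_reference = execmanager.submit_calculation

# Apply the monkeypatch:
execmanager.submit_calculation = lambda: "mocked submit"

print(direct_reference())                # still "real submit"
print(execmanager.submit_calculation())  # "mocked submit"
```

A caller that writes `execmanager.submit_calculation()` resolves the attribute at call time and therefore sees the patch; a caller holding `direct_reference` does not.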
A related and possibly more important point is that starting a separate Python process means the monkeypatch is gone. As such, we would have to hook into the daemon startup in a more intrusive way to get `submit` (instead of `run`) to work, I think.
Right, that would be a limitation.
I personally always use `run` in testing, though - not sure whether there are really use cases where you want to `submit` in a test?
And I would stress that getting better errors for crashes due to a missing executable, etc. would be a big time saver for me in working with mock-code - currently, when something crashes, my workflow is to `import pdb; pdb.set_trace()` in the test and inspect `node.get_remote_workdir()`.
I think in many cases, one could avoid this by raising an appropriate exception - either at the submit stage (if a necessary code is missing) or at the parsing stage (e.g. by patching in some additional checks)
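One way such a fail-fast check at the submit stage could look (illustrative names only, this is not the actual aiida-testing API):

```python
class MockCodeError(RuntimeError):
    """Hypothetical error raised when the mock layer can neither find a
    cached result nor fall back to a real executable."""


def patched_submit(cache_hit: bool, fallback_executable=None):
    # Illustrative stand-in for the patched submit step: raise a clear
    # error immediately instead of letting the calculation crash later.
    if not cache_hit and fallback_executable is None:
        raise MockCodeError(
            "No cached result found and no fallback executable configured."
        )
    return "submitted"
```

The point is that the user sees one explicit message at submission time, rather than having to trace a missing-output failure back through the parser.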
> not sure whether there are really use cases where you want to `submit` in a test
I've used `submit` in tests before, to find cases where an object in the `ctx` cannot be (de-)serialized. That applies only to workflows though. Of course one could use the `mock_code` also for the calculations in a workflow, but in principle `export_cache` is more suited for that.
So if we can solve the other problems, maybe monkeypatching is still the better way to go.
> All but one test passes - I suspect this is due to the hash having changed.
I was actually a bit puzzled why the tests passed, because the hash produced by `get_hash()` in the monkeypatched version is wrong AFAICT - it uses the content of the CWD, which in that case is just wherever the tests are launched from.
The reason this still works is, I think, because the `mock-code` CLI is still executed when the cache is not found. In the end, I think we want to get rid of that.
Maybe there's also a flaw in our testing hidden somewhere in here.
@ltalirz I think the next bigger thing to do here now is figuring out why the exceptions don't work as expected. I'm not totally convinced yet that it's not due to the exponential backoff mechanism. At the very least, when I simply try raising an error it does end up in the `verdi process report` multiple times. Maybe there's still some event loop related weirdness when the backoff mechanism is off, though.
Could you share how exactly you disabled the exponential backoff in `aiida-core`, so that I can poke around a bit?
Side note: It would be useful if we implemented a `--wait-before-wipe` option in the `aiida-core` fixtures, that prompts before wiping the temporary directory.
> Side note: It would be useful if we implemented a `--wait-before-wipe` option in the `aiida-core` fixtures, that prompts before wiping the temporary directory.
That is a great suggestion, and perhaps quite straightforward to implement (?) Happy to review a PR for it :-)
> Could you share how exactly you disabled the exponential backoff in `aiida-core`, so that I can poke around a bit?
Nothing fancy, I just hardcoded `max_attempts` to 1 here: https://github.com/aiidateam/aiida-core/blob/68f840da709d796a81b8458171f34b2c13fbb784/aiida/engine/utils.py#L169
> That is a great suggestion, and perhaps quite straightforward to implement (?) Happy to review a PR for it :-)
Yep, I already have it implemented (well, the reverse option) in `aiida-pytest`. Might need some polishing because that code is quite old by now. Suggestions for the exact naming of the flag are welcome.
> Nothing fancy, I just hardcoded `max_attempts` to 1 here:
Ah, I think the problem might then be that the process goes into the `paused` state when the transport task fails repeatedly, instead of excepting? I think the workaround will be to raise an exception class that isn't caught.. if it exists. Maybe we need to monkeypatch AiiDA again for that otherwise.
By raising an exception derived directly from `BaseException` (against explicit advice from the Python docs) I was able to get around the protections here. The first exception raised this way shows up nicely in the pytest log.
However, then the subsequent test runs into `RuntimeError: IOLoop is already running`.
Maybe @sphuber knows how to cleanly put a calculation into excepted state from `execmanager.submit_calculation` and `execmanager.retrieve_calculation`? Monkeypatching some part of AiiDA or plumpy isn't off limits here.
EDIT: maybe `class MockCodeException(plumpy.CancelledError, plumpy.Interruption)` works..
> Nothing fancy, I just hardcoded `max_attempts` to 1 here:
Note that we now added a configuration option for this so you can do `verdi config transport.task_maximum_attempts 1`. It is in `develop` so not released yet.
> how to cleanly put a calculation into excepted state from `execmanager.submit_calculation` and `execmanager.retrieve_calculation`
If you are talking about the `CalcJobNode`, then all you have to do is simply call `node.set_process_state(ProcessState.EXCEPTED)`. However, if you are relying on the corresponding `Process` to also behave accordingly, this will of course not work. But I am not sure what you are doing here and how much of the engine you are bypassing.
Edit:
> I'm not totally convinced yet that it's not due to the exponential backoff mechanism.
The standard behavior of `CalcJob` processes is to pause after the EBM is triggered N times. So even if you set the number of iterations to 1, after that the process will still be paused and it will not actually except. So if you are expecting any exceptions in the transport tasks to actually except and terminate the process, then that will never happen.
If you want to raise an exception that is ignored by the EBM, you can derive it from `plumpy.Interruption` indeed. That is explicitly ignored, but I am not 100% sure if that will then cause the process to be excepted; I think so, though.
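A toy model of that pass-through behavior (stand-in classes and a simplified retry loop, not the real plumpy / aiida-core implementation):

```python
import asyncio


class Interruption(Exception):
    """Stand-in for plumpy.Interruption."""


async def exponential_backoff_retry(fct, max_attempts=3):
    # Toy EBM: retries ordinary exceptions, but lets Interruption-derived
    # ones propagate immediately instead of entering the retry loop again.
    for attempt in range(max_attempts):
        try:
            return await fct()
        except Interruption:
            raise
        except Exception:
            if attempt == max_attempts - 1:
                raise


attempts = {"count": 0}


async def task():
    attempts["count"] += 1
    raise Interruption("stop immediately")


try:
    asyncio.run(exponential_backoff_retry(task))
except Interruption:
    pass

print(attempts["count"])  # 1 - the task was not retried
```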
> But I am not sure what you are doing here and how much of the engine you are bypassing.
We're not really bypassing the engine as such, instead we're monkeypatching the `execmanager` functions to add a caching layer. We can't use the usual caching because that circumvents what we're trying to test, namely writing of calculation input and parsing of outputs.
The purpose of raising an error though would be to give a clearer error message when this caching fails, and there isn't a fallback executable configured. So the ideal case would be that the exception bubbles up all the way to `pytest`. I'm not convinced that is possible though, since it might leave the engine in an inconsistent state (the `IOLoop is already running` error).
If you know how to reset the engine to a clean state, that would also be helpful.
Raising `plumpy.Interruption` seems to put the calculation in `FINISHED` state with exit status `100` because the expected output files aren't there.
That's already better than what we had before, but maybe we can improve on that by manipulating the calculation (process), monkeypatching `run` to detect this, or both.
Sorry for interleaving a separate discussion here; regarding
> currently, information is passed via the extras of the code instance, which is not ideal; in particular, I notice that there is no guaranteed way to get the code back from the calculation node (?). I'm relying on `calcnode.inputs.code` here, which is merely a convention, though
Generally I think we want to keep the information associated to the `Code` node in some way, because that's what the fixture produces. For making the calculation -> code link more flexible, I see two options (both could be added later / when needed):
```python
qb = orm.QueryBuilder()
qb.append(orm.CalcJobNode, filters={'uuid': calculation.uuid}, tag='calc')
qb.append(orm.Code, with_outgoing='calc')
code = qb.one()[0]
```
Aside from the lookup of the code, are there other concerns to putting things in the extras?
Note that now I'm also storing stuff in the calculation extras, to communicate between the `submit` and `retrieve` steps.
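As a sketch, the extras act as a small key-value store shared between the two patched steps (a plain dict stands in for the node's extras here; the real code would go through `node.set_extra` / `node.get_extra`, and the key name is purely illustrative):

```python
# Toy stand-in for an AiiDA node's extras: a plain dict.
def patched_submit_calculation(extras):
    # Record at submit time which mock executable handled the job
    # (illustrative key and value).
    extras["mock_code_label"] = "mock-executable"


def patched_retrieve_calculation(extras):
    # Read the information back at retrieve time.
    return extras.get("mock_code_label")


extras = {}
patched_submit_calculation(extras)
print(patched_retrieve_calculation(extras))  # "mock-executable"
```

Since both patched functions only receive the node, the extras are the natural persistent channel between them.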
> I'm relying on `calcnode.inputs.code` here, which is merely a convention, though
I guess this is technically true, since anyone can change the `ProcessSpec` of the `CalcJob` class; however, the base class `CalcJob` always has `Code` in the base namespace. Are you aware of any plugin that explicitly changes this and puts the `Code` node in a different namespace? Thinking about it, `aiida-core` may even assume that the code is going to be there in some places and may fail if one moves it.
Yeah.. I feel we can solve this particular problem when / if we get there.
Finally coming back to the question of the exception handling. @ltalirz mentioned that he is sure that the observed undesired behavior of an exception being raised causing the tests to hang is not due to the EBM. However, @greschd seems to suggest that raising an exception that subclasses from `plumpy.Interruption` does work. This to me really suggests that the EBM is the culprit, which makes perfect sense to me. Wouldn't it then just be the easiest to simply monkeypatch the `aiida.engine.utils.exponential_backoff_retry` method to simply call the function it is passed and omit all the EBM implementation? This should then cause any exception to simply be bubbled up.
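A self-contained sketch of what that monkeypatch could look like (the `utils` class below is a toy stand-in; in an actual test one would use pytest's `monkeypatch.setattr` on `aiida.engine.utils.exponential_backoff_retry`):

```python
import asyncio


class utils:
    # Toy stand-in for the module holding the EBM implementation.
    @staticmethod
    async def exponential_backoff_retry(fct, max_attempts=5):
        last_exc = None
        for _ in range(max_attempts):
            try:
                return await fct()
            except Exception as exc:  # swallows the error and retries
                last_exc = exc
        raise last_exc


async def passthrough(fct, *args, **kwargs):
    # Replacement: call the coroutine exactly once and let any exception
    # bubble up to the caller (and hence to pytest).
    return await fct()


# In a pytest fixture: monkeypatch.setattr(utils, "exponential_backoff_retry", passthrough)
utils.exponential_backoff_retry = passthrough

calls = {"count": 0}


async def failing_task():
    calls["count"] += 1
    raise RuntimeError("transport failed")


try:
    asyncio.run(utils.exponential_backoff_retry(failing_task))
except RuntimeError:
    pass

print(calls["count"])  # 1 - the failure surfaced on the first attempt
```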
I'll give that a go, thanks @sphuber!
This PR illustrates how the `mock-code` executable can be monkey-patched into AiiDA's calculation submission (only the last commit).
All but one test passes - I suspect this is due to the hash having changed.
Besides general code improvements, there are a couple of open issues:
- … `mock` to do this.
- … `calcnode.inputs.code` here, which is merely a convention, though