Open ltalirz opened 3 years ago
Thanks @ltalirz, I'll play around with this a bit next week.
@ltalirz I've thought a little bit about how / why the monkeypatching works. While I'm not 100% sure about it, I think it only works because whatever runs the calculation doesn't have a direct reference to the `submit_calculation` function before the monkeypatch is applied. That is, it either imports it after the monkeypatch is applied, or calls it as `execmanager.submit_calculation`.
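This late-binding behavior can be illustrated with a small self-contained sketch (toy module and function names, not the actual AiiDA code):

```python
# Toy illustration (not actual AiiDA code) of why the monkeypatch only
# reaches call sites that look the function up at call time.
import types

# Stand-in for the execmanager module
execmanager = types.ModuleType("execmanager")
execmanager.submit_calculation = lambda: "real submit"

# A caller that grabbed a direct reference BEFORE the patch:
direct_reference = execmanager.submit_calculation

# Apply the monkeypatch:
execmanager.submit_calculation = lambda: "mocked submit"

print(direct_reference())                # still "real submit"
print(execmanager.submit_calculation())  # "mocked submit"
```

A caller that writes `execmanager.submit_calculation()` resolves the attribute at call time and therefore sees the patch; a caller holding `direct_reference` does not.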
A related and possibly more important point is that starting a separate Python process means the monkeypatch is gone. As such, we would have to hook into the daemon startup in a more intrusive way to get `submit` (instead of `run`) to work, I think.
Right, that would be a limitation.
I personally always use `run` in testing, though - not sure whether there are really use cases where you want to `submit` in a test?
And I would stress that getting better errors for crashes due to a missing executable, etc. would be a big time saver for me in working with mock-code - currently, when something crashes, my workflow is to `import pdb; pdb.set_trace()` in the test and inspect `node.get_remote_workdir()`.
I think in many cases, one could avoid this by raising an appropriate exception - either at the submit stage (if a necessary code is missing) or at the parsing stage (e.g. by patching in some additional checks)
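One way such a fail-fast check at the submit stage could look (illustrative names only, this is not the actual aiida-testing API):

```python
class MockCodeError(RuntimeError):
    """Hypothetical error raised when the mock layer can neither find a
    cached result nor fall back to a real executable."""


def patched_submit(cache_hit: bool, fallback_executable=None):
    # Illustrative stand-in for the patched submit step: raise a clear
    # error immediately instead of letting the calculation crash later.
    if not cache_hit and fallback_executable is None:
        raise MockCodeError(
            "No cached result found and no fallback executable configured."
        )
    return "submitted"
```

The point is that the user sees one explicit message at submission time, rather than having to trace a missing-output failure back through the parser.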
> not sure whether there are really use cases where you want to `submit` in a test
I've used `submit` in tests before, to find cases where an object in the `ctx` cannot be (de-)serialized. That applies only to workflows though. Of course one could use the `mock_code` also for the calculations in a workflow, but in principle `export_cache` is more suited for that.
So if we can solve the other problems, maybe monkeypatching is still the better way to go.
> All but one test passes - I suspect this is due to the hash having changed.
I was actually a bit puzzled why the tests passed, because the hash produced by `get_hash()` in the monkeypatched version is wrong AFAICT - it uses the content of the CWD, which in that case is just wherever the tests are launched from.
The reason this still works is, I think, because the `mock-code` CLI is still executed when the cache is not found. In the end, I think we want to get rid of that.
Maybe there's also a flaw in our testing hidden somewhere in here.
@ltalirz I think the next bigger thing to do here now is figuring out why the exceptions don't work as expected. I'm not totally convinced yet that it's not due to the exponential backoff mechanism. At the very least, when I simply try raising an error it does end up in the `verdi process report` multiple times. Maybe there's still some event loop related weirdness when the backoff mechanism is off, though.
Could you share how exactly you disabled the exponential backoff in `aiida-core`, so that I can poke around a bit?
Side note: It would be useful if we implemented a `--wait-before-wipe` option in the `aiida-core` fixtures, that prompts before wiping the temporary directory.
> Side note: It would be useful if we implemented a `--wait-before-wipe` option in the `aiida-core` fixtures, that prompts before wiping the temporary directory.
That is a great suggestion, and perhaps quite straightforward to implement (?) Happy to review a PR for it :-)
> Could you share how exactly you disabled the exponential backoff in `aiida-core`, so that I can poke around a bit?
Nothing fancy, I just hardcoded `max_attempts` to 1 here: https://github.com/aiidateam/aiida-core/blob/68f840da709d796a81b8458171f34b2c13fbb784/aiida/engine/utils.py#L169
> That is a great suggestion, and perhaps quite straightforward to implement (?) Happy to review a PR for it :-)
Yep, I already have it implemented (well, the reverse option) in `aiida-pytest`. Might need some polishing because that code is quite old by now. Suggestions for the exact naming of the flag are welcome.
> Nothing fancy, I just hardcoded `max_attempts` to 1 here:
Ah, I think the problem might then be that the process goes into the `paused` state when the transport task fails repeatedly, instead of excepting? I think the workaround will be to raise an exception class that isn't caught.. if it exists. Maybe we need to monkeypatch AiiDA again for that otherwise.
By raising an exception derived directly from `BaseException` (against explicit advice from the Python docs) I was able to get around the protections here. The first exception raised this way shows up nicely in the pytest log.
However, then the subsequent test runs into `RuntimeError: IOLoop is already running`.
Maybe @sphuber knows how to cleanly put a calculation into excepted state from `execmanager.submit_calculation` and `execmanager.retrieve_calculation`? Monkeypatching some part of AiiDA or plumpy isn't off limits here.
EDIT: maybe `class MockCodeException(plumpy.CancelledError, plumpy.Interruption)` works..
> Nothing fancy, I just hardcoded `max_attempts` to 1 here:
Note that we now added a configuration option for this so you can do `verdi config transport.task_maximum_attempts 1`. It is in `develop` so not released yet.
> how to cleanly put a calculation into excepted state from `execmanager.submit_calculation` and `execmanager.retrieve_calculation`
If you are talking about the `CalcJobNode`, then all you have to do is simply call `node.set_process_state(ProcessState.EXCEPTED)`. However, if you are relying on the corresponding `Process` to also behave accordingly, this will of course not work. But I am not sure what you are doing here and how much of the engine you are bypassing.
Edit:
> I'm not totally convinced yet that it's not due to the exponential backoff mechanism.
The standard behavior of `CalcJob` processes is to pause after the EBM is triggered N times. So even if you set the number of iterations to 1, after that the process will still be paused and it will not actually except. So if you are expecting any exceptions in the transport tasks to actually except and terminate the process, then that will never happen.
If you want to raise an exception that is ignored by the EBM, you can derive it from `plumpy.Interruption` indeed. That is explicitly ignored, but I am not 100% sure if that will then cause the process to be excepted; I think so, though.
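A toy model of that pass-through behavior (stand-in classes and a simplified retry loop, not the real plumpy / aiida-core implementation):

```python
import asyncio


class Interruption(Exception):
    """Stand-in for plumpy.Interruption."""


async def exponential_backoff_retry(fct, max_attempts=3):
    # Toy EBM: retries ordinary exceptions, but lets Interruption-derived
    # ones propagate immediately instead of entering the retry loop again.
    for attempt in range(max_attempts):
        try:
            return await fct()
        except Interruption:
            raise
        except Exception:
            if attempt == max_attempts - 1:
                raise


attempts = {"count": 0}


async def task():
    attempts["count"] += 1
    raise Interruption("stop immediately")


try:
    asyncio.run(exponential_backoff_retry(task))
except Interruption:
    pass

print(attempts["count"])  # 1 - the task was not retried
```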
> But I am not sure what you are doing here and how much of the engine you are bypassing.
We're not really bypassing the engine as such, instead we're monkeypatching the `execmanager` functions to add a caching layer. We can't use the usual caching because that circumvents what we're trying to test, namely writing of calculation input and parsing of outputs.
The purpose of raising an error though would be to give a clearer error message when this caching fails, and there isn't a fallback executable configured. So the ideal case would be that the exception bubbles up all the way to `pytest`. I'm not convinced that is possible though, since it might leave the engine in an inconsistent state (the `IOLoop is already running` error).
If you know how to reset the engine to a clean state, that would also be helpful.
Raising `plumpy.Interruption` seems to put the calculation in `FINISHED` state with exit status `100` because the expected output files aren't there.
That's already better than what we had before, but maybe we can improve on that by manipulating the calculation (process), monkeypatching `run` to detect this, or both.
Sorry for interleaving a separate discussion here; regarding
> currently, information is passed via the extras of the code instance, which is not ideal; in particular, I notice that there is no guaranteed way to get the code back from the calculation node (?). I'm relying on `calcnode.inputs.code` here, which is merely a convention, though
Generally I think we want to keep the information associated to the `Code` node in some way, because that's what the fixture produces. For making the calculation -> code link more flexible, I see two options (both could be added later / when needed):
```python
qb = orm.QueryBuilder()
qb.append(orm.CalcJobNode, filters={'uuid': calculation.uuid}, tag='calc')
qb.append(orm.Code, with_outgoing='calc')
code = qb.one()[0]
```
Aside from the lookup of the code, are there other concerns to putting things in the extras?
Note that now I'm also storing stuff in the calculation extras, to communicate between the `submit` and `retrieve` steps.
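As a sketch, the extras act as a small key-value store shared between the two patched steps (a plain dict stands in for the node's extras here; the real code would go through `node.set_extra` / `node.get_extra`, and the key name is purely illustrative):

```python
# Toy stand-in for an AiiDA node's extras: a plain dict.
def patched_submit_calculation(extras):
    # Record at submit time which mock executable handled the job
    # (illustrative key and value).
    extras["mock_code_label"] = "mock-executable"


def patched_retrieve_calculation(extras):
    # Read the information back at retrieve time.
    return extras.get("mock_code_label")


extras = {}
patched_submit_calculation(extras)
print(patched_retrieve_calculation(extras))  # "mock-executable"
```

Since both patched functions only receive the node, the extras are the natural persistent channel between them.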
> I'm relying on `calcnode.inputs.code` here, which is merely a convention, though
I guess this is technically true, since anyone can change the `ProcessSpec` of the `CalcJob` class; however, the base class `CalcJob` always has `Code` in the base namespace. Are you aware of any plugin that explicitly changes this and puts the `Code` node in a different namespace? Thinking about it, `aiida-core` may even assume that the code is going to be there in some places and may fail if one moves it.
Yeah.. I feel we can solve this particular problem when / if we get there.
Finally coming back to the question of the exception handling. @ltalirz mentioned that he is sure that the observed undesired behavior of an exception being raised causing the tests to hang is not due to the EBM. However, @greschd seems to suggest that raising an exception that subclasses from `plumpy.Interruption` does work. This to me really suggests that the EBM is the culprit, which makes perfect sense to me. Wouldn't it then just be the easiest to simply monkeypatch the `aiida.engine.utils.exponential_backoff_retry` method to simply call the function it is passed and omit all the EBM implementation? This should then cause any exception to simply be bubbled up.
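A self-contained sketch of what that monkeypatch could look like (the `utils` class below is a toy stand-in; in an actual test one would use pytest's `monkeypatch.setattr` on `aiida.engine.utils.exponential_backoff_retry`):

```python
import asyncio


class utils:
    # Toy stand-in for the module holding the EBM implementation.
    @staticmethod
    async def exponential_backoff_retry(fct, max_attempts=5):
        last_exc = None
        for _ in range(max_attempts):
            try:
                return await fct()
            except Exception as exc:  # swallows the error and retries
                last_exc = exc
        raise last_exc


async def passthrough(fct, *args, **kwargs):
    # Replacement: call the coroutine exactly once and let any exception
    # bubble up to the caller (and hence to pytest).
    return await fct()


# In a pytest fixture: monkeypatch.setattr(utils, "exponential_backoff_retry", passthrough)
utils.exponential_backoff_retry = passthrough

calls = {"count": 0}


async def failing_task():
    calls["count"] += 1
    raise RuntimeError("transport failed")


try:
    asyncio.run(utils.exponential_backoff_retry(failing_task))
except RuntimeError:
    pass

print(calls["count"])  # 1 - the failure surfaced on the first attempt
```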
I'll give that a go, thanks @sphuber!
This PR illustrates how the `mock-code` executable can be monkey-patched into AiiDA's calculation submission (only the last commit).
All but one test passes - I suspect this is due to the hash having changed.
Besides general code improvements, there are a couple of open issues:
- … `mock` to do this.
- … `calcnode.inputs.code` here, which is merely a convention, though