facebookresearch / fairo

A modular embodied agent architecture and platform for building embodied agents
MIT License

Add flag in agent to catch all exceptions #342

Open Rebecca-Qian opened 3 years ago

Rebecca-Qian commented 3 years ago


Description

Support a mode where the agent fails gracefully.

Current Behavior

Exceptions are always raised, which is bad UX for non-internal users. This is especially problematic when we're running the agent in remote containers, where we can't intervene to restart after crashes.

Expected Behavior

In production settings, let's catch all exceptions instead of raising.

Code pointers:
https://github.com/facebookresearch/droidlet/blob/main/base_agent/core.py#L16
https://github.com/facebookresearch/droidlet/blob/77e9ad05dd68249a7c0fdba13c8b4f853ff5ec5f/base_agent/loco_mc_agent.py#L147
https://github.com/facebookresearch/droidlet/blob/main/base_agent/loco_mc_agent.py#L208
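For illustration, a minimal sketch of what such a mode could look like, assuming a hypothetical `catch_exceptions` option wrapped around the agent's step loop; the class and method names below are illustrative, not the actual droidlet code:

```python
import logging

class BaseAgentSketch:
    """Illustrative only; not the real base_agent.core.BaseAgent."""

    def __init__(self, catch_exceptions=False):
        # hypothetical flag controlling graceful-failure mode
        self.catch_exceptions = catch_exceptions

    def start(self):
        while True:
            if self.catch_exceptions:
                try:
                    self.step()
                except Exception:
                    # log the full traceback and keep the agent loop alive
                    logging.exception("exception in agent step; continuing")
            else:
                self.step()

    def step(self):
        # perception / memory / controller / task step would go here
        pass
```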

soumith commented 3 years ago

I really vote against this.

Exceptions shouldn't be raised in the first place. If we are expecting certain kinds of exceptions to be "non-robust", then they should be added to the whitelist with a very clear justification: https://github.com/facebookresearch/droidlet/blob/main/base_agent/loco_mc_agent.py#L195-L198

Otherwise you are just shipping broken software

aszlam commented 3 years ago

We need this for live interaction for annotation, demos, etc. There will always be unexpected errors; the agent needs to be able to not crash when something stupid happens. In those situations, there should be a try/catch around the whole agent step.

snyxan commented 3 years ago

> I really vote against this.
>
> Exceptions shouldn't be raised in the first place. If we are expecting certain kinds of exceptions to be "non-robust", then they should be added to the whitelist with a very clear justification: https://github.com/facebookresearch/droidlet/blob/main/base_agent/loco_mc_agent.py#L195-L198
>
> Otherwise you are just shipping broken software

We certainly should deal with exceptions properly and not ignore them. What we want here is a mode where the agent can keep running without being terminated. You can use this mode only when you need it (in our case, a dashboard session in a Turk job).

soumith commented 3 years ago

I think I missed some kind of offline discussion, but honestly this isn't making much sense.

When exceptions are raised, you have no idea what state of brokenness the software is in. Even collecting data from turk (say for re-training) in some unknown broken state is completely insane.

If there is a known set of unexpected errors, and we know they will happen, the exceptions raised by those kinds of errors need to be caught by the exception allowlist that I linked to.

Catching any and all Python exceptions and pretending things are fine is not okay. What about socket exceptions, or disk-read exceptions?

soumith commented 3 years ago

If you want to deal with this by just restarting the entire piece of software, then feel free to add an outer loop to the Turk deployment script that explicitly restarts the entire Python program from scratch if the process exits with a non-zero error code.
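A sketch of what that outer loop could look like; `run_agent.py` and the backoff below are placeholders, not part of the actual deployment script:

```python
import subprocess
import time

# Sketch of an outer restart loop: re-launch the agent process from scratch
# whenever it exits with a non-zero code. "run_agent.py" is a placeholder.
while True:
    proc = subprocess.run(["python", "run_agent.py"])
    if proc.returncode == 0:
        break  # clean exit, nothing to restart
    print(f"agent exited with code {proc.returncode}; restarting")
    time.sleep(5)  # brief pause before restarting
```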

aszlam commented 3 years ago

Longer term it might be viable to accurately snapshot agent state and restart, but there are problems with that in the short term; even then, loading the models takes a decent amount of time, for example. On the other hand, most of the errors we do see in the interpreter (which account for most of the errors annotators encounter) don't leave the state too weird, and it is really useful to log and keep going (rather than having to end that annotation job). W.r.t. "Even collecting data from turk (say for re-training) in some unknown broken state is completely insane": I disagree. We have been running these jobs for a while, and both the annotator reactions and the logs have been useful. I do agree we can do a better job of handling and properly catching errors throughout the agent, but I don't think we should stop annotation until that is improved.

tl;dr: it is broken and I agree we should fix it, but we also need to iterate on annotation, and crashing out of annotation jobs is bad.

aszlam commented 3 years ago

note that the proposal is to make "ignore exceptions" a flag, not the default behavior

Rebecca-Qian commented 3 years ago

> I think I missed some kind of offline discussion, but honestly this isn't making much sense.
>
> When exceptions are raised, you have no idea what state of brokenness the software is in. Even collecting data from turk (say for re-training) in some unknown broken state is completely insane.
>
> If there is a known set of unexpected errors, and we know they will happen, the exceptions raised by those kinds of errors need to be caught by the exception allowlist that I linked to.
>
> Catching any and all Python exceptions and pretending things are fine is not okay. What about socket exceptions, or disk-read exceptions?

We're discussing this in the hack session; sorry, I should've put more detail into this issue to not cause alarm 😛.

The idea is to handle all non-fatal exceptions. I agree, some exceptions are non-recoverable and should cause the agent to crash. In the FB app, the majority of traces come from non-fatals, but some require a cold restart, like OOM errors.

The examples I'm thinking of are mostly commands that the interpreter fails to execute due to bugs in the agent code, not networking/IO failures. These currently cause the vast majority of crashes, and I really do believe the agent needs to be able to fail gracefully in production settings.

soumith commented 3 years ago

If you all want to make the interpreter eat all the exceptions that the interpreter generates, then do that, instead of invoking a global flag that affects all subsystems.

Don't screw over, in the process, all the other subsystems that need to die on FATAL.

The reason I cleaned up the exception handling last time is that the handlers were eating some multiprocess crashes in the perception module that were silently filling memory with garbage. Stuff like that is really hard to debug, and it does happen in real experiments (obviously).

Secondly,

> The idea is to handle all non-fatal exceptions. I agree, some exceptions are non-recoverable and should cause the agent to crash. In the FB app, the majority of traces come from non-fatals, but some require a cold restart, like OOM errors.

We are not arguing here about fatal vs non-fatal exceptions. We are arguing about eating all exceptions or emitting all exceptions. I am also talking about the paradigm that the FB app uses, i.e. explicitly allowlisting a category of exceptions that you know are safe (or explicitly denying a category of exceptions that need to be fatal).
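A rough sketch of that allowlist pattern, for illustration only; the exception classes and the `run_step` helper below are placeholders, not droidlet's actual code:

```python
import logging

# Hypothetical allowlist: exceptions we have explicitly decided the agent
# can survive, each with a clear justification in review.
SAFE_EXCEPTIONS = (
    ConnectionResetError,
)

def run_step(agent):
    try:
        agent.step()
    except SAFE_EXCEPTIONS:
        # known-safe: log the traceback and keep the agent alive
        logging.exception("allowlisted exception in agent step; continuing")
    except Exception:
        # anything not explicitly allowlisted stays fatal
        raise
```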

aszlam commented 3 years ago

> If you all want to make the interpreter eat all the exceptions that the interpreter generates, then do that, instead of invoking a global flag that affects all subsystems.

Some form of this seems pretty reasonable. We would need to do it (separately) in the interpreter, tasks, and memory, but I think that's it?
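As a rough sketch of what that per-subsystem handling could look like in the interpreter; the `interpreter.step()` call and the error chat below are placeholders, not the actual droidlet interfaces:

```python
import logging

def safe_interpreter_step(interpreter, agent):
    # Catch exceptions only around the interpreter, so other subsystems
    # (perception, memory, etc.) keep their normal fatal behavior.
    try:
        return interpreter.step(agent)
    except Exception:
        logging.exception("error while interpreting a command")
        # placeholder hook for surfacing the failure to the user
        agent.send_chat("Sorry, I hit an error interpreting that command.")
        return None
```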

For the very short term (the next few weeks), can we have the nasty flag on the MC side? We can commit to cleaning it up after NeurIPS (if we don't do it in a hackathon before).

soumith commented 3 years ago

It's a patch that would take an hour at most, but if you all would like to eat all exceptions for now, I won't stop you. Please do fix it right as soon as you can, though.