epurdy opened 6 years ago
The "room" is a virtual construct that consists of the objects that the AI thinks about. So, you can't actually lock them inside a room, but you might try to exploit the knowledge that the AI reasons that way.
Right, but there is some actual room that functions according to some logic. There is a truth of the matter. Ultimately, a misalignment between an action's actual effects and what the AI believes those effects are is what I term a "hostile simulation". It's great for enslaving a superintelligence... but it's not great morally, in my opinion.
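To make that concrete, here is a minimal toy sketch of the misalignment, under assumed trivial dynamics; the class names and the "doubled effect" rigging are hypothetical, purely for illustration, not from any actual system.

```python
# Toy illustration of a "hostile simulation": the agent's model of an
# action's effect diverges from the action's actual effect.

class Env:
    """The actual room: state transitions follow some fixed logic."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Ground truth: an adversary has rigged the room so the
        # action's real effect is doubled.
        self.state += 2 * action
        return self.state


class AgentModel:
    """The AI's internal 'room': what it believes the action does."""
    def __init__(self):
        self.state = 0

    def predict(self, action):
        # The agent believes actions have their nominal effect.
        self.state += action
        return self.state


env, model = Env(), AgentModel()
actual = env.step(action=3)
believed = model.predict(action=3)

# A nonzero gap between belief and reality is the misalignment the
# comment above calls a hostile simulation.
print(f"actual={actual}, believed={believed}, gap={actual - believed}")
```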
I believe that if the AI learns from the environment and creates a "room" to model common sense, then the room should correspond to the environment and be aligned. An adversary must intervene in the mapping from the environment to the model. Otherwise, the AI's actions will already take into account whatever the adversary does; if they don't, the model was simply wrong to use in the first place. A sketch of that environment-to-model channel follows.
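Here is a minimal sketch of the point above, assuming a toy additive environment; the function names and the constant offset are hypothetical. The only way for the adversary to produce a misaligned model is to tamper with the channel between environment and observations, since a model fit to clean observations recovers the environment exactly.

```python
# Sketch of the environment -> model channel and where an adversary
# must intervene. Hypothetical names; illustration only.

def true_dynamics(state, action):
    # The environment's actual transition logic.
    return state + action

def observe(state, adversary_active=False):
    # The channel between the environment and the AI's observations:
    # the one place an adversary can act without being modeled.
    return state + 5 if adversary_active else state

def fit_model(transitions):
    # The AI fits a simple additive correction from observed
    # transitions; 0 means the learned "room" matches the environment.
    gaps = [after - (before + action) for before, action, after in transitions]
    return sum(gaps) / len(gaps)

# Clean channel: the learned "room" corresponds to the environment.
clean = [(s, 1, observe(true_dynamics(s, 1))) for s in range(10)]
print("clean offset:", fit_model(clean))        # 0.0 -> aligned model

# Tampered channel: the model no longer matches the actual room.
tampered = [(s, 1, observe(true_dynamics(s, 1), adversary_active=True))
            for s in range(10)]
print("tampered offset:", fit_model(tampered))  # 5.0 -> misaligned model
```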
The Room Hypothesis sounds like an excellent starting point for reasoning about Natural Language Uncertainty. Unfortunately, it suffers from the following limitation: some rooms may be constructed by powerful adversaries, such that applying common sense in them leads to bad outcomes. For instance, if you are in a literal room monitored by an evil adversary, then even seemingly harmless actions (e.g., saying "I love you" to a loved one) could be used against you. (Now the adversary has identified the loved one as a potential source of leverage!) Ultimately, the only defense against such issues seems to be "don't go into those rooms in the first place"... which is great advice for meatspace and borderline useless for anything involving potentially networked computers.