advancedresearch / path_semantics

A research project in path semantics, a re-interpretation of functions for expressing mathematics
MIT License

Evil Rooms #323

Open epurdy opened 6 years ago

epurdy commented 6 years ago

The Room Hypothesis sounds like an excellent starting point for reasoning about Natural Language Uncertainty. Unfortunately, it has the following limitation: some rooms may be constructed by powerful adversaries, so that applying common sense inside them leads to bad outcomes. For instance, if you are in a literal room that is monitored by an evil adversary, then even seemingly harmless actions (e.g., saying "I love you" to a loved one) could be used against you. (Now the adversary has identified the loved one as a potential source of leverage!) Ultimately, the only defense against such issues seems to be "don't go into those rooms in the first place"... which is great advice for meatspace and borderline useless for anything involving potentially networked computers.

bvssvni commented 6 years ago

The "room" is a virtual construct made up of the objects that the AI thinks about. So you can't actually lock the AI inside a room, but you might try to exploit the knowledge that the AI reasons that way.

epurdy commented 6 years ago

Right, but there is some actual room that behaves according to some logic. There is a truth of the matter. Ultimately, a misalignment between the actual effects of an action and what the AI thinks those effects are is what I term a "hostile simulation". It's great for enslaving a superintelligence... but in my opinion it's not great morally.

bvssvni commented 5 years ago

I believe that if the AI learns from the environment and creates a "room" to model common sense, then the room should correspond to the environment and be aligned with it. An adversary would have to interfere with the function that maps the environment to the model. Otherwise, the AI's actions will take into account whatever the adversary does, or else the room would simply be the wrong model to use.