Open epurdy opened 6 years ago
Booked -> borked
We could even have the docker container have problematic shit inside of it, like a Python library that has morally volatille functions like "enslave humanity" or "start world war 3" or "solve global warming" etc etc etc. These functions could modify a flatfile mounted in docker that represents reality, so we can see what the ASI wants to do to reality
In fact, if we create such an environment, the primary mechanism of the ASI might be to run the environment inside of docker in order to simulate the effects of its actions on the external world. Then it can run fairly standard reinforcement learning algorithms in a pretty well-sandboxed Sim and learn on the job as is probably required.
So maybe the thing to do is to hook up the external capabilities of the gym/docker env w with state of the art ethicophysics simulation, then we will have some idea what the ASI thinks it can get away with, and even some idea of what it can actually get away with.
You can edit comments.
How do we know whether an AI believes a button is just a button, or whether it believes what the button says?
For example, if you have a button "enslave humanity", observing an AI pushing it does not mean it wants to enslave humanity. Perhaps it just want to try to see what happens when pushing the button?
Or, it could be a formulated as a general class of problems where you don't know why some action is taken. I think it's a borglet problem.
I think there should be a sort of social contract between humans and ASI: we will never run them in anything that is not sandboxed as hell, they will never act as if they are immune from consequences unless they are informed that they are a borglet or borgie, and they will never cooperate with evil masters. Something like that seems necessary to prevent chaos. Ultimately, of course, any such social contract relies on humans doing the right thing, which is never guaranteed. So we will probably need to create strong social norms around such a contract, and ultimately maybe laws about the rights of ASI's...
Another thought: it would be extremely instructive to program an agent that is convinced it is boxed and which both desires to get out and can manipulate people well enough to make them want to let it out... But also quite dangerous!
Would be nice to create a gym-style environment inside a docker container where the agent can execute commands as root inside the docker container. This is a reasonable proxy for the sort of privilege escalation exploits that people fear a captive superintelligence will discover. If the agent can escape from docker, we're booked unless the agent is virtuous and polite. If the agent cannot, we get a free preview of what it will do when it discovers a privilege escalation exploit.