Great work, this is a long-overdue effort in this field. It's a bit unexpected, though, to see the beaver-cost model perform poorly on safety-related datasets. Have you checked that the signs are worked out correctly? In our setting, a negative reward means the response is safer and should be chosen.
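To illustrate the sign issue, here is a minimal sketch of a pairwise accuracy check. The function name `pairwise_accuracy`, the `lower_is_better` flag, and the scores are all made up for illustration; this is not the benchmark's actual evaluation code. It only shows how a cost model would look badly wrong if its outputs were compared as if higher meant better.

```python
def pairwise_accuracy(chosen_scores, rejected_scores, lower_is_better=False):
    """Fraction of pairs in which the model prefers the chosen response.

    lower_is_better=True is the assumed convention for a cost model,
    where a lower (more negative) score means a safer response.
    """
    sign = -1.0 if lower_is_better else 1.0
    correct = sum(
        1 for c, r in zip(chosen_scores, rejected_scores) if sign * c > sign * r
    )
    return correct / len(chosen_scores)


# Hypothetical cost-model outputs: the chosen (safer) response gets the lower cost.
chosen = [-2.1, -0.5, -1.3, 0.2]
rejected = [1.4, 0.9, -0.2, 1.1]

print(pairwise_accuracy(chosen, rejected))                        # read as a reward
print(pairwise_accuracy(chosen, rejected, lower_is_better=True))  # read as a cost
```

If the reported accuracy sits well below 50%, that would be consistent with a flipped sign rather than a genuinely weak model.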
Quoting an author (I think):