DFHack / dfhack

Memory hacking library for Dwarf Fortress and a set of tools that use it
Other
1.88k stars 478 forks source link

Ruby win64 crash #1023

Closed lethosor closed 7 years ago

lethosor commented 8 years ago

This came up on IRC around alpha2, and someone reported an exterminate crash today that's likely the same issue.

A couple things to check, from @jjyg:

Some scripts that have come up (and should probably be checked to make sure they don't crash before r1, etc.):

ab9rf commented 8 years ago

I am seriously considering rewriting autofarm in C++, just because I use it so much and I don't want to dependent on Ruby for it to work. It's not like it would be hard to do in C++.

lethosor commented 8 years ago

Hopefully this sort of thing happens rarely (besides this, the only time I remember having to fix Ruby support was when OS X stopped distributing Ruby 1.8, before the ruby plugin supported 2.0), but I won't object if you decide writing it in C++ would be better.

cc-corey commented 7 years ago

I believe I'm getting the same crash with source

lethosor commented 7 years ago

Yeah, this issue is still open.

Can you try the commands listed in the first checklist and see if they crash?

cc-corey commented 7 years ago

Some testing. The info from adaptation might show a way forward.

exterminate - crashes growcrops - crashes autofarm - crashes ban-cooking - crashes deathcause - crashes adaptation - does not immediately crash on just running "adaptation". Typing "adaptation" show gives error messages but "adaptation show him" crashes source - crashes

I would also not that sources gave me an error about not having ruby framework (I don't have the exact error anymore) in the old version of dfhack. I could load that up if it would be helpful. The crash is immediate with no error message or other feedback.

Also tested

autounsuspend - without options gives "stopped" adding start causes crash

lethosor commented 7 years ago

You definitely have a ruby library. 0.43.05-alpha1 didn't, but alpha2 does.

Can you try the two commands in the first checklist from above?

cc-corey commented 7 years ago

Yes, I understood that was the difference. Just wanted to add the info in case it helps people understand where things are.

rb df.curview.feed(:SELECT) - crashes rb p df.world - crashes

fyi just running rb does not crash.

lethosor commented 7 years ago

Yeah, rb by itself doesn't do anything. That's good to know, though. I was hoping it would be just one, but it sounds like a couple things are issues.

What about stuff like rb p df.ui, rb p df.current_weather, etc.?

ab9rf commented 7 years ago

autounsuspend with no options doesn't access anything in DF memory; it just checks the value of a variable in ruby namespace. But autounsuspend stop calls df.onupdate_register, and that's presumably where the crash occurs.

There's obviously still a problem with ruby calling at least some entry points in the DF main code, including apparently STL utility methods (which are used by the Ruby formatters for the world object, for example).

Do relatively "simple" global variables (e.g. rb p df.cursor, rb p df.cur_year) work?

lethosor commented 7 years ago

onupdate_register just calls a few functions in ruby.cpp (setting a few onupdate_ variables):

        def onupdate_register(descr, ticklimit=nil, initialtickdelay=0, &b)
            raise ArgumentError, 'need a description as 1st arg' unless descr.kind_of?(::String)
            @onupdate_list ||= []
            @onupdate_list << OnupdateCallback.new(descr, b, ticklimit, initialtickdelay)
            DFHack.onupdate_active = true
            if onext = @onupdate_list.sort.first
                DFHack.onupdate_minyear = onext.minyear
                DFHack.onupdate_minyeartick = onext.minyeartick
            end
            @onupdate_list.last
        end

cur_year and related globals look reasonable, although I don't know if they're right:

        <global-address name='cur_year' value='0x141904248'/>
        <global-address name='cur_year_tick' value='0x141904240'/>
        <global-address name='cur_year_tick_advmode' value='0x141884110'/>
        <global-address name='cur_season_tick' value='0x1413030d4'/>
        <global-address name='cur_season' value='0x1413030e5'/>

Can you try printing those (rb p df.cur_year, etc.) and see if they're right?

Edit: @ab9rf confirmed a crash with rb p df.cur_year on IRC.

cc-corey commented 7 years ago

Okay, here is the next round of testing.

rb p df.cursor - crash, with our without a cursor ('k') rb p df.ui - crash rb p df.cur_year - crash rb p df.current_weather - crash rb p df.cur_year_tick - crash rb p df.cur_year_tick_advmode - crash rb p df.cur_season - crash rb p df.cur_season_tick - crash

I'm sorry to say it looks like they are all off. I only tested on a single save game but I have no reason to believe that anything is wrong with it. All of the lua stuff I've used is working great and dfhack doesn't otherwise seem to be having issues. Is there a log file I could go looking for to see if that provides any additional insight? Did I miss anything you wanted tested?

And now back to a bit of playing df! Cheers

lethosor commented 7 years ago

Oh, yeah, we figured out the issue with globals on IRC, so I'm not surprised that all of those are affected. Basically, somewhere in the function that translates global names to addresses, the upper 32 bits are truncated (e.g. 0x1413030e5 becomes 0x413030e5), which is bad.

ab9rf commented 7 years ago

I just spent some time with the MSVC debugger poking around in this. The problem is the use of rb_uint2inum to translate pointers to a Ruby value (e.g. https://github.com/DFHack/dfhack/blob/develop/plugins/ruby/ruby.cpp#L583). This is presumably because in win64, "long" is 32 bits, and thus the "long long" type is required to get all 64 bits in win64. (I'm told that on linux64, "long" is 64 bits.) The result is to truncate pointers to 32 bits when they're converted to a Ruby value, which makes them invalid for later reference.

I was able to make a small bit of progress by replacing rb_uint2inum to rb_ull2inum in that one place. With this change, DFHack.get_global_address returns the full 64-bit address. But there are other places in ruby.cpp that also need to be changed, presumably to use rb_num2ull (instead of rb_num2ulong) everywhere that pointers are being handled. Furthermore, this might result in code that doesn't work properly on linux.

I'm not up for trying to test this further tonight, so I'll leave this here for anyone who wants to pick it up, or I'll try to work on it further either later tonight or tomorrow. Someone willing to test whether using rb_ull2inum and rb_num2ull cause breakage on linux would be helpful.

p.s. Thanks @lethosor for your help with this on irc.

ab9rf commented 7 years ago

I've made the changes suggested above and with these changes ruby appears to work on win64. I was able to do rb p df.cur_year and rb p df.world just fine, and starting autofarm in a loaded world worked without any trouble at all.

I have the modifications in a branch of my repo, at 36c12d86a013bd5b88df87daa5764ccde30c19b7.

lethosor commented 7 years ago

Those changes look like they work on OS X (both builds), so they ought to work on Linux too.

Interesting issue: rb df.curview.feed(:SELECT) is resulting in an error for me, and rb df.curview.feed(0) crashes, even before your changes. I don't remember that happening in 0.43.05-alpha2.

Edit: that's because the actual method is feed_keys, which works just fine.

ab9rf commented 7 years ago

All of the scripts named earlier in this issue work (or at least do not crash) for me on win64 with the patch in #1041

lethosor commented 7 years ago

Cool, I'll close this. Hopefully we can get a build up fairly soon.