HerculesWS / Hercules

Hercules is a collaborative software development project revolving around the creation of a robust massively multiplayer online role playing game (MMORPG) server package. Written in C, the program is very versatile and provides NPCs, warps and modifications. The project is jointly managed by a group of volunteers located around the world as well as a tremendous community providing QA and support. Hercules is a continuation of the original Athena project.
http://herc.ws
GNU General Public License v3.0
891 stars 756 forks source link

Server crash pre-renewal, lastest revision #694

Open belfallen opened 9 years ago

belfallen commented 9 years ago

Process information: Command line: map-server.exe Platform: Windows Server 2012 Standard (build 9200) CPU: x86_64 CPU, Family 6, Model 26, Stepping 5 Application architecture: x86 Compiler: Microsoft Visual C++ 2012 (v1700) Git: f7a97b556e05373711998d4f819cf7d89e4b0ce8

Exception: 0xc0000005 EXCEPTION_ACCESS_VIOLATION at location 0x00814E02 reading from location 0x078B0F84

Registers: eax=078a12b0 ebx=7ffde000 ecx=0856d24c edx=078a12b0 esi=0018fbe4 edi=0018fce0 eip=00814e02 esp=0018fbe4 ebp=0018fce0 iopl=0 nv up ei pl nz na pe nc cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202

Stacktrace:

0 0x00814E02 in party_send_xy_timer+0x112 (tid=(int)18, tick=(long long)1645604315, id=(int), data=(int)) at c:\users\administrator\desktop\loki server\src\map\party.c:885

885 if( !sd || sd->bg_id ) continue; sd = (struct map_sessiondata)0x078A12B0 i = (int) iter = (struct DBIterator_)0x05C4F6BC bytes:69146100CC566100C7566100255861003F02610067026100C6346100 p = (struct party_data*)0x0856D24C bytes:D001000066656664660000000000000000000000000000000000000001000000000000002C851E00B24A02004C656F6E6172646F00000000000000000000000000000000A20F2D003F00CCCCCFCCCCCC00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000B0128A07AE20000099005600000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

1 0x0065657D in do_timer+0x54D (tick=(long long)1645604330) at c:\users\administrator\desktop\loki server\src\common\timer.c:398

398 timer_data[tid].func(tid, timer_data[tid].tick, timer_data[tid].id, timer_data[tid].data); tid = (int)18 diff = (long long)-15

2 0x006292DC in main+0x24C (argc=(int)1, argv=(char**)0x0115FFA8) at c:\users\administrator\desktop\loki server\src\common\core.c:444

444 int next = timer->perform(timer->gettick_nocache()); next = (int)50 retval = (int)

3 0x009CDC8C in __tmainCRTStartup+0x10C () at f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c:240

mainret = (int)<no data>
managedapp = (int)<no data>
initret = (int)<no data>

4 0x009CDE6D in mainCRTStartup+0xD () at f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c:164

5 0x753C8543 in BaseThreadInitThunk+0xE ()

!!failed enumerating symbols!!#6 0x7776AC69 in RtlInitializeExceptionChain+0x85 () !!failed enumerating symbols!!#7 0x7776AC3C in RtlInitializeExceptionChain+0x58 () !!failed enumerating symbols!!

Loaded modules: 0x00400000 C:\Users\Administrator\Desktop\Loki Server\map-server.exe (0.0.0.0, 12587008 bytes) 0x77710000 C:\Windows\System32\ntdll.dll (6.2.9200.16384, 1404928 bytes) 0x753A0000 C:\Windows\System32\kernel32.dll (6.2.9200.16384, 1245184 bytes) 0x75C30000 C:\Windows\System32\KERNELBASE.dll (6.2.9200.16384, 679936 bytes) 0x75350000 C:\Windows\System32\ws2_32.dll (6.2.9200.16384, 327680 bytes) 0x10000000 C:\Users\Administrator\Desktop\Loki Server\libmysql.dll (0.0.0.0, 1462272 bytes) 0x00020000 C:\Users\Administrator\Desktop\Loki Server\zlib1.dll (1.2.7.0, 81920 bytes) 0x00240000 C:\Users\Administrator\Desktop\Loki Server\pcre3.dll (8.30.910.0, 159744 bytes) 0x760D0000 C:\Windows\System32\user32.dll (6.2.9200.16384, 1138688 bytes) 0x75DC0000 C:\Windows\System32\rpcrt4.dll (6.2.9200.16384, 704512 bytes) 0x75A90000 C:\Windows\System32\nsi.dll (6.2.9200.16384, 32768 bytes) 0x74100000 C:\Windows\System32\wsock32.dll (6.2.9200.16384, 32768 bytes) 0x756E0000 C:\Windows\System32\advapi32.dll (6.2.9200.16384, 712704 bytes) 0x75590000 C:\Windows\System32\msvcrt.dll (7.0.9200.16384, 724992 bytes) 0x75EB0000 C:\Windows\System32\gdi32.dll (6.2.9200.16384, 1036288 bytes) 0x74DB0000 C:\Windows\System32\sspicli.dll (6.2.9200.16384, 114688 bytes) 0x75E70000 C:\Windows\System32\sechost.dll (6.2.9200.16384, 212992 bytes) 0x74DA0000 C:\Windows\System32\CRYPTBASE.dll (6.2.9200.16384, 36864 bytes) 0x74D40000 C:\Windows\System32\bcryptPrimitives.dll (6.2.9200.16384, 331776 bytes) 0x73A90000 C:\Windows\System32\cryptsp.dll (6.2.9200.16384, 106496 bytes) 0x73A50000 C:\Windows\System32\rsaenh.dll (6.2.9200.16384, 253952 bytes) 0x75550000 C:\Windows\System32\imm32.dll (6.2.9200.16384, 131072 bytes) 0x75CE0000 C:\Windows\System32\msctf.dll (6.2.9200.16384, 901120 bytes) 0x749D0000 C:\Windows\System32\NapiNSP.dll (6.2.9200.16384, 65536 bytes) 0x749C0000 C:\Windows\System32\nlaapi.dll (6.2.9200.16384, 65536 bytes) 0x73B10000 C:\Windows\System32\mswsock.dll (6.2.9200.16384, 303104 bytes) 0x744A0000 C:\Windows\System32\dnsapi.dll (6.2.9200.16384, 479232 bytes) 0x74930000 C:\Windows\System32\winrnr.dll (6.2.9200.16384, 36864 bytes) 0x73CA0000 C:\Windows\System32\IPHLPAPI.DLL (6.2.9200.16384, 139264 bytes) 0x73C90000 C:\Windows\System32\winnsi.dll (6.2.9200.16384, 32768 bytes) 0x74770000 C:\Windows\System32\FWPUCLNT.DLL (6.2.9200.16384, 258048 bytes) 0x74760000 C:\Windows\System32\rasadhlp.dll (6.2.9200.16384, 28672 bytes) 0x74750000 C:\Users\Administrator\Desktop\Loki Server\plugins\dbghelpplug.dll (0.0.0.0, 57344 bytes) 0x72C80000 C:\Windows\System32\MSVCR110D.dll (11.0.50727.1, 1683456 bytes) 0x74350000 C:\Users\Administrator\Desktop\Loki Server\dbghelp.dll (6.12.2.633, 1314816 bytes) 0x74310000 C:\Windows\System32\powrprof.dll (6.2.9200.16384, 258048 bytes)

belfallen commented 9 years ago

@MishimaHaruna call for help:(

MishimaHaruna commented 8 years ago

Seems to be https://eathena.ws/forum/tracker/issue-5142-party-send-xy-timer-dangling-sd-pointer-crash/

MishimaHaruna commented 8 years ago

Since the eAthena forum seems to go down every now and then, I'll copy here the contents of the page I linked:


party_send_xy_timer dangling sd pointer crash

Posted by theultramage on 02 September 2015 - 03:36 AM

There is a very old unsolved crash bug in the mapserver that manifests itself as an occasional read from invalid memory in party_send_xy_timer() after dereferencing a dangling 'sd' pointer.

As a speed improvement, the party data structure holds, for each party member, a 'sd' pointer for fast access to the members' data. This avoids repeadted charid2sd lookups, especially when doing processing on all members. But, as with all caches, failure to keep perfect sync with the original data leads to memory access violations and/or memory corruption. Such as is the case here.

In r13169 (2008) [Hercules git ref: 2a95f21a791420a0bd34c84a24c54f5a524ecccf], trying to mitigate all the crashing that was going on, I patched party_send_xy_timer and guild_send_xy_timer_sub (which has the same problem) to ignore the cache and look up the pointer properly. This stopped the crashing, but the core issue was never fixed or even investigated.

This problem has propagated into all the forks. Both rAthena and Hercules revision logs show that in 2011, 'epoque' and 'shennetsind' undid both these patches to 'improve performance'. Clearly without bothering to look up why those patches are there, because I can find recent 'party_send_xy_timer' hits on google in both their support forums.

Posted by theultramage on 02 September 2015 - 03:59 AM

The session handling code is really convoluted, but some amount of code review shows the following code paths might be involved:

do_timer -> party_send_xy_timer(battle_config.party_update_interval=1000ms)
clif_parse -> clif_quitsave -> map_quit ->
-> chrif_save ->
-> -> chrif_auth_logout -> chrif_sd_to_auth -> "node->sd = sd;"
-> -> send(0x2b01) -> ... -> chrif_save_ack -> chrif_auth_delete -> "aFree(node->sd);"
-> unit_free_pc ->
-> -> unit_free -> party_send_logout -> intif_party_changemap -> ... -> intif_parse_PartyMove -> party_recv_movemap ->" p->data[i].sd = party_sd_check(party_id, account_id, char_id);"

Here '...' denotes an asynchronous wait for the charserver. Testing shows that 1) packets 0x2b01 (char save) and 0x3025 (party changemap) are sent in sequence, but are sent, received and processed independently. 2) the player 'sd' is indeed freed before the party cache is refreshed.

My hypothesis so far is that once in a while, something (buffers, TCP/IP) decides to not deliver the charserver's responses together and in full. This would mean that the mapserver frees the 'sd', and then gets a chance to do one core loop and process timers. The party xy update timer happens to be one of them, and everything explodes. Though, there is various special case handling going on on these code paths, so maybe the actual sequence of events is simpler than that.

A possible fix would be to invalidate the cache the moment the session gets marked as !active and handed off to the auth_db.


To be noted that, unlike described there, some research shows that p->data[i] seems to get invalidated in party_send_logout. Further research is needed.

4144 commented 8 years ago

This is probably incomplete reverse order call tree to chrif_auth_delete and my comments. If 0xNNNN present, this mean inter server message

chrif_auth_delete
    chrif_auth_finished
        pc_reg_received
            intif_parse_Registers 0x3804
                ... probably here safe?
    chrif_save_ack 0x2b21
        char_save_character_ack
            char_parse_frommap_save_character 0x2b01, flag != 0
                chrif_save
                    chrif_reconnect (ST_LOGOUT)
                        on map server start, safe
                    auth_db_cleanup_sub (node->state == ST_LOGOUT)
                        auth_db_cleanup
                            works by timer, here can be issue?
                    storage_storage_quit flag != 0
                        guild_broken (safe here)
                    pc_setpos m < 0 (safe here)
                    map_quit (safe here)
    chrif_changemapserverack 0x2b06
        char_change_map_server_ack
            char_parse_frommap_change_map_server 0x2b05
                chrif_changemapserver
                    chrif_reconnect
                        see previous
                    pc_setpos (safe here)
    chrif_reconnect
        chrif_on_ready (safe here)
    chrif_authok 0x2afd
        char_map_auth_ok
            char_parse_frommap_auth_request 0x2b26
                chrif_authreq
                    pc_autotrade_load
                        chrif_on_ready (safe here)
                    pc_autotrade_prepare
                        @autotrade
                    clif_parse_WantToConnection
                        called from client only if sd is NULL
    chrif_authfail 0x2b27
        char_map_auth_failed
            char_parse_frommap_auth_request (same with previous)
    auth_db_cleanup_sub
        same path with chrif_save_ack
    chrif_disconnectplayer 0x2b1f
        mapif_disconnectplayer
            char_set_char_online
                see previous
                char_parse_frommap_set_char_online 0x2b19
                    chrif_char_online
                        unused?
                char_parse_frommap_auth_request 0x2b26
                    see previous
                char_parse_char_select 0x66 (from client)
            char_db_kickoffline
                char_set_all_offline
                    char_parse_frommap_set_all_offline 0x2b18
                        chrif_char_reset_offline (on server exit, safe here)
            char_auth_ok
                char_parse_fromlogin_auth_state 0x2713
                    login_fromchar_auth_ack
                        login_fromchar_parse_auth 0x2712
                            loginif_auth
                                char_parse_char_connect 0x65 (from client)
                char_parse_char_connect 0x65 (from client)
            char_parse_fromlogin_kick 0x2734
                login_kick
                    login_auth_ok
                        login_parse_client_login 0x64, and other (from client)
            char_parse_frommap_set_users 0x2afe, 0x2aff
                send_usercount_tochar
                    by timer, can be issue here?
                send_users_tochar
                    chrif_on_ready
                        on server startup, safe here
    pc_autotrade_prepare
        safe here
    map_quit in some condition

Probably possible issue in chain functions: auth_db_cleanup clif_parse_WantToConnection char_parse_char_select char_parse_char_connect login_parse_client_login send_usercount_tochar and may be map_quit because some condition at first lines

From eathena bug, look like issue in function auth_db_cleanup

But need investigation is they can be really trigger this issue

MishimaHaruna commented 8 years ago

Some correlation between the two crashes (this and the similar one described in #861): In both cases, the struct party_data shown in the backtraces, had one party member (in both cases marked as online)

screen shot 2015-11-20 at 12 25 36

screen shot 2015-11-20 at 12 26 14

4144 commented 8 years ago

@MishimaHaruna you can reproduce crash? how?

MishimaHaruna commented 8 years ago

Unfortunately, I still can't :(

Those are just taken from the hex dumps the user provided in the ticket, I analyzed them to look for correlations...