Open belfallen opened 9 years ago
@MishimaHaruna call for help:(
Since the eAthena forum seems to go down every now and then, I'll copy here the contents of the page I linked:
Posted by theultramage on 02 September 2015 - 03:36 AM
There is a very old unsolved crash bug in the mapserver that manifests itself as an occasional read from invalid memory in party_send_xy_timer() after dereferencing a dangling 'sd' pointer.
As a speed improvement, the party data structure holds, for each party member, a 'sd' pointer for fast access to the members' data. This avoids repeadted charid2sd lookups, especially when doing processing on all members. But, as with all caches, failure to keep perfect sync with the original data leads to memory access violations and/or memory corruption. Such as is the case here.
In r13169 (2008) [Hercules git ref: 2a95f21a791420a0bd34c84a24c54f5a524ecccf], trying to mitigate all the crashing that was going on, I patched party_send_xy_timer and guild_send_xy_timer_sub (which has the same problem) to ignore the cache and look up the pointer properly. This stopped the crashing, but the core issue was never fixed or even investigated.
This problem has propagated into all the forks. Both rAthena and Hercules revision logs show that in 2011, 'epoque' and 'shennetsind' undid both these patches to 'improve performance'. Clearly without bothering to look up why those patches are there, because I can find recent 'party_send_xy_timer' hits on google in both their support forums.
Posted by theultramage on 02 September 2015 - 03:59 AM
The session handling code is really convoluted, but some amount of code review shows the following code paths might be involved:
do_timer -> party_send_xy_timer(battle_config.party_update_interval=1000ms)
clif_parse -> clif_quitsave -> map_quit -> -> chrif_save -> -> -> chrif_auth_logout -> chrif_sd_to_auth -> "node->sd = sd;" -> -> send(0x2b01) -> ... -> chrif_save_ack -> chrif_auth_delete -> "aFree(node->sd);" -> unit_free_pc -> -> -> unit_free -> party_send_logout -> intif_party_changemap -> ... -> intif_parse_PartyMove -> party_recv_movemap ->" p->data[i].sd = party_sd_check(party_id, account_id, char_id);"
Here '...' denotes an asynchronous wait for the charserver. Testing shows that 1) packets 0x2b01 (char save) and 0x3025 (party changemap) are sent in sequence, but are sent, received and processed independently. 2) the player 'sd' is indeed freed before the party cache is refreshed.
My hypothesis so far is that once in a while, something (buffers, TCP/IP) decides to not deliver the charserver's responses together and in full. This would mean that the mapserver frees the 'sd', and then gets a chance to do one core loop and process timers. The party xy update timer happens to be one of them, and everything explodes. Though, there is various special case handling going on on these code paths, so maybe the actual sequence of events is simpler than that.
A possible fix would be to invalidate the cache the moment the session gets marked as !active and handed off to the auth_db.
To be noted that, unlike described there, some research shows that p->data[i]
seems to get invalidated in party_send_logout
. Further research is needed.
This is probably incomplete reverse order call tree to chrif_auth_delete and my comments. If 0xNNNN present, this mean inter server message
chrif_auth_delete
chrif_auth_finished
pc_reg_received
intif_parse_Registers 0x3804
... probably here safe?
chrif_save_ack 0x2b21
char_save_character_ack
char_parse_frommap_save_character 0x2b01, flag != 0
chrif_save
chrif_reconnect (ST_LOGOUT)
on map server start, safe
auth_db_cleanup_sub (node->state == ST_LOGOUT)
auth_db_cleanup
works by timer, here can be issue?
storage_storage_quit flag != 0
guild_broken (safe here)
pc_setpos m < 0 (safe here)
map_quit (safe here)
chrif_changemapserverack 0x2b06
char_change_map_server_ack
char_parse_frommap_change_map_server 0x2b05
chrif_changemapserver
chrif_reconnect
see previous
pc_setpos (safe here)
chrif_reconnect
chrif_on_ready (safe here)
chrif_authok 0x2afd
char_map_auth_ok
char_parse_frommap_auth_request 0x2b26
chrif_authreq
pc_autotrade_load
chrif_on_ready (safe here)
pc_autotrade_prepare
@autotrade
clif_parse_WantToConnection
called from client only if sd is NULL
chrif_authfail 0x2b27
char_map_auth_failed
char_parse_frommap_auth_request (same with previous)
auth_db_cleanup_sub
same path with chrif_save_ack
chrif_disconnectplayer 0x2b1f
mapif_disconnectplayer
char_set_char_online
see previous
char_parse_frommap_set_char_online 0x2b19
chrif_char_online
unused?
char_parse_frommap_auth_request 0x2b26
see previous
char_parse_char_select 0x66 (from client)
char_db_kickoffline
char_set_all_offline
char_parse_frommap_set_all_offline 0x2b18
chrif_char_reset_offline (on server exit, safe here)
char_auth_ok
char_parse_fromlogin_auth_state 0x2713
login_fromchar_auth_ack
login_fromchar_parse_auth 0x2712
loginif_auth
char_parse_char_connect 0x65 (from client)
char_parse_char_connect 0x65 (from client)
char_parse_fromlogin_kick 0x2734
login_kick
login_auth_ok
login_parse_client_login 0x64, and other (from client)
char_parse_frommap_set_users 0x2afe, 0x2aff
send_usercount_tochar
by timer, can be issue here?
send_users_tochar
chrif_on_ready
on server startup, safe here
pc_autotrade_prepare
safe here
map_quit in some condition
Probably possible issue in chain functions: auth_db_cleanup clif_parse_WantToConnection char_parse_char_select char_parse_char_connect login_parse_client_login send_usercount_tochar and may be map_quit because some condition at first lines
From eathena bug, look like issue in function auth_db_cleanup
But need investigation is they can be really trigger this issue
Some correlation between the two crashes (this and the similar one described in #861): In both cases, the struct party_data shown in the backtraces, had one party member (in both cases marked as online)
@MishimaHaruna you can reproduce crash? how?
Unfortunately, I still can't :(
Those are just taken from the hex dumps the user provided in the ticket, I analyzed them to look for correlations...
Process information: Command line: map-server.exe Platform: Windows Server 2012 Standard (build 9200) CPU: x86_64 CPU, Family 6, Model 26, Stepping 5 Application architecture: x86 Compiler: Microsoft Visual C++ 2012 (v1700) Git: f7a97b556e05373711998d4f819cf7d89e4b0ce8
Exception: 0xc0000005 EXCEPTION_ACCESS_VIOLATION at location 0x00814E02 reading from location 0x078B0F84
Registers: eax=078a12b0 ebx=7ffde000 ecx=0856d24c edx=078a12b0 esi=0018fbe4 edi=0018fce0 eip=00814e02 esp=0018fbe4 ebp=0018fce0 iopl=0 nv up ei pl nz na pe nc cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202
Stacktrace:
0 0x00814E02 in party_send_xy_timer+0x112 (tid=(int)18, tick=(long long)1645604315, id=(int), data=(int)) at c:\users\administrator\desktop\loki server\src\map\party.c:885
885 if( !sd || sd->bg_id ) continue; sd = (struct map_sessiondata)0x078A12B0
i = (int)
iter = (struct DBIterator_)0x05C4F6BC bytes:69146100CC566100C7566100255861003F02610067026100C6346100
p = (struct party_data*)0x0856D24C bytes:D001000066656664660000000000000000000000000000000000000001000000000000002C851E00B24A02004C656F6E6172646F00000000000000000000000000000000A20F2D003F00CCCCCFCCCCCC00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000B0128A07AE20000099005600000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
1 0x0065657D in do_timer+0x54D (tick=(long long)1645604330) at c:\users\administrator\desktop\loki server\src\common\timer.c:398
398 timer_data[tid].func(tid, timer_data[tid].tick, timer_data[tid].id, timer_data[tid].data); tid = (int)18 diff = (long long)-15
2 0x006292DC in main+0x24C (argc=(int)1, argv=(char**)0x0115FFA8) at c:\users\administrator\desktop\loki server\src\common\core.c:444
444 int next = timer->perform(timer->gettick_nocache()); next = (int)50 retval = (int)
3 0x009CDC8C in __tmainCRTStartup+0x10C () at f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c:240
4 0x009CDE6D in mainCRTStartup+0xD () at f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c:164
5 0x753C8543 in BaseThreadInitThunk+0xE ()
!!failed enumerating symbols!!#6 0x7776AC69 in RtlInitializeExceptionChain+0x85 ()
!!failed enumerating symbols!!#7 0x7776AC3C in RtlInitializeExceptionChain+0x58 ()
!!failed enumerating symbols!!
Loaded modules: 0x00400000 C:\Users\Administrator\Desktop\Loki Server\map-server.exe (0.0.0.0, 12587008 bytes) 0x77710000 C:\Windows\System32\ntdll.dll (6.2.9200.16384, 1404928 bytes) 0x753A0000 C:\Windows\System32\kernel32.dll (6.2.9200.16384, 1245184 bytes) 0x75C30000 C:\Windows\System32\KERNELBASE.dll (6.2.9200.16384, 679936 bytes) 0x75350000 C:\Windows\System32\ws2_32.dll (6.2.9200.16384, 327680 bytes) 0x10000000 C:\Users\Administrator\Desktop\Loki Server\libmysql.dll (0.0.0.0, 1462272 bytes) 0x00020000 C:\Users\Administrator\Desktop\Loki Server\zlib1.dll (1.2.7.0, 81920 bytes) 0x00240000 C:\Users\Administrator\Desktop\Loki Server\pcre3.dll (8.30.910.0, 159744 bytes) 0x760D0000 C:\Windows\System32\user32.dll (6.2.9200.16384, 1138688 bytes) 0x75DC0000 C:\Windows\System32\rpcrt4.dll (6.2.9200.16384, 704512 bytes) 0x75A90000 C:\Windows\System32\nsi.dll (6.2.9200.16384, 32768 bytes) 0x74100000 C:\Windows\System32\wsock32.dll (6.2.9200.16384, 32768 bytes) 0x756E0000 C:\Windows\System32\advapi32.dll (6.2.9200.16384, 712704 bytes) 0x75590000 C:\Windows\System32\msvcrt.dll (7.0.9200.16384, 724992 bytes) 0x75EB0000 C:\Windows\System32\gdi32.dll (6.2.9200.16384, 1036288 bytes) 0x74DB0000 C:\Windows\System32\sspicli.dll (6.2.9200.16384, 114688 bytes) 0x75E70000 C:\Windows\System32\sechost.dll (6.2.9200.16384, 212992 bytes) 0x74DA0000 C:\Windows\System32\CRYPTBASE.dll (6.2.9200.16384, 36864 bytes) 0x74D40000 C:\Windows\System32\bcryptPrimitives.dll (6.2.9200.16384, 331776 bytes) 0x73A90000 C:\Windows\System32\cryptsp.dll (6.2.9200.16384, 106496 bytes) 0x73A50000 C:\Windows\System32\rsaenh.dll (6.2.9200.16384, 253952 bytes) 0x75550000 C:\Windows\System32\imm32.dll (6.2.9200.16384, 131072 bytes) 0x75CE0000 C:\Windows\System32\msctf.dll (6.2.9200.16384, 901120 bytes) 0x749D0000 C:\Windows\System32\NapiNSP.dll (6.2.9200.16384, 65536 bytes) 0x749C0000 C:\Windows\System32\nlaapi.dll (6.2.9200.16384, 65536 bytes) 0x73B10000 C:\Windows\System32\mswsock.dll (6.2.9200.16384, 303104 bytes) 0x744A0000 C:\Windows\System32\dnsapi.dll (6.2.9200.16384, 479232 bytes) 0x74930000 C:\Windows\System32\winrnr.dll (6.2.9200.16384, 36864 bytes) 0x73CA0000 C:\Windows\System32\IPHLPAPI.DLL (6.2.9200.16384, 139264 bytes) 0x73C90000 C:\Windows\System32\winnsi.dll (6.2.9200.16384, 32768 bytes) 0x74770000 C:\Windows\System32\FWPUCLNT.DLL (6.2.9200.16384, 258048 bytes) 0x74760000 C:\Windows\System32\rasadhlp.dll (6.2.9200.16384, 28672 bytes) 0x74750000 C:\Users\Administrator\Desktop\Loki Server\plugins\dbghelpplug.dll (0.0.0.0, 57344 bytes) 0x72C80000 C:\Windows\System32\MSVCR110D.dll (11.0.50727.1, 1683456 bytes) 0x74350000 C:\Users\Administrator\Desktop\Loki Server\dbghelp.dll (6.12.2.633, 1314816 bytes) 0x74310000 C:\Windows\System32\powrprof.dll (6.2.9200.16384, 258048 bytes)