DarkstarProject / darkstar

DEPRECATED - FFXI Server Emulator - See Project Topaz
https://github.com/project-topaz/topaz
GNU General Public License v3.0
Recast/zoneentities/zoneserver #668

Closed. Warp-ass closed 9 years ago

Warp-ass commented 9 years ago

crashing extremely often on the current version. while debugging, VS first points to a vector, and then points to the recast container as the next step after it returns from the current function. my guess is that it is crashing in GetRecastList, on this line:

RecastList_t* PRecastList = GetRecastList((RECASTTYPE)type);

because it says the next step will be the next line:

for (uint16 i = 0; i < PRecastList->size(); ++i)

When I follow it back in the call stack it points to zone_entities and zoneserver. I mention this because prior to getting the current version running, we were having a number of issues with zone_settings.sql. I am still trying to debug these crashes and create dumps for them. Many of the crashes break in different places, but the dumps all show that the call stack includes zoneserver. I'm still working hard to pinpoint this, but if anyone has any input, it would be greatly appreciated.

Warp-ass commented 9 years ago

update: while almost every crash was originally on the recast, now it seems to be coming very often from places that cannot be seen directly in the call stack. looking at the map's log file, i see this for the most recent crash:

DB error - Lost connection to MySQL server during query
[Fatal Error] zoneutils::GetZoneIPP: Cannot find zone 110
[Debug] Message: Sent message 10 to message server
[Debug] Message: Received message 10 from message server

very often the crashes seem to coincide with someone changing zones in the log, as well as this "sent/received message # to/from server".

what does show up in the dump is that the next step would be in sql.cpp, inside int32 Sql_QueryV, which has the comment "executes a query" above it. I am doing my best to figure out what is going on here, but this is frustrating.

Warp-ass commented 9 years ago

just got a log full of "cannot load charid ##### DB Error - my sql server has gone away", but only for a large chunk of players. many others in other zones were fine, but i had to force a server restart to free the mass of stuck players.

Warp-ass commented 9 years ago

crashes have dropped in frequency from happening every 2-15 minutes to now happening up to an hour apart. not sure what changed, other than how late it is. there are more players on now than before, so my best guess is lobby server traffic or something, i don't know. i give up for today.

teschnei commented 9 years ago

there is really nothing i can do with this unless you provide a dump

it sounds like issues with your DB at any rate

takhlaq commented 9 years ago

https://www.dropbox.com/s/00z5cgmggkc2ur1/12-16-Recast.7z?dl=0

looks like a crappy char was loaded into the zone's m_charList. CZone::IncreaseZoneCounter checks the targid before adding the char to the zone's char list, so i'm assuming something went wrong after the char was added (maybe when the player was removed from the zone in CZone::DecreaseZoneCounter)

Warp-ass commented 9 years ago

kind of a shot in the dark but after looking at it with demo for a while, it looks like it might have something to do with outpost warps. many of the autos show northern sandoria/bastok mines as the zone or prevzone

teschnei commented 9 years ago

have the log too? that'd help

Warp-ass commented 9 years ago

this log should encompass it, sorry for the size, just grabbed it quickly while working

i think the log starts yesterday afternoon around 1

https://www.dropbox.com/s/w97j7h2o8l59po3/scavenger-map-server.log?dl=0

i would tell you when the crash happened but it was demo that submitted that dump so i don't actually know, presumably some time around when he posted it.

teschnei commented 9 years ago

yeah... every time it crashes, you should rename the log and start a new one so it's easy to find where it crashed
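The rename-and-restart step can be scripted. This is a minimal sketch using C++17 `std::filesystem`; the function name `archive_log`, the tag format, and the file name in the usage note are assumptions for illustration, not part of the server.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Renames `log` to "<stem>-<tag><extension>" in the same directory, then
// creates a fresh, empty file at the original path.
// Returns the path of the archived copy.
// Throws std::filesystem::filesystem_error if the rename fails, for example
// when another process still holds the file open on Windows.
fs::path archive_log(const fs::path& log, const std::string& tag) {
    fs::path archived = log;
    archived.replace_filename(log.stem().string() + "-" + tag +
                              log.extension().string());
    fs::rename(log, archived);
    std::ofstream fresh(log);  // start a new, empty log at the old path
    return archived;
}
```

Calling `archive_log("map-server.log", "crash1")` while the server is stopped would leave `map-server-crash1.log` next to a new empty `map-server.log`, so each dump can be paired with a log that ends at the crash.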

Warp-ass commented 9 years ago

will do from now on.

Warp-ass commented 9 years ago

put in your latest commits, but still randomly getting "mysql server has gone away" whenever the server manages to stay up for over an hour
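"MySQL server has gone away" generally means the client tried to use a connection the server had already dropped, commonly after an idle timeout or an oversized packet. One common mitigation is to reconnect and retry the failed query. The sketch below shows only that retry shape; `query_with_reconnect` and both callables are hypothetical hooks, not code from this project. With the MySQL C API they might wrap `mysql_real_query()` and `mysql_ping()`, but that wiring is an assumption and is not shown.

```cpp
#include <functional>

// Runs `query`; if it reports failure, calls `reconnect` and tries again,
// up to `max_attempts` tries in total. Returns true as soon as one try
// succeeds. Both callables are hypothetical hooks supplied by the caller.
bool query_with_reconnect(const std::function<bool()>& query,
                          const std::function<bool()>& reconnect,
                          int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (query()) {
            return true;
        }
        // No point reconnecting after the final failed attempt.
        if (attempt < max_attempts && !reconnect()) {
            return false;  // could not re-establish the connection
        }
    }
    return false;
}
```

Blindly retrying is only safe for queries that can be repeated without side effects; a write that may have partly succeeded needs more care than this sketch provides.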

Warp-ass commented 9 years ago

reconsidering my theories: many of the crashes are coming from cities. in the last few days 100% of our recast crashes have come from lower jeuno. checking the autos, all but one player's name will be available, and the missing one will be a mess of scrambled data for a char. checking the log, the people that I can find in lower jeuno before the crash, going by IncreaseZoneCounter, always seem to be zoning in very soon before the crash happens. I am leaning toward saying that this is happening during the process of them zoning in, or soon after. I am preparing another dump with a log that is more manageable.

bope12 commented 9 years ago

lower jeuno is the default player position i think

Warp-ass commented 9 years ago

https://www.dropbox.com/s/zivks53bgn4vhox/ScavengerRecast12-17-14.rar?dl=0

Warp-ass commented 9 years ago

could be, but some of the crashes before the past few days have been in northern sandoria, bastok markets and beaucedine glacier

teschnei commented 9 years ago

it's a char reference that got deleted but not removed from the charList, i'm still trying to track down how/why
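To illustrate the class of bug described above (an object freed while a container still points at it), here is a minimal sketch of one way to rule it out: let the list own its entries, so removal and destruction are a single step. `Char` and `ZoneCharList` are hypothetical stand-ins, not the server's actual classes, and this is not the fix that was applied to the project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a player character.
struct Char {
    std::uint16_t targid;
    std::string name;
};

// Hypothetical zone character list that owns its entries.
class ZoneCharList {
public:
    // Takes ownership of the character and indexes it by targid.
    // Returns false if that targid is already present.
    bool insert(std::unique_ptr<Char> chr) {
        assert(chr != nullptr);
        const std::uint16_t id = chr->targid;
        return m_charList.emplace(id, std::move(chr)).second;
    }

    // Erasing the map entry destroys the character in the same step, so the
    // list cannot keep a pointer to a character that has already been freed.
    bool remove(std::uint16_t targid) {
        return m_charList.erase(targid) > 0;
    }

    // Returns nullptr when the targid is not in the zone.
    const Char* find(std::uint16_t targid) const {
        auto it = m_charList.find(targid);
        return it == m_charList.end() ? nullptr : it->second.get();
    }

    std::size_t size() const { return m_charList.size(); }

private:
    std::map<std::uint16_t, std::unique_ptr<Char>> m_charList;
};
```

With raw pointers, the same guarantee depends on every code path erasing the entry before calling `delete`, which is exactly the invariant that is hard to audit across zoning and logout paths.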

Warp-ass commented 9 years ago

thanks for that. let me know if you need or want more dumps, anything i can do to help the process.

Warp-ass commented 9 years ago

also it sounds widespread to the point where this probably can't be done, but any ideas on how to at least prevent crashes by disabling something or ghetto fixing?

teschnei commented 9 years ago

no, it'd be easy to fix if i knew what the problem was

Warp-ass commented 9 years ago

updates:

  1. we have found that crashes can happen when a player changes gear immediately after logging in, presumably while still downloading data (sql.cpp query execution crash, while attempting to save character equipment)
  2. rumor: accepting party invites right after one of those involved in the invite has zoned (party.cpp line 524 return m_PartyID;)
  3. rumor: players using an npc to warp somewhere else immediately after zoning (i.e. homepoint, then immediately using an outpost warp npc) will occasionally cause the vector crash.
  4. attempts by the game to GetZoneIPP while PChar->status = STATUS_SHUTDOWN; (message.cpp line 76)

the server has a pattern of running decently stable for a while, then after a reboot can crash multiple times in a row within seconds of each one (has done it as much as 5 crashes within 60 seconds and 10 crashes within 5 minutes), which I would guess is somehow related to the situation of the first thing in this list and maybe even changing zones. Based on my best guess, I have tried telling the players to wait until Downloading Data is gone before doing any kind of input. I will try to collect dumps/logs on these as I go.
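The "wait until Downloading Data is gone" advice could in principle be enforced on the server instead of by the players. This is a minimal sketch under that assumption: `SessionState`, `Action`, `Session`, and `can_handle` are hypothetical names, and the real server tracks character status differently.

```cpp
#include <cstdint>

// Hypothetical session states for illustration only.
enum class SessionState : std::uint8_t { Loading, Ready, ShuttingDown };

// Hypothetical actions matching the scenarios listed above.
enum class Action : std::uint8_t { ChangeEquipment, AcceptPartyInvite, NpcWarp };

struct Session {
    SessionState state = SessionState::Loading;
};

// Returns true only when the session has finished loading and is not
// shutting down. Every action is gated the same way in this sketch.
bool can_handle(const Session& session, Action /*action*/) {
    return session.state == SessionState::Ready;
}
```

A packet handler would call `can_handle` first and drop or queue the request when it returns false. This would only hide the symptom if the underlying cause is a stale pointer, so it is a stopgap at best.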

Warp-ass commented 9 years ago

just had a crash coming from an elevator in metalworks, with autos that were almost exactly the same as the recast crash. first time i've seen this. i have the dump for this if it's relevant, unless it's still the same issue and this doesn't help at all.

Warp-ass commented 9 years ago

to put into perspective how bad the crashes have been today, one player has been attempting to kill behemoth for the last 3 hours and the server hasn't stayed up long enough for them to kill it once, even with my help spawning it. maybe it's getting worse, maybe today is shitty luck. sorry to sound like i'm complaining, just trying to feed you as much info as possible.

teschnei commented 9 years ago

there's a reason i made it possible to have zones on different processes, use that in the meantime

i don't have time currently to look into anything (and i can't really drop anything to make time, especially if era is still a donation server)

Warp-ass commented 9 years ago

I just wanted to provide feedback on the new version which I've heard no one else does lately. but I understand. when you find the time I will provide you with all the debug info you need. for now I will have to revert to our early november version until I figure out the zone settings you suggested.

teschnei commented 9 years ago

fixed now