Firebird 2.1 would not always flush generator values [CORE3904]

firebird-automations commented 12 years ago

Submitted by: Ivan Arabadzhiev (intelrullz)

I have this database which got corrupted for the 3rd time in a row in exactly the same way: after a power failure (faulty UPS battery, a curious kid near the UPS and so on ...) generators are at values which seem to be 2-3 weeks old. The machine is almost never shut down, so I figure the values I see are from the last proper shutdown. I believed the issue to be hardware related, but I replaced the PC (the first time) and disabled hard drive caching (the second time). No other data seems to be missing and the database is consistent (it backs up and restores with no errors). The interesting part is that 2 or 3 of the triggers (out of the 30 I checked) had correct values.

firebird-automations commented 12 years ago

Commented by: @dyemanov

Did you check the database consistency with gfix? If so, what was its report?
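For readers following along, a minimal sketch of how such a check could be scripted. It assumes `gfix` is on the PATH and uses its documented `-v -full` validation switches, with `-n` so that validation does not update the database; the helper names, database path and credentials are placeholders, not anything from this case.

```python
import subprocess


def gfix_validate_command(database, user, password, read_only=True):
    """Build the argument list for a full gfix validation run."""
    cmd = ["gfix", "-v", "-full"]
    if read_only:
        # -n (no_update): report problems without trying to fix them
        cmd.append("-n")
    cmd += ["-user", user, "-password", password, database]
    return cmd


def run_validation(database, user, password):
    """Run gfix and return (exit code, combined output) for inspection."""
    result = subprocess.run(
        gfix_validate_command(database, user, password),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr
```

Validation is best run with no other connections to the database, ideally on a copy.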

firebird-automations commented 12 years ago

Commented by: Sean Leyne (seanleyne)

Also check the "Forced Write" setting for the database. With Windows, the setting should ALWAYS be ON.
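A rough sketch of how the setting could be checked from a script, assuming the header report of `gstat -h` lists `force write` on its `Attributes` line when Forced Writes is on, and that `gfix -write sync` switches it on. The function names are illustrative only.

```python
def forced_writes_enabled(gstat_header_output):
    """Return True if the 'Attributes' line of `gstat -h` output mentions force write."""
    for line in gstat_header_output.splitlines():
        stripped = line.strip().lower()
        if stripped.startswith("attributes"):
            return "force write" in stripped
    # No Attributes line found: treat as "not confirmed"
    return False


def enable_forced_writes_command(database, user, password):
    """Build the gfix command that turns Forced Writes on (synchronous writes)."""
    return ["gfix", "-write", "sync", "-user", user, "-password", password, database]
```

The text passed to `forced_writes_enabled` would come from running `gstat -h` against the database file.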

firebird-automations commented 12 years ago

Commented by: Ivan Arabadzhiev (intelrullz)

As for the consistency check - unfortunately no (which I do believe to be somewhat idiotic, but I'll explain in a second). As I stated in the environment, FW are ON (database load is not high enough for the performance penalty to be a problem). Perhaps I should mention that the housekeeping interval is set to 50000 and the page size is 16384.

Since I got a second call for the same database, here's a quick timeline of what happened: 5 pm - Server has just been reset (UPS powered off by accident ...). Firebird does not complain about the database, but the generators are out of date. While I fix the generators, I backup/restore the database to a temporary location. Neither operation produces errors, so I leave the users to work with the original FDB. I figured that since I had fixed the generators there was no need to wait an extra 15 minutes for the B/R ...

5:30 pm - the backup service manages to create a successful backup from the FDB in question (generators fixed, no other obvious issues)

6 pm - All hell breaks loose. Writing to the database is impossible due to checksum errors. gfix reports about 600 page-level errors (mostly in indexes). Some data from the readable part is also missing (a few hundred rows from the most active tables). Judging by the IDs of the missing records, the missing data was added somewhere around the last successful flush of the generators. I guess the database was inconsistent and it broke when extra pages needed to be allocated, but I have no copy of the broken FDB to reproduce ...

The backup from 5:30 pm (which was done on the FDB that survived the power failure) restored with no issues at all and the generators were intact. No data up to 5:30 pm was missing from it.
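The "fix the generators" step in the timeline above can be sketched roughly as follows. This assumes each generator feeds one ID column, that its current value was read with `SELECT GEN_ID(<generator>, 0) FROM RDB$DATABASE`, and that the highest stored ID came from `SELECT MAX(<column>) FROM <table>`. The generator names below are made up for illustration.

```python
def stale_generator_fixes(generators):
    """Return SET GENERATOR statements for generators that lag behind their table.

    `generators` is an iterable of (generator_name, current_value, max_id_in_table);
    max_id_in_table is None for an empty table.
    """
    statements = []
    for name, current, max_id in generators:
        # A generator lower than the highest stored ID would hand out duplicates
        if max_id is not None and current < max_id:
            statements.append(f"SET GENERATOR {name} TO {max_id};")
    return statements


if __name__ == "__main__":
    observed = [
        ("GEN_ORDERS_ID", 1200, 1850),  # stale: rolled back to an old value
        ("GEN_LOG_ID", 90, 90),         # already consistent
        ("GEN_EMPTY", 0, None),         # empty table, nothing to compare
    ]
    for statement in stale_generator_fixes(observed):
        print(statement)
```

Only generators that are behind are touched; a generator that is ahead of its table is harmless and is left alone.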

firebird-automations commented 12 years ago

Commented by: Sean Leyne (seanleyne)

Ivan,

Sorry about the "Forced Write" question, missed it in the case details.

The reason that the consistency check is not reporting a problem is because the generator pages do not have record versions like other structures, so unless the page is completely munged the check will not report an error.

As to the root cause of the problem, I am _wondering_ if you are running into a problem with the SSD's cache (which is different from the OS cache).

Some consumer SSDs have on-board cache to improve performance. For some SSDs this cache applies to Write operations. The problem is that some SSDs are missing cache protection to ensure that any pending writes are correctly written to disk (usually involves a super-capacitor which provides power to the SSD for just long enough for the writes to complete). You might want to check whether this applies to the SSD.

Could you replace the SSD with an HDD for a brief period, for testing purposes?

firebird-automations commented 12 years ago

Commented by: Ivan Arabadzhiev (intelrullz)

Well, the second time the database broke, I also figured something was wrong with caching on the drive (so I disabled the write cache option in Device Manager). The 3 failures are 2-3 weeks apart each, so it would take some time to put the database on an HDD (plus I'd have to sabotage work intentionally, since I have no problem with a clean shutdown). Are you aware of any way to disable the SSD cache in question? I can sacrifice the performance since I have 3 active users at most (which is the main reason the 'server' is running Windows ...)

I filed a bug report because the generator values are vastly wrong (they corresponded to records added on 24.07.2012). I'm guessing that if there were a caching issue, things would get flushed at some point and I would end up with generator values from, say, yesterday. And, which is even weirder, all the other data in the database is OK. The problem only seems to affect generators (and probably other metadata). It did, after all, work for about an hour after the failure, and at least a few hundred small transactions were made before the 'side effects' came. Though, to be honest, I have been wrong before.

firebird-automations commented 12 years ago

Commented by: Sean Leyne (seanleyne)

The cache I am referring to is unrelated to/different from the OS drive caching settings.

I am not aware of how to control SSD internal cache settings. I suspect that it may not be possible, and that another type/model/brand of SSD may be required.

Creating the case/issue was OK; it may turn out to be a Firebird problem (generator pages not getting flushed as actively as data pages???). But given that the basic engine has been working on a million+ servers for the last 15 years, I think it is *very doubtful*.

firebird-automations commented 12 years ago

Commented by: Sean Leyne (seanleyne)

A citation from Wikipedia:

Battery or super capacitor: Another component in higher performing SSDs is a capacitor or some form of battery. These are necessary to maintain data integrity such that the data in the cache can be flushed to the drive when power is dropped; some may even hold power long enough to maintain data in the cache until power is resumed. In the case of MLC flash memory, a problem called lower page corruption can occur when MLC flash memory loses power while programming an upper page. The result is that data written previously and presumed safe can be corrupted if the memory is not supported by a super capacitor in the event of a sudden power loss. This problem does not exist with SLC flash memory. Most consumer-class SSDs do not have built-in batteries or capacitors; among the exceptions are the Intel 320 series and the more expensive Intel 710 series. http://en.wikipedia.org/wiki/Solid-state_drive

firebird-automations commented 12 years ago

Commented by: Ivan Arabadzhiev (intelrullz)

Doubtful, indeed. I have been using Firebird basically from the first official release, and such issues have up to now been rare enough to attribute to hardware failures.

So your point is that generator pages are not moved too often and writing an upper page could lead to reading a lower one and thus older generators (I'm not really sure how wear leveling fits into all this)? But if that is the case, shouldn't I have some generators from yesterday, some from last week, some from the very first day, some going ahead (random junk, interpreted as a huge 64-bit int) and so on?

A little off-topic - I did use the same SSD on LinuxFromScratch for a few months. It worked OK right up to the moment it just died out of nowhere. I could've just been lucky. I still have my doubts about the Windows platform, though ...
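To make the "random junk, interpreted as a huge 64 bit int" remark concrete, here is a small sketch of what arbitrary bytes look like when read as a signed 64-bit integer. The little-endian layout is an assumption chosen for the demonstration, not a statement about Firebird's on-disk generator page format.

```python
import struct


def as_signed_int64(raw):
    """Interpret exactly 8 bytes as a little-endian signed 64-bit integer."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack("<q", raw)[0]


if __name__ == "__main__":
    samples = [
        (1850).to_bytes(8, "little", signed=True),  # a plausible generator value
        bytes.fromhex("ffffffffffffffff"),          # all bits set
        bytes.fromhex("de ad be ef 13 37 c0 de"),   # arbitrary junk
    ]
    for raw in samples:
        print(raw.hex(), "->", as_signed_int64(raw))
```

Junk in the high-order bytes yields values far outside any realistic ID range, which is why generators that are uniformly "a few weeks old" look more like stale pages than like random corruption.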

firebird-automations commented 12 years ago

Commented by: @AlexPeshkoff

Ivan, this is almost certainly not a Firebird issue, especially taking into account that not only generators were damaged, but also data pages and a lot of index pages. The latter is especially typical of a cache that was not flushed. An index is a B-tree, so having some of its related pages written and others not is easily detected by gfix (unlike data and generators). The worst part for you is that most likely a lot of data pages are also NOT flushed when a power failure happens, but you do not notice it at once. The reason why generators are not flushed may be very simple: they are used (and therefore modified) very often, and the cache logic takes into account only the last time a page was modified. Foolish behavior for a write cache, but... As for turning off the cache in Windows, only hell knows what the SSD's driver does on that command, especially if it was written by the same people who developed that fine cache logic.

PS. I understand that any shit may happen (up to CPU/MB failure when even the best UPS can't help), but it looks like you have power failures too often. Maybe it's time to solve that problem?
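Purely as an illustration of the hypothesis above, here is a toy model of a write-back cache that flushes a page only after it has been idle for a while. It is not a description of any real SSD firmware; the class, page names and thresholds are invented for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class IdleFlushCache:
    """Write-back cache that flushes a dirty page only once it has gone idle."""
    idle_threshold: int
    disk: dict = field(default_factory=dict)   # page -> value safely on disk
    dirty: dict = field(default_factory=dict)  # page -> (value, last modified tick)

    def write(self, page, value, tick):
        # Every write refreshes the page's "last modified" time
        self.dirty[page] = (value, tick)

    def tick(self, now):
        # Flush only pages that have not been modified for idle_threshold ticks
        for page, (value, modified) in list(self.dirty.items()):
            if now - modified >= self.idle_threshold:
                self.disk[page] = value
                del self.dirty[page]

    def power_loss(self):
        # Everything still in the cache is lost; only the disk contents survive
        self.dirty.clear()
        return dict(self.disk)


def run_demo(ticks=100, idle_threshold=3):
    cache = IdleFlushCache(idle_threshold=idle_threshold)
    cache.disk["generator"] = 0               # value from the last clean shutdown
    for now in range(1, ticks + 1):
        cache.write("generator", now, now)    # hot page: modified on every tick
        if now % 10 == 0:
            cache.write(f"data-{now}", now, now)  # cold page: written once
        cache.tick(now)
    return cache.power_loss()


if __name__ == "__main__":
    survived = run_demo()
    print("generator on disk:", survived["generator"])
    print("data pages on disk:", sorted(k for k in survived if k.startswith("data-")))
```

In this model the hot page never becomes idle, so it is never flushed and the disk keeps its value from the last clean shutdown, while most cold pages reach the disk shortly after being written.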

firebird-automations commented 12 years ago

Commented by: Ivan Arabadzhiev (intelrullz)

Well, the "cache logic is designed by idiots" argument is my one true soft spot (the major reason I got into programming was poorly written software). So it's poor hardware (or software) caching and cannot be helped by Firebird itself? A shame, really ...

As for the power failures - most of my clients have 3-5 computers and no dedicated server room/rack. The database in question is on a 'bigger and better' workstation and the UPS is right next to it. I can't really stop people from poking if they don't take the database seriously enough :( The best solution I came up with was to write (yet another) backup tool, which seems stable enough for the moment.

Thank you both for the answers. You've been very helpful.
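The backup tool mentioned above is not shown in this thread. As a loose sketch of the idea only, a scheduled job could build a timestamped `gbak -b` command along these lines; the function name, file naming scheme and credentials are assumptions.

```python
from datetime import datetime


def gbak_backup_command(database, backup_stem, user, password, now=None):
    """Build a gbak backup command that writes to a timestamped .fbk file."""
    now = now or datetime.now()
    target = f"{backup_stem}-{now:%Y%m%d-%H%M%S}.fbk"
    # gbak takes its switches first, then the source database, then the backup file
    return ["gbak", "-b", "-user", user, "-password", password, database, target]
```

Restoring each backup to a scratch location now and then, as was done at 5 pm in the timeline above, is what shows whether a backup is actually usable.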

firebird-automations commented 12 years ago

Commented by: @AlexPeshkoff

Just one thought - do 5 computers produce such a load that SSD is really needed?

firebird-automations commented 12 years ago

Commented by: Ivan Arabadzhiev (intelrullz)

Well, that is a question I've been asking the guys who write the software for some time now. Still, the major product I support at the moment tends to get a bit sluggish after a few years. When I thought of ways to improve performance, an SSD seemed like a better choice than turning forced writes off. The database is not big, but some tables in it are updated like hell (and are a few million rows each ...).

I did run performance tests on a bigger client and the SSD did way better than the 15k rpm SAS drives, so I figured it's not such a bad idea. The Corsair Force 3 drive is 3-5 times faster in some cases. On the other hand, going from an i3 to an i5/E3 CPU produces no measurable improvement, so it's not nearly as cost-efficient an upgrade. A single SSD is actually cheaper than a SAS RAID array and I don't really need the extra storage space. I do keep an HDD for backups everywhere though :)

I do think optimizing the database structure/logic is a better way to go, but making big changes in another team's software is not an easy task ...

firebird-automations commented 12 years ago

Commented by: Sean Leyne (seanleyne)

Ivan,

Your comments suggest that you believe that the problem is not with Firebird but lies elsewhere. Accordingly, I believe that this case can be closed as "Won't fix".

firebird-automations commented 12 years ago

Commented by: Ivan Arabadzhiev (intelrullz)

Sean, the discussion has been very helpful and I believe there is a more generic issue at hand now. I'm not really sure Firebird can actually counteract it in any way. I'm also not sure if the issue is relevant on newer platforms (Windows 7/8), so I guess it can be closed. Thank you again for all your help; it has given me stuff to think about.

firebird-automations commented 12 years ago

Commented by: Sean Leyne (seanleyne)

If further details are available that suggest ways that Firebird can improve the current functionality, then the case can be re-opened.

firebird-automations commented 12 years ago

Modified by: Sean Leyne (seanleyne)

status: Open \[ 1 \] =\> Resolved \[ 5 \] resolution: Won't Fix \[ 2 \]

firebird-automations commented 12 years ago

Modified by: @pcisar

status: Resolved \[ 5 \] =\> Closed \[ 6 \]
