FirebirdSQL / firebird

Firebird server, client and tools
https://www.firebirdsql.org/
1.24k stars 215 forks source link

Dataloss after sweep #7182

Open wurdak314 opened 2 years ago

wurdak314 commented 2 years ago

This is the 2nd time I have lost data in a database. First in version 3.0.7 and after an update also in version 3.0.9 (gfix version WI-V3.0.9.33560 Firebird 3.0) (Windows x64 - version on Windows Server 2016 Datacenter in azure)

In a table, around 10 out of 15,000 data records suddenly disappeared. Since this table is in a master-detail relationship and the related detail records still exist, this shouldn't be possible.

Master table: MATERIAL with primary key MANUMM Detail table: MATLAGER with the foreign key MLAMANUMM

ALTER TABLE MATLAGER ADD FOREIGN KEY (MLAMANUMM) REFERENCES MATERIAL (MANUMM) ON DELETE CASCADE; with the following query I found all lost records:

Select distinct MLAMANUM
from MATLAGER
where not exists (Select MANUMM from MATERIAL where MANUMM=MLAMANUMM)

In a correct database this query shouldn´t return any records if the foreign key is active.

I was able to repair the database again by adding the MATERIAL records again. The system did not crash, but it was noticeable that both data losses happened shortly after an automatic sweep (interval 20000). The sweeps in the log file ran without errors: Here are the entries from the log file:

DLO1S006 Thu Mar 03 03:45:02 2022
Sweep is started by SWEEPER
Database "karodb"
OIT 2233911, OAT 2296458, OST 2296458, Next 2296459

DLO1S006 Thu Mar 03 03:56:03 2022
Sweep is finished
Database "karodb"
OIT 2296457, OAT 2296458, OST 2296458, Next 2296461

.....

DLO1S006 Thu May 12 13:22:13 2022
Sweep is started by SWEEPER
Database "c:\karo3\db\karo.db"
OIT 1174084, OAT 1200037, OST 1200037, Next 1203688

DLO1S006 Thu May 12 13:26:51 2022
Sweep is finished
Database "c:\karo3\db\karo.db"
OIT 1200036, OAT 1200037, OST 1200037, Next 1203832

I have temporarily set the sweepinterval to 0 and will carry out the sweep manually in the future. Whats else can I do?

hvlad commented 2 years ago

Run validation on regular basis (gfix -val -full) and check for messages at firebird.log put by validation

wurdak314 commented 2 years ago

Thank you for the tip. I have tried the gfix validation on the saved database after the first data loss. gfix gives only a warning:

Summary of validation errors
        Number of record level warnings : 1

I didn´ t find much information about this warning message.

Do you think its possible that the sweep causes the data loss?

hvlad commented 2 years ago

More information is in firebird.log. Warnings are harmless, don't worry. I never seen data loss because of sweep. So far there is not enough information to make any conclusion.

wurdak314 commented 2 years ago

There are no special entries in the firebird.log. Only the Sweep messages as you see in my first post and connection aborts like

INET/inet_error: read errno = 10054, client host = xxxx, address = x.x.x.x/60754, user = xxx

livius2 commented 2 years ago

Not possible. When you saw "Number of record level warnings : 1" than it must be relevant message in the firebird.log

wurdak314 commented 2 years ago

strange but possible. I renamed the firebird.log file to start a fresh one and now gfix writes into the log. I checked older DBs with this results:

Summary of validation errors Number of pointer page warnings : 986

log file: ...... DLO1S006 Wed May 18 06:51:40 2022 Database: C:\KARO3\DB\KARO.DB_12_5_AMABEND Warning: Pointer page 32545 {sequence 2} bits {0x08 secondary} are not consistent with data page 66869 {sequence 2349} state {0x09 full, secondary} in table MATERIAL (186)

DLO1S006 Wed May 18 06:51:40 2022 Database: C:\KARO3\DB\KARO.DB_12_5_AMABEND Warning: Pointer page 32545 {sequence 2} bits {0x05 full, swept} are not consistent with data page 603048 {sequence 2352} state {0x01 full} in table MATERIAL (186)

DLO1S006 Wed May 18 06:51:40 2022 Database: C:\KARO3\DB\KARO.DB_12_5_AMABEND Warning: Pointer page 32545 {sequence 2} bits {0x05 full, swept} are not consistent with data page 603050 {sequence 2354} state {0x01 full} in table MATERIAL (186)

DLO1S006 Wed May 18 06:51:40 2022 Database: C:\KARO3\DB\KARO.DB_12_5_AMABEND Warning: Pointer page 32545 {sequence 2} bits {0x05 full, swept} are not consistent with data page 603052 {sequence 2356} state {0x01 full} in table MATERIAL (186)

DLO1S006 Wed May 18 06:51:52 2022 Database: C:\KARO3\DB\KARO.DB_12_5_AMABEND Validation finished: 0 errors, 986 warnings, 986 fixed

and in another file Summary of validation errors Number of record level warnings : 1

DLO1S006 Wed May 18 06:55:46 2022 Database: C:\KARO3\DB\KARO_KAPUTT030322.DB Validation started

DLO1S006 Wed May 18 07:01:57 2022 Database: C:\KARO3\DB\KARO_KAPUTT030322.DB Warning: Relation has 40 orphan backversions {0 in use} in table MATERIAL (186)

DLO1S006 Wed May 18 07:02:11 2022 Database: C:\KARO3\DB\KARO_KAPUTT030322.DB Validation finished: 0 errors, 1 warnings, 0 fixed

hvlad commented 2 years ago

A lot of warnings on pointer pages is not good. Validation fixed it all, btw. It happens sooner of all after crash (or kill) of server process. To be safe, I recommend to run validation immediately before the manual sweep. And keep an eye on validation results.

wurdak314 commented 2 years ago

I have now started a check with gfix again and then carried out a manual sweep. The data is OK. However, there were pointer page warnings again. Always only in the table MATERIAL . What can be the reason for this? Is there anything special about this table? What can I learn from the log file entries?

DLO1S006    Fri May 20 22:21:15 2022
    Database: karodb
    Warning: Pointer page 505 {sequence 0} bits {0x05 full, swept} are not consistent with data page 66553 {sequence 513} state {0x01 full} in table MATERIAL (186)
hvlad commented 2 years ago

Usually such inconsistencies related with not correct stop of database without flushing of page cache. Check also forced writes setting on that db. I'll re-check careful writes logic around swept flag, probably we have a problem here. In any case - I can't imagine reason for such inconsistency in normal\correct operational environment. Of course it doesn't means we will not try to find and fix the possible bug.