caetanosauer / zero

Fork of the Shore-MT storage manager used by the research project Instant Recovery
Other
29 stars 10 forks source link

"assertion failure: p.lsn() == emlsn" in zero/src/sm/log_spr.cpp:92 #23

Closed caetanosauer closed 8 years ago

caetanosauer commented 8 years ago

From @iMax3060 on April 14, 2016 22:35

Sometimes when executing zapps kits -b tpcc --duration 30 --load with/without other parameters I get the following output:

140162908509952: /home/***/zero/src/cmd/kits/shore_env.cpp:304:set_pd: DB set to (1)
140162908509952: /home/***/zero/src/cmd/kits/shore_env.cpp:261:set_sf: New Scaling factor: 2.0
140162908509952: /home/***/zero/src/cmd/kits/shore_env.cpp:244:set_qf: New Queried Factor: 2.0
140162908509952: /home/***/zero/src/cmd/kits/shore_env.cpp:979:print_cpus: MaxCPU=(0) - ActiveCPU=(0)
140162908509952: /home/***/zero/src/cmd/kits/shore_env.cpp:778:configure_sm: Configuring Shore...
Set option db-worker-sli to 0
Set option eager to 1
Set option load to 1
Set option skew to 0
Set option sm_format to 0
Set option sm_log_delete_old_partitions to 1
Set option sm_logging to 1
Set option sm_shutdown_clean to 1
Set option sm_truncate_log to 0
Set option spread to 1
140162908509952: /home/***/zero/src/cmd/kits/shore_env.cpp:807:start_sm: Starting Shore...
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (247) [0.005] Initializing lock manager
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (254) [129.643] Initializing log manager (part 1)
Initialized curr_lsn to 1.0
Log _start 0 end_byte() 0
Log _curr_lsn 1.0 _durable_lsn 1.0
Curr epoch  base_lsn 1.0  base 0  start 0  end 0
Old epoch  base_lsn 0.0  base 0  start 0  end 0
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (261) [6.792] Initializing log manager (part 2)
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (273) [0.074] Initializing log archiver
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (282) [0.017] Initializing restart manager
[7f7a385f5700] /home/***/zero/src/sm/restart.cpp (136) Log analysis found 0 dirty pages and 0 active transactions
ACTIVE TRANSACTIONS
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (292) [0.945] Initializing volume manager
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (299) [0.057] Initializing buffer manager
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (306) [10.362] Building volume manager caches
STNODE CACHE:
last_extent: 0
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (321) [0.598] Initializing buffer cleaner and other services
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (340) [0.529] Performing offline recovery
[7f7a385f5700] /home/***/zero/src/sm/sm.cpp (372) [0.055] Finished SM initialization
[7f7791c1d700] /home/***/zero/src/sm/restart.cpp (394) Finished concurrent REDO of 0 pages
140162908509952: /home/***/zero/src/cmd/kits/shore_env.cpp:542:start: ShoreEnv initialized
140162908509952: /home/***/zero/src/cmd/kits/shore_env.cpp:548:start: Starting ()
140162908509952: /home/***/zero/src/cmd/kits/tpcc/tpcc_env.cpp:323:info: SF      = (2.0)
140162908509952: /home/***/zero/src/cmd/kits/tpcc/tpcc_env.cpp:324:info: Workers = (8)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: WAREHOUSE 2 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: DISTRICT 3 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: CUSTOMER 4 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: C_NAME_IDX 5 (latch) (no relaxed) (no unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: HISTORY 6 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: NEW_ORDER 7 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: ORDER 8 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: O_CUST_IDX 9 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: ORDERLINE 10 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: ITEM 11 (latch) (no relaxed) (unique)
CR: /home/***/zero/src/cmd/kits/table_desc.cpp:124:create_physical_index: STOCK 12 (latch) (no relaxed) (unique)
LD-%d: /home/***/zero/src/cmd/kits/tpcc/tpcc_env.cpp:173:work: Finished loading units 2100 .. 2000 
0
assertion failure: p.lsn() == emlsn
1. error in /home/***/zero/src/sm/log_spr.cpp:92 Assertion failed
    called from:
    0) /home/***/zero/src/sm/log_spr.cpp:92

Copied from original issue: iMax3060/zero#1

caetanosauer commented 8 years ago

Hi Max,

it seems weird that a scale factor of 2 with 8 workers is being picked by default. The default options should be SF 1 with 4 workers.

But of course, this is no excuse. There's probably something wrong in the kits loader code. We can see that in the line "Finished loading units 2100 .. 2000" -- this obviously doesn't make sense, because the units should only go up to 2000, and the first number should always be smaller than the second.

I'm checking this further now.

caetanosauer commented 8 years ago

Fixed in master. Please rebase your branch.

caetanosauer commented 8 years ago

From @iMax3060 on April 17, 2016 0:16

The scale factor and the number of workers wasn't a problem here. I executed the benchmark several times with different parameters and just picked one output as an example. But I had the same issue (the assertion failure) in the most times I executed the program independently from which parameters I have chosen. And I still get the assertion failure as described.

caetanosauer commented 8 years ago

Could you get a stack trace with gdb? Just run:

gdb --args zapps kits ...

When the assertion failure happens, you can type backtrace or just bt in the console.

It looks like pages are being evicted with dirty updates, causing single-page recovery, which then fails. There's a bug with recovery that I'm currently trying to fix -- this may be related to that.

caetanosauer commented 8 years ago

I just got this assertion failure myself -- working on it now

caetanosauer commented 8 years ago

This failure is happening very rarely for me, but so far I manager to narrow it down to a bug in the page cleaner in combination with instant restart and single-page recovery. If it's bothering you, I think setting --sm_cleaner_ignore_metadata true should make it go away for now. Alternatively, you can disable instant restart with --sm_restart_instant false

I will try to fix this and other bugs over the week.

caetanosauer commented 8 years ago

Fixed in f16e5e7, now in master