FirebirdSQL / firebird

Firebird server, client and tools
https://www.firebirdsql.org/
1.19k stars 205 forks source link

Problem with parallel restore #7485

Closed tomaszdubiel18 closed 1 year ago

tomaszdubiel18 commented 1 year ago

Hello. I was testing how fast Firebird 5.0.0.958-0 restores the gbak with its new -parallel option. First I tested without -parallel, it took 8 minutes to restore (BTW, the gbak could not commit one index, the database was restored, but with errors: Database is not online due to failure to activate one or more indices.

OK, then I set all parallel values to 64 and launched gbak with -parallel 64. It ended after 2 minutes, but it stopped when failed to create the index (I assume that, because the final database's size is smaller). Moreover, it did not return any error message. I think that is not supposed to happen.

hvlad commented 1 year ago

Please, provide command line and few last lines of restore log. It is not clear to me what exact problem you describe.

tomaszdubiel18 commented 1 year ago

I used the same gbak from the issue where you fixed lack of informing about restore error. command: gbak.exe -rep -user SYSDBA -password password path_to_gbk path_to_fdb it ended with: gbak: ERROR: Foreign key reference target does not exist gbak: ERROR: Problematic key value is ("ID_NAGL" = 278228) gbak: ERROR:Database is not online due to failure to activate one or more indices. gbak: ERROR: Run gfix -online to bring database online without active indices.

However, when I ran: gbak.exe -rep -user SYSDBA -password password path_to_gbk path_to_fdb -parallel 64 I got NO OUTPUT. As far as I can tell, the process stopped on this indice error without any message.

hvlad commented 1 year ago

Please, check next snapshot build.

tomaszdubiel18 commented 1 year ago

Tested - without changes.

tomaszdubiel18 commented 1 year ago

You probably mean the next one after 961.

hvlad commented 1 year ago

Sure, next after 961

tomaszdubiel18 commented 1 year ago

Well, I tried with 964 and still the same.

tomaszdubiel18 commented 1 year ago

Maybe it changes something, because I do it with gbk which expects UDF function. When restoring, I get only this warning, but nothing else apart from it. obraz

hvlad commented 1 year ago

I fixed one reason of gbak crash when restoring in parallel and using remote (network) access to the server. Perhaps it was not your case. So, I need something to reproduce. Could you provide me with backup file you use ?

tomaszdubiel18 commented 1 year ago

I'm afraid it's not possible. There must be other way to help you fix this.

hvlad commented 1 year ago

BTW, does you check %errorlevel% after gbak ? Could you add -verbose switch to see when and how process stop ?

tomaszdubiel18 commented 1 year ago

I added -v switch - gbak simply activated and created last good index before the bad one and stopped. No sign of error. It looks like simple crash

hvlad commented 1 year ago

Could you provide a crash dump (with full memory) ?

tomaszdubiel18 commented 1 year ago

Please instruct me, how to do it.

hvlad commented 1 year ago

You may use WER or procdump, for example

tomaszdubiel18 commented 1 year ago

OK, I have it, its quite big, 468 MB, exceptions were thrown almost all the time. What data are present in such dump?

hvlad commented 1 year ago

Memory dump contains memory of process. It have no full image of database or backup file, if you concerns, but it have some data from backup file of course. 486MB is not a problem, and it should be good compressible.

tomaszdubiel18 commented 1 year ago

If you need more info, please tell me how to send it precisely to you: From WinDBG: Comment: ' procdump -ma -e -w gbak.exe Unhandled exception: C0000409' This dump file has an exception of interest stored in it. The stored exception information can be accessed via .ecxr. (264c.3e60): Security check failure or stack buffer overrun - code c0000409 (first/second chance not available) Subcode: 0x7 FAST_FAIL_FATAL_APP_EXIT For analysis of this file, run !analyze -v ucrtbase!abort+0x4e: 00007ff8`e001f61e cd29 int 29h

tomaszdubiel18 commented 1 year ago

From analysis: FAULTING_THREAD: 00003e60

STACK_TEXT:
0000000011c29830 00007ff8e001ed39 : 0000000000000003 0000000000000003 0000000011c29950 0000000000000000 : ucrtbase!abort+0x4e 0000000011c29860 00007ff8dffe80da : 0000000011c2a530 0000000011c29e50 0000000011c2a530 00007ff8e00a2000 : ucrtbase!terminate+0x29 0000000011c29890 00007ff8e2353eaf : 0000000000000000 00007ff8dffa0000 0000000000000081 0000000011c2a530 : ucrtbase!_C_specific_handler_noexcept+0x4a 0000000011c298c0 00007ff8e22ceae6 : 0000000011c2a530 00007ff8dffa0000 00007ff8dffc9363 00007ff8e00a4880 : ntdll!RtlpExecuteHandlerForException+0xf 0000000011c298f0 00007ff8e2352e9e : 0000000100000000 00000000000000f5 0000000011c2c240 0000000000006c70 : ntdll!RtlDispatchException+0x286 0000000011c2a040 00007ff8df94051c : 0000000000000000 0000000011c2c240 0000000011c2c730 0000000011c2c730 : ntdll!KiUserExceptionDispatch+0x2e 0000000011c2a7c0 00007ff8d8582743 : 00007ff8302d84b0 0000000011c2b7b8 00000e8c00000100 00007ff800000001 : KERNELBASE!RaiseException+0x6c 0000000011c2a8a0 00007ff8e23536f6 : 0000000000000001 0000000000000007 0000000000000000 0000000011c2c730 : VCRUNTIME140_1!FrameHandler4::CxxCallCatchBlock+0x1b3 0000000011c2a980 00007ff82fef1a21 : 0000000011c2fc00 0000000000000000 0000000011c2d1e8 0000000011c2fc00 : ntdll!RcConsolidateFrames+0x6 0000000011c2d070 00007ff83005e3fb : 0000000019bd9170 000000000079fff0 000000000079fff0 000000002c284e90 : engine13+0xc1a21 0000000011c2d890 00007ff830236b61 : 000000000014c938 0000000019bd9ce0 0000000019bd9ce0 000000002c284e90 : engine13+0x22e3fb 0000000011c2fe70 00007ff830236a08 : 000000002ab6dc00 0000000000000000 000000002c284e90 00007ff830236b80 : engine13!firebird_plugin+0x1a2701 0000000011c2fea0 00007ff8301fe5b7 : 0000000000000000 000000002c284e90 000000000dd5db60 000000002c284e90 : engine13!firebird_plugin+0x1a25a8 0000000011c2fed0 00007ff8dffc9363 : 00007ff830602ab8 0000000014e77310 000000000067df50 0000000000000000 : engine13!firebird_plugin+0x16a157 0000000011c2ff00 00007ff8e05026bd : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : ucrtbase!thread_start<unsigned int (cdecl)(void ),1>+0x93 0000000011c2ff30 00007ff8e230dfb8 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : kernel32!BaseThreadInitThunk+0x1d 0000000011c2ff60 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : ntdll!RtlUserThreadStart+0x28

STACK_COMMAND: ~7s ; .cxr ; kb

hvlad commented 1 year ago

Please, upload compressed dump at any place you wish (google disk, your company ftp, etc) and send link to me (fbvlad at gmail com)

The stack trace above is useless as WinDbg see no debug information (pdb) and can't resolve addresses.

tomaszdubiel18 commented 1 year ago

What exact data from the gbk file does the dump have? Current informations are not sufficient? As I can see, you have a concrete type of exception located in exact place in code.

aafemt commented 1 year ago

According to this page WinDBG can load debug symbols from files. You should try to load them from engine13.pdb before analyzing of stack.

hvlad commented 1 year ago

What exact data from the gbk file does the dump have?

At the crash moment - when index is built, by your words, gbak should not have any user data from backup file in memory. It have all tables\indices metadata, of course. Also, since you uses embedded restore, engine holds some data pages in buffer cache and pages of index that is currently builds.

Current informations are not sufficient?

No.

If you can't provide me with memory dump, you may download the same snapshot build with debug info, extract it, point WinDbg to it and generate stack trace once again.

tomaszdubiel18 commented 1 year ago

I will paste here in a few moments the analysis of the stack, according to @aafemt advice.

tomaszdubiel18 commented 1 year ago

Is it any better? I ran it with /f path to engine13.pdb

NTGLOBALFLAG: 0

PROCESS_BAM_CURRENT_THROTTLED: 0

PROCESS_BAM_PREVIOUS_THROTTLED: 0

APPLICATION_VERIFIER_FLAGS: 0

CONTEXT: (.ecxr) rax=0000000000000001 rbx=00000000119ba530 rcx=0000000000000007 rdx=00006d0ec2aaeab3 rsi=00000000119ba530 rdi=00000000119ba040 rip=00007ff8e001f61e rsp=00000000119b9830 rbp=00000000119b9e50 r8=00007ff8e00a0920 r9=00000000119b97a8 r10=0000000000000013 r11=c8000000000000ff r12=0000000000000000 r13=00000000119b9950 r14=0000000000000000 r15=00000000119ba040 iopl=0 nv up ei pl nz na pe nc cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202 ucrtbase!abort+0x4e: 00007ff8`e001f61e cd29 int 29h Resetting default scope

EXCEPTION_RECORD: (.exr -1) ExceptionAddress: 00007ff8e001f61e (ucrtbase!abort+0x000000000000004e) ExceptionCode: c0000409 (Security check failure or stack buffer overrun) ExceptionFlags: 00000001 NumberParameters: 1 Parameter[0]: 0000000000000007 Subcode: 0x7 FAST_FAIL_FATAL_APP_EXIT

PROCESS_NAME: gbak.exe

ERROR_CODE: (NTSTATUS) 0xc0000409 - System wykry w tej aplikacji przekroczenie buforu opartego na stosie. Przekroczenie mo e umo liwi z o liwemu u ytkownikowi uzyskanie kontroli nad t aplikacj .

EXCEPTION_CODE_STR: c0000409

EXCEPTION_PARAMETER1: 0000000000000007

FAULTING_THREAD: 000047a4

STACK_TEXT:
00000000119b9830 00007ff8e001ed39 : 0000000000000003 0000000000000003 00000000119b9950 0000000000000000 : ucrtbase!abort+0x4e 00000000119b9860 00007ff8dffe80da : 00000000119ba530 00000000119b9e50 00000000119ba530 00007ff8e00a2000 : ucrtbase!terminate+0x29 00000000119b9890 00007ff8e2353eaf : 0000000000000000 00007ff8dffa0000 0000000000000081 00000000119ba530 : ucrtbase!_C_specific_handler_noexcept+0x4a 00000000119b98c0 00007ff8e22ceae6 : 00000000119ba530 00007ff8dffa0000 00007ff8dffc9363 00007ff8e00a4880 : ntdll!RtlpExecuteHandlerForException+0xf 00000000119b98f0 00007ff8e2352e9e : 0000000100000000 00000000000000f5 00000000119bc240 0000000000000a10 : ntdll!RtlDispatchException+0x286 00000000119ba040 00007ff8df94051c : 0000000000000000 00000000119bc240 00000000119bc730 00000000119bc730 : ntdll!KiUserExceptionDispatch+0x2e 00000000119ba7c0 00007ff8d8582743 : 00007ff8662584b0 00000000119bb7b8 0000e31f00000100 00007ff800000001 : KERNELBASE!RaiseException+0x6c 00000000119ba8a0 00007ff8e23536f6 : 0000000000000001 0000000000000007 0000000000000000 00000000119bc730 : VCRUNTIME140_1!FrameHandler4::CxxCallCatchBlock+0x1b3 00000000119ba980 00007ff865e71a21 : 00000000119bfc00 0000000000000000 00000000119bd1e8 00000000119bfc00 : ntdll!RcConsolidateFrames+0x6 00000000119bd070 00007ff865fde3fb : 000000001325c070 0000000001f4fff0 0000000001f4fff0 000000002a6bcf50 : engine13+0xc1a21 00000000119bd890 00007ff8661b6b61 : 000000000014c938 000000001b4bbb50 000000001b4bbb50 000000002a6bcf50 : engine13+0x22e3fb 00000000119bfe70 00007ff8661b6a08 : 000000001670eae0 0000000000000000 000000002a6bcf50 00007ff8661b6b80 : engine13!firebird_plugin+0x1a2701 00000000119bfea0 00007ff86617e5b7 : 0000000000000000 000000002a6bcf50 00000000132c1000 000000002a6bcf50 : engine13!firebird_plugin+0x1a25a8 00000000119bfed0 00007ff8dffc9363 : 00007ff866582ab8 00000000132c49a0 00000000004b5bd0 0000000000000000 : engine13!firebird_plugin+0x16a157 00000000119bff00 00007ff8e05026bd : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : ucrtbase!thread_start<unsigned int (cdecl)(void ),1>+0x93 00000000119bff30 00007ff8e230dfb8 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : kernel32!BaseThreadInitThunk+0x1d 00000000119bff60 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : ntdll!RtlUserThreadStart+0x28

STACK_COMMAND: ~7s ; .cxr ; kb

SYMBOL_NAME: ucrtbase!abort+4e

MODULE_NAME: ucrtbase

IMAGE_NAME: ucrtbase.dll

FAILURE_BUCKET_ID: FAIL_FAST_FATAL_APP_EXIT_c0000409_ucrtbase.dll!abort

OS_VERSION: 10.0.22621.1

BUILDLAB_STR: ni_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

IMAGE_VERSION: 6.2.22621.608

FAILURE_ID_HASH: {e31753ac-c98a-8055-3663-47e707543d20}

Followup: MachineOwner

aafemt commented 1 year ago

If your dump is from Windows 64 snapshot 964 then exception happen in MET_update_generator_increment() called from IndexCreateTask::handler() which makes no sense for me. But MET part can be trash from stack corruption.

@tomaszdubiel18 Do you see function names and line numbers after "engine13+0x22e3fb"? No. It means that this is useless (unless you want Vlad to waste time digging MAP files).

tomaszdubiel18 commented 1 year ago

I'm not an expert in this. I used your advice and it seems it didnt help. What could I do wrong?

hvlad commented 1 year ago

If you run WinDbg in interactive mode:

There should be response like:

************* Path validation summary **************
Response                         Time (ms)     Location
OK                                             Z:\Temp\gh-7481\firebird_3_0
OK                                             Z:\Temp\gh-7481\firebird_3_0\plugins

here I exctract Firebird snapshot into Z:\Temp\gh-7481\firebird_3_0 folder

PS It will be much more easy and save a lot of time if you provide me with that dump

tomaszdubiel18 commented 1 year ago

Sorry, but if I'm not sure what the data are present in this dump, I wont send it. One more time: it looks like now it has more data in it. Please confirm:

STACK_TEXT:
00000000119b9830 00007ff8e001ed39 : 0000000000000003 0000000000000003 00000000119b9950 0000000000000000 : ucrtbase!abort+0x4e 00000000119b9860 00007ff8dffe80da : 00000000119ba530 00000000119b9e50 00000000119ba530 00007ff8e00a2000 : ucrtbase!terminate+0x29 00000000119b9890 00007ff8e2353eaf : 0000000000000000 00007ff8dffa0000 0000000000000081 00000000119ba530 : ucrtbase!_C_specific_handler_noexcept+0x4a 00000000119b98c0 00007ff8e22ceae6 : 00000000119ba530 00007ff8dffa0000 00007ff8dffc9363 00007ff8e00a4880 : ntdll!RtlpExecuteHandlerForException+0xf 00000000119b98f0 00007ff8e2352e9e : 0000000100000000 00000000000000f5 00000000119bc240 0000000000000a10 : ntdll!RtlDispatchException+0x286 00000000119ba040 00007ff8df94051c : 0000000000000000 00000000119bc240 00000000119bc730 00000000119bc730 : ntdll!KiUserExceptionDispatch+0x2e 00000000119ba7c0 00007ff8d8582743 : 00007ff8662584b0 00000000119bb7b8 0000e31f00000100 00007ff800000001 : KERNELBASE!RaiseException+0x6c 00000000119ba8a0 00007ff8e23536f6 : 0000000000000001 0000000000000007 0000000000000000 00000000119bc730 : VCRUNTIME140_1!__FrameHandler4::CxxCallCatchBlock+0x1b3 00000000119ba980 00007ff865e71a21 : 00000000119bfc00 0000000000000000 00000000119bd1e8 00000000119bfc00 : ntdll!RcConsolidateFrames+0x6 00000000119bd070 00007ff865fde3fb : 000000001325c070 0000000001f4fff0 0000000001f4fff0 000000002a6bcf50 : engine13!MET_scan_relation+0xdd1 00000000119bd890 00007ff8661b6b61 : 000000000014c938 000000001b4bbb50 000000001b4bbb50 000000002a6bcf50 : engine13!Jrd::IndexCreateTask::handler+0x34b 00000000119bfe70 00007ff8661b6a08 : 000000001670eae0 0000000000000000 000000002a6bcf50 00007ff8661b6b80 : engine13!Firebird::Worker::work+0x51 00000000119bfea0 00007ff86617e5b7 : 0000000000000000 000000002a6bcf50 00000000132c1000 000000002a6bcf50 : engine13!Firebird::WorkerThread::threadRoutine+0x68 00000000119bfed0 00007ff8dffc9363 : 00007ff866582ab8 00000000132c49a0 00000000004b5bd0 0000000000000000 : engine13!threadStart+0x67 00000000119bff00 00007ff8e05026bd : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : ucrtbase!thread_start<unsigned int (__cdecl)(void ),1>+0x93 00000000119bff30 00007ff8e230dfb8 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : kernel32!BaseThreadInitThunk+0x1d 00000000119bff60 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : ntdll!RtlUserThreadStart+0x28

hvlad commented 1 year ago

Much better but still not allows to make conclusion. Try command k to produce a stack trace with source file\line numbers.

tomaszdubiel18 commented 1 year ago

Im new to this. I should add -k to procdump command or somewhere in WinDbg?

hvlad commented 1 year ago

At Windbg console. Just k, without slash\dash\etc.

So far I can say that looks like the crash happens when worker thread loads relation metadata and handle some computed field.

tomaszdubiel18 commented 1 year ago

Child-SP RetAddr Call Site

00 00000000119b9830 00007ff8e001ed39 ucrtbase!abort+0x4e 01 00000000119b9860 00007ff8dffe80da ucrtbase!terminate+0x29 02 00000000119b9890 00007ff8e2353eaf ucrtbase!_C_specific_handler_noexcept+0x4a 03 00000000119b98c0 00007ff8e22ceae6 ntdll!RtlpExecuteHandlerForException+0xf 04 00000000119b98f0 00007ff8e2352e9e ntdll!RtlDispatchException+0x286 05 00000000119ba040 00007ff8df94051c ntdll!KiUserExceptionDispatch+0x2e 06 00000000119ba7c0 00007ff8d8582743 KERNELBASE!RaiseException+0x6c 07 00000000119ba8a0 00007ff8e23536f6 VCRUNTIME140_1!FrameHandler4::CxxCallCatchBlock+0x1b3 [d:\a01_work\12\s\src\vctools\crt\vcruntime\src\eh\frame.cpp @ 1417] 08 00000000119ba980 00007ff865e71a21 ntdll!RcConsolidateFrames+0x6 09 00000000119bd070 00007ff865fde3fb engine13!MET_scan_relation+0xdd1 [C:\firebird-build\gen\jrd\met.cpp @ 9583] 0a 00000000119bd890 00007ff8661b6b61 engine13!Jrd::IndexCreateTask::handler+0x34b [C:\firebird-build\src\jrd\idx.cpp @ 463] 0b 00000000119bfe70 00007ff8661b6a08 engine13!Firebird::Worker::work+0x51 [C:\firebird-build\src\common\Task.cpp @ 136] 0c 00000000119bfea0 00007ff86617e5b7 engine13!Firebird::WorkerThread::threadRoutine+0x68 [C:\firebird-build\src\common\Task.cpp @ 61] 0d (Inline Function) ---------------- engine13!anonymous-namespace'::ThreadArgs::run+0x5 [C:\firebird-build\src\common\ThreadStart.cpp @ 78] 0e 00000000119bfed0 00007ff8dffc9363 engine13!threadStart+0x67 [C:\firebird-build\src\common\ThreadStart.cpp @ 97] 0f 00000000119bff00 00007ff8e05026bd ucrtbase!thread_start<unsigned int (cdecl)(void ),1>+0x93 10 00000000119bff30 00007ff8e230dfb8 kernel32!BaseThreadInitThunk+0x1d 11 00000000119bff60 0000000000000000 ntdll!RtlUserThreadStart+0x28

hvlad commented 1 year ago

Thanks, it confirms my prior message. But this is all I can say with this info. Dump allows to get more info diving into local variables and so on.

tomaszdubiel18 commented 1 year ago

You need to work with this.

aafemt commented 1 year ago

Im new to this.

Then you should learn WinDBG and provide useful information. Start from values of parameters and local variables in MET_scan_relation().

tomaszdubiel18 commented 1 year ago

With your help I can do everything ;)

hvlad commented 1 year ago

I can't work with limited info you provided, sorry.

tomaszdubiel18 commented 1 year ago

Please, do not expect from me to post probably sensitive data from my customer database. If you need more info, please tell me what to do further. Once I learn, I won't ask for that again.

aafemt commented 1 year ago

Learn this: https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/accessing-local-variables and whole topic.

tomaszdubiel18 commented 1 year ago

MET_scan_relation: relation 0x18b15900 Jrd::jrd_rel type_rel (9) 0x1325c070 0x115 0x2 0x40110 0x0 {...} {...} {...} 0x0 0x4418970 0x0 {...} 0x0 {...} 0x0 0x0 0 0x18b06b10 0x18b069d0 0x18b06a70 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 {...} {...} {...} {...} {...} 0x0 {...} 0x0
tdbb 0x119bfc00 Jrd::thread_db
0x0 tddDBB (2) 100 10 0x1325c070 0x1f152c0 0x17e81dc0 0x182c1dc0 0x0 0x1f01430 0x182c2028 0x17e81fb0 0x1f159a0 0x119bfbd0 75 0x90 0x0 {...} 0x132c49a0 {...}

LocalVariables - there is nothing

aafemt commented 1 year ago

Learn more. You must change frame to see local variables.

tomaszdubiel18 commented 1 year ago

Am I in good place? obraz

aafemt commented 1 year ago

Almost. You are one frame up from needed one (or down depending on POV). Study https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/dv--display-local-variables- to get values in text format instead of pictures.

tomaszdubiel18 commented 1 year ago

Frame 0xb this:0x1b4bbb50 workItem:0x1670eae0 I tried dv, but got: Unable to enumerate locals, Win32 error 0n87 Private symbols (symbols.pri) are required for locals.

aafemt commented 1 year ago

Oh, c'mon, use brain. Interesting frame is MET_scan_relation() but you went in contrary direction - to Worker::work().

tomaszdubiel18 commented 1 year ago

As I already posted here, in this frame there are no LocalVariables - only Attributes and Parameters. Dont you have any chat to move on instead of prolonging this issue?

aafemt commented 1 year ago

Instead of wasting time to chat you'd better provide a remote desktop for Vlad.

tomaszdubiel18 commented 1 year ago

If that's your attitude...

aafemt commented 1 year ago

That's your attitude. You cannot do the job yourself and you refuse to provide needed remedies to someone who can. I see no exit from this other than you don't use the feature and wait while someone more cooperative encounter it.