bitwarden / server

Bitwarden infrastructure/backend (API, database, Docker, etc).
https://bitwarden.com
Other
15.33k stars 1.28k forks source link

MSSQL container crashes because app/Icons.dll causes high cpu/disk after login #309

Closed ruffy91 closed 6 years ago

ruffy91 commented 6 years ago

After login in on the web interface the mssql container has 100% CPU load and reads with 50MB/s (~1k IOPS) for a period of time up to 5 minutes. Sometimes the container even crashes and restarts.

Where can I begin to troubleshoot? I could not see any suspicious messages in the mssql logs.

Platform is a VM with Ubuntu 18.04, 1 CPU Core (Skylake-SP), 2GB RAM and NVME Storage.

kspearrin commented 6 years ago

Does this happen every time you log into the web interface?

ruffy91 commented 6 years ago

No it does not happen every time and until now I was not able to reproduce it reliable.

It seems to me that this only happens after the first login after a period of inactivity.

Are there any additional logs for the mssql container which can be enabled? core dumps?

kspearrin commented 6 years ago

All logs for the mssql instance should be under the ./bwdata/logs/mssql directory.

ruffy91 commented 6 years ago

errorlog contains no events during the event. Also the two .xel files HkEngineEventFile and system_health contain no events. errorlog and sqlagent contain events concerning the recovery AFTER the container restarted.

Which container is the "client/consumer" of the mssql DB? Maybe I can find more information there.

kspearrin commented 6 years ago

bitwarden-mssql is the database and all other containers, such as bitwarden-api, are clients that connect to that database.

ruffy91 commented 6 years ago

So it just happened again. I log in and immediately choose an entry to edit and the mssql container just crashed again.

Api Log:

2018-06-16 07:55:15.242 +00:00 [Error] Connection id ""0HLEH81MAKSB8"", Request id ""0HLEH81MAKSB8:00000001"": An unhandled exception was thrown by the application. System.InvalidOperationException: Response Content-Length mismatch: too many bytes written (167 of 0). at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Frame.VerifyAndUpdateWrite(Int32 count) at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Frame.d210.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Frame.d201.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.WebUtilities.HttpResponseStreamWriter.d26.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Mvc.Formatters.JsonOutputFormatter.d11.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.d19.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.d24.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.Rethrow(ResultExecutedContext context) at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted) at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.d22.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.Rethrow(ResourceExecutedContext context) at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted) at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.d17.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Mvc.Internal.ResourceInvoker.d15.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Builder.RouterMiddleware.d4.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Bit.Core.Utilities.CurrentContextMiddleware.d2.MoveNext() in /Users/kyle/Projects/bitwarden/core/src/Core/Utilities/CurrentContextMiddleware.cs:line 90 --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.d6.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Cors.Infrastructure.CorsMiddleware.d7.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Bit.Core.Utilities.ServiceCollectionExtensions.<>c.<b9_0>d.MoveNext() in /Users/kyle/Projects/bitwarden/core/src/Core/Utilities/ServiceCollectionExtensions.cs:line 330 --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Hosting.Internal.RequestServicesContainerMiddleware.d3.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Frame`1.d2.MoveNext() 2018-06-16 08:29:25.316 +00:00 [Error] A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 2 - Connection was terminated) System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 2 - Connection was terminated) at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction) at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) at System.Data.SqlClient.TdsParserStateObject.ReadAsyncCallbackCaptureException(TaskCompletionSource1 source) ClientConnectionId:e7d90978-41a5-41e3-a942-5eb7582ff5d7

mssql errorlog before crash:

2018-06-16 08:03:22.65 Server Microsoft SQL Server 2017 (RTM-CU7) (KB4229789) - 14.0.3026.27 (X64) May 10 2018 12:38:11 Copyright (C) 2017 Microsoft Corporation Express Edition (64-bit) on Linux (Ubuntu 16.04.4 LTS) 2018-06-16 08:03:22.66 Server UTC adjustment: 0:00 2018-06-16 08:03:22.66 Server (c) Microsoft Corporation. 2018-06-16 08:03:22.66 Server All rights reserved. 2018-06-16 08:03:22.67 Server Server process ID is 4120. 2018-06-16 08:03:22.67 Server Logging SQL Server messages in file '/var/opt/mssql/log/errorlog'. 2018-06-16 08:03:22.67 Server Registry startup parameters: -d /var/opt/mssql/data/master.mdf -l /var/opt/mssql/data/mastlog.ldf -e /var/opt/mssql/log/errorlog 2018-06-16 08:03:22.68 Server SQL Server detected 1 sockets with 1 cores per socket and 1 logical processors per socket, 1 total logical processors; using 1 logical processors based on SQL Server licensing. This is an informational message; no user action is required. 2018-06-16 08:03:22.69 Server SQL Server is starting at normal priority base (=7). This is an informational message only. No user action is required. 2018-06-16 08:03:22.69 Server Detected 1556 MB of RAM. This is an informational message; no user action is required. 2018-06-16 08:03:22.70 Server Using conventional memory in the memory manager. 2018-06-16 08:03:23.02 Server Buffer pool extension is already disabled. No action is necessary. 2018-06-16 08:03:23.25 Server InitializeExternalUserGroupSid failed. Implied authentication will be disabled. 2018-06-16 08:03:23.25 Server Implied authentication manager initialization failed. Implied authentication will be disabled. 2018-06-16 08:03:23.29 Server Successfully initialized the TLS configuration. Allowed TLS protocol versions are ['1.0 1.1 1.2']. Allowed TLS ciphers are ['ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA:!DHE-RSA-AES256-GCM-SHA384:!DHE-RSA-AES128-GCM-SHA256:!DHE-RSA-AES256-SHA:!DHE-RSA-AES128-SHA']. 2018-06-16 08:03:23.36 Server The maximum number of dedicated administrator connections for this instance is '1' 2018-06-16 08:03:23.37 Server Node configuration: node 0: CPU mask: 0x0000000000000001:0 Active CPU mask: 0x0000000000000001:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required. 2018-06-16 08:03:23.39 Server Using dynamic lock allocation. Initial allocation of 2500 Lock blocks and 5000 Lock Owner blocks per node. This is an informational message only. No user action is required. 2018-06-16 08:03:23.40 Server In-Memory OLTP initialized on lowend machine. 2018-06-16 08:03:23.63 Server Database Instant File Initialization: enabled. For security and performance considerations see the topic 'Database Instant File Initialization' in SQL Server Books Online. This is an informational message only. No user action is required. 2018-06-16 08:03:23.66 Server Query Store settings initialized with enabled = 1, 2018-06-16 08:03:23.67 spid6s Starting up database 'master'. 2018-06-16 08:03:23.69 Server Software Usage Metrics is disabled. 2018-06-16 08:03:24.45 spid6s 41 transactions rolled forward in database 'master' (1:0). This is an informational message only. No user action is required. 2018-06-16 08:03:24.47 spid6s 0 transactions rolled back in database 'master' (1:0). This is an informational message only. No user action is required. 2018-06-16 08:03:24.48 spid6s Recovery is writing a checkpoint in database 'master' (1). This is an informational message only. No user action is required. 2018-06-16 08:03:24.80 spid6s Service Master Key could not be decrypted using one of its encryptions. See sys.key_encryptions for details. 2018-06-16 08:03:24.82 spid6s An error occurred during Service Master Key initialization. SQLErrorCode=33095, State=8, LastOsError=0. 2018-06-16 08:03:25.06 spid6s SQL Server Audit is starting the audits. This is an informational message. No user action is required. 2018-06-16 08:03:25.08 spid6s SQL Server Audit has started the audits. This is an informational message. No user action is required. 2018-06-16 08:03:25.45 spid6s SQL Trace ID 1 was started by login "sa". 2018-06-16 08:03:25.47 spid6s Server name is '4a72dc3f2b5d'. This is an informational message only. No user action is required. 2018-06-16 08:03:25.51 spid21s Always On: The availability replica manager is starting. This is an informational message only. No user action is required. 2018-06-16 08:03:25.52 spid21s Always On: The availability replica manager is waiting for the instance of SQL Server to allow client connections. This is an informational message only. No user action is required. 2018-06-16 08:03:25.55 spid6s Starting up database 'msdb'. 2018-06-16 08:03:25.55 spid9s Starting up database 'mssqlsystemresource'. 2018-06-16 08:03:25.57 spid9s The resource database build version is 14.00.3026. This is an informational message only. No user action is required. 2018-06-16 08:03:25.64 spid9s Starting up database 'model'. 2018-06-16 08:03:26.59 spid18s A self-generated certificate was successfully loaded for encryption. 2018-06-16 08:03:26.62 spid18s Server is listening on [ 'any' 1433]. 2018-06-16 08:03:26.66 spid18s Dedicated administrator connection support was not started because it is disabled on this edition of SQL Server. If you want to use a dedicated administrator connection, restart SQL Server using the trace flag 7806. This is an informational message only. No user action is required. 2018-06-16 08:03:26.74 spid18s SQL Server is now ready for client connections. This is an informational message; no user action is required. 2018-06-16 08:03:26.81 spid6s 2 transactions rolled forward in database 'msdb' (4:0). This is an informational message only. No user action is required. 2018-06-16 08:03:26.84 spid6s 0 transactions rolled back in database 'msdb' (4:0). This is an informational message only. No user action is required. 2018-06-16 08:03:26.85 spid6s Recovery is writing a checkpoint in database 'msdb' (4). This is an informational message only. No user action is required. 2018-06-16 08:03:26.86 spid9s Polybase feature disabled. 2018-06-16 08:03:26.87 spid9s Clearing tempdb database. 2018-06-16 08:03:28.56 spid9s Starting up database 'tempdb'. 2018-06-16 08:03:29.20 spid21s The Service Broker endpoint is in disabled or stopped state. 2018-06-16 08:03:29.21 spid21s The Database Mirroring endpoint is in disabled or stopped state. 2018-06-16 08:03:29.22 spid21s Service Broker manager has started. 2018-06-16 08:03:29.31 spid6s Recovery is complete. This is an informational message only. No user action is required. 2018-06-16 08:03:30.23 spid51 Attempting to load library 'xpsqlbot.dll' into memory. This is an informational message only. No user action is required. 2018-06-16 08:03:30.25 spid51 Using 'xpsqlbot.dll' version '2017.140.3026' to execute extended stored procedure 'xp_qv'. This is an informational message only; no user action is required. 2018-06-16 08:03:30.34 spid51 Attempting to load library 'xpstar.dll' into memory. This is an informational message only. No user action is required. 2018-06-16 08:03:30.48 spid51 Using 'xpstar.dll' version '2017.140.3026' to execute extended stored procedure 'xp_sqlagent_notify'. This is an informational message only; no user action is required. 2018-06-16 08:08:30.45 spid52 Attempting to load library 'xplog70.dll' into memory. This is an informational message only. No user action is required. 2018-06-16 08:08:30.58 spid52 Using 'xplog70.dll' version '2017.140.3026' to execute extended stored procedure 'xp_msver'. This is an informational message only; no user action is required. 2018-06-16 08:08:31.15 spid52 Starting up database 'vault'. 2018-06-16 08:08:31.55 spid52 Parallel redo is started for database 'vault' with worker pool size [1]. 2018-06-16 08:08:31.62 spid52 Parallel redo is shutdown for database 'vault' with worker pool size [1]. 2018-06-16 08:22:50.73 spid54 Starting up database 'vault'. 2018-06-16 08:22:51.19 spid54 Parallel redo is started for database 'vault' with worker pool size [1]. 2018-06-16 08:22:51.27 spid54 Parallel redo is shutdown for database 'vault' with worker pool size [1].

The only thing I can see is that the database vault was unexpectedly restarted. 08:29 is the time where the crash happened

So the timeline is: 07:59 while I am not logged in to the web interface (but while using the firefox extension) there is a unhandled exception in the Api log. 08:03 mssql container restarted 08:08 db vault restarted 08:22 db vault restarted 08:28 mssql container restarted

Also during the night the database vault restarted multiple times without anyone using bitwarden:

2018-06-16 04:06:42.81 spid53 Using 'xplog70.dll' version '2017.140.3026' to execute extended stored procedure 'xp_msver'. This is an informational message only; no user action is required. 2018-06-16 04:06:43.54 spid53 Starting up database 'vault'. 2018-06-16 04:06:43.97 spid53 Parallel redo is started for database 'vault' with worker pool size [1]. 2018-06-16 04:06:44.03 spid53 Parallel redo is shutdown for database 'vault' with worker pool size [1]. 2018-06-16 06:00:05.32 spid54 Starting up database 'vault'. 2018-06-16 06:00:05.79 spid54 Parallel redo is started for database 'vault' with worker pool size [1]. 2018-06-16 06:00:05.87 spid54 Parallel redo is shutdown for database 'vault' with worker pool size [1]. 2018-06-16 06:02:22.91 spid53 Using 'dbghelp.dll' version '4.0.5' 2018-06-16 06:34:18.58 spid54 Starting up database 'vault'. 2018-06-16 06:34:18.99 spid54 Parallel redo is started for database 'vault' with worker pool size [1]. 2018-06-16 06:34:19.06 spid54 Parallel redo is shutdown for database 'vault' with worker pool size [1]. 2018-06-16 06:57:38.09 spid54 Starting up database 'vault'. 2018-06-16 06:57:38.49 spid54 Parallel redo is started for database 'vault' with worker pool size [1]. 2018-06-16 06:57:38.57 spid54 Parallel redo is shutdown for database 'vault' with worker pool size [1]. 2018-06-16 07:05:35.91 spid54 Starting up database 'vault'. 2018-06-16 07:05:36.30 spid54 Parallel redo is started for database 'vault' with worker pool size [1]. 2018-06-16 07:05:36.37 spid54 Parallel redo is shutdown for database 'vault' with worker pool size [1]. 2018-06-16 07:26:48.05 spid54 Starting up database 'vault'. 2018-06-16 07:26:48.43 spid54 Parallel redo is started for database 'vault' with worker pool size [1]. 2018-06-16 07:26:48.51 spid54 Parallel redo is shutdown for database 'vault' with worker pool size [1].

Mart124 commented 6 years ago

Your system has 2GB of RAM which is quite short, do you have at least some swap space ?

ruffy91 commented 6 years ago

root@vault ~ # free -hm total used free shared buff/cache available Mem: 1.9G 924M 84M 40M 936M 815M Swap: 135M 37M 98M

I will increase swap but I did not get any out of memory warnings.

ruffy91 commented 6 years ago

I just watched such a crash via htop. The problem is this process "dotnet /app/Icons.dll" which eats up all the RAM and CPU when loading the web vault. As a consequence the mssql container starves to death.

I will disable icon loading as a workaround and report back if it happens again. If not then the problem is not mssql but "dotnet app/Icons.dll" Unfortunately I found absolutely no dotnet related logfiles.

kspearrin commented 6 years ago

Closing since this is being tracked here #303