Open eadamsintel opened 6 years ago
Let's start with an ack - I can re-create the issue here as well. Now, I wonder where the problem is - is it going to be inside the container/VM, or outside... my gut tells me there will either be maybe:
I'm going to look into how we check out and track the port mappings both in the container (which might mean we have to enable the VM OS debug shell), and on the host side (which might mean digging into docker namespaces).
@sboeuf @amshinde - any ideas from your side around the agent/networking/port mapping side?
@grahamwhaley no idea off the top of my head, this needs further investigation.
Hi @eadamsintel - please can you:
Enable debug https://github.com/clearcontainers/runtime#debugging
Run the command (as requested in the issue template), and paste the output here:
$ sudo cc-collect-data.sh
I'm having a peek at this btw...
OK, some more info.
I noticed inside the container that with cc we are cycling through pids for
abc 2149 202 0 14:05 ? 00:00:00 mono --debug Radarr.exe -nobrows
whereas we don't with runc.
If you run the docker command with -ti and drop the -d, then you find that for cc we get a repeating "Press enter to exit..." prompt appearing over and over. I suspect therefore that something is upsetting and/or not working for the mono invocation, and it is stuck in a retry loop. Hence, the server is not up, so we cannot connect to the 7878 port. afaict, the port looks mapped on the host side btw - I think this is therefore likely not a portmap issue, but a mono execution issue.
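For anyone who wants to tell "port not mapped" apart from "server not answering" without a browser, here is a minimal Python sketch. The helper names tcp_connects and http_answers are made up for illustration; localhost and 7878 come from the docker run commands in this issue. It is only a rough probe: a successful TCP connect shows that something on the host accepted the connection, not that Radarr itself is up.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_connects(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_answers(url: str, timeout: float = 5.0) -> bool:
    """Return True if any HTTP response (even an error status) comes back."""
    # Skip any configured proxy so the local port is probed directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, just not with a 2xx status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


if __name__ == "__main__":
    print("tcp connect:", tcp_connects("localhost", 7878))
    print("http answer:", http_answers("http://localhost:7878/"))
```

If the first line prints True and the second prints False, that would be consistent with the port being mapped while the application behind it is not serving.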
Not sure how much this is going to help somebody (I have yet to digest it), but...
In /var/run/s6/services/radarr, save the run file there to a backup, and then make that run benign with something like a tail -f /dev/null to stop the system trying to restart the broken-ness. Then run the command by hand:
cd /opt/radarr; mono --debug Radarr.exe --nobrowser -data=/config
Then I end up with:
[Fatal] ConsoleApp: EPIC FAIL!
[v0.2.0.935] NzbDrone.Core.Datastore.CorruptDatabaseException: Database file: /config/nzbdrone.db is corrupt, restore from backup if available. See: https://github.com/Radarr/Radarr/wiki/FAQ#i-am-getting-an-error-database-disk-image-is-malformed ---> System.Data.SQLite.SQLiteException: disk I/O error
disk I/O error
at System.Data.SQLite.SQLite3.Prepare (System.Data.SQLite.SQLiteConnection cnn, System.String strSql, System.Data.SQLite.SQLiteStatement previous, System.UInt32 timeoutMS, System.String& strRemain) [0x0033c] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at System.Data.SQLite.SQLiteCommand.BuildNextCommand () [0x000f6] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at System.Data.SQLite.SQLiteCommand.GetStatement (System.Int32 index) [0x00008] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at (wrapper remoting-invoke-with-check) System.Data.SQLite.SQLiteCommand.GetStatement(int)
at System.Data.SQLite.SQLiteDataReader.NextResult () [0x0011e] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at System.Data.SQLite.SQLiteDataReader..ctor (System.Data.SQLite.SQLiteCommand cmd, System.Data.CommandBehavior behave) [0x00090] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at (wrapper remoting-invoke-with-check) System.Data.SQLite.SQLiteDataReader..ctor(System.Data.SQLite.SQLiteCommand,System.Data.CommandBehavior)
at System.Data.SQLite.SQLiteCommand.ExecuteReader (System.Data.CommandBehavior behavior) [0x0000c] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at System.Data.SQLite.SQLiteCommand.ExecuteNonQuery (System.Data.CommandBehavior behavior) [0x00006] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at System.Data.SQLite.SQLiteCommand.ExecuteNonQuery () [0x00006] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at System.Data.SQLite.SQLiteConnection.Open () [0x00959] in <61a20cde294d4a3eb43b9d9f6284613b>:0
at FluentMigrator.Runner.Processors.GenericProcessorBase.EnsureConnectionIsOpen () [0x0000e] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\Processors\GenericProcessorBase.cs:54
at FluentMigrator.Runner.Processors.SQLite.SQLiteProcessor.Exists (System.String template, System.Object[] args) [0x00000] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\Processors\SQLite\SQLiteProcessor.cs:78
at FluentMigrator.Runner.Processors.SQLite.SQLiteProcessor.TableExists (System.String schemaName, System.String tableName) [0x00000] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\Processors\SQLite\SQLiteProcessor.cs:47
at FluentMigrator.Runner.VersionLoader.get_AlreadyCreatedVersionTable () [0x00000] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\VersionLoader.cs:124
at FluentMigrator.Runner.VersionLoader.LoadVersionInfo () [0x00028] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\VersionLoader.cs:160
at FluentMigrator.Runner.VersionLoader..ctor (FluentMigrator.Runner.IMigrationRunner runner, FluentMigrator.Infrastructure.IAssemblyCollection assemblies, FluentMigrator.IMigrationConventions conventions) [0x00077] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\VersionLoader.cs:50
at FluentMigrator.Runner.MigrationRunner..ctor (FluentMigrator.Infrastructure.IAssemblyCollection assemblies, FluentMigrator.Runner.Initialization.IRunnerContext runnerContext, FluentMigrator.IMigrationProcessor processor) [0x00167] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\MigrationRunner.cs:102
at FluentMigrator.Runner.MigrationRunner..ctor (System.Reflection.Assembly assembly, FluentMigrator.Runner.Initialization.IRunnerContext runnerContext, FluentMigrator.IMigrationProcessor processor) [0x00000] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\MigrationRunner.cs:72
at NzbDrone.Core.Datastore.Migration.Framework.MigrationController.Migrate (System.String connectionString, NzbDrone.Core.Datastore.Migration.Framework.MigrationContext migrationContext) [0x000b5] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\Migration\Framework\MigrationController.cs:58
at NzbDrone.Core.Datastore.DbFactory.Create (NzbDrone.Core.Datastore.Migration.Framework.MigrationContext migrationContext) [0x00048] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\DbFactory.cs:84
--- End of inner exception stack trace ---
at NzbDrone.Core.Datastore.DbFactory.Create (NzbDrone.Core.Datastore.Migration.Framework.MigrationContext migrationContext) [0x00121] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\DbFactory.cs:116
at NzbDrone.Core.Datastore.DbFactory.Create (NzbDrone.Core.Datastore.Migration.Framework.MigrationType migrationType) [0x00000] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\DbFactory.cs:56
at NzbDrone.Core.Datastore.DbFactory.RegisterDatabase (NzbDrone.Common.Composition.IContainer container) [0x00000] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\DbFactory.cs:36
at Radarr.Host.NzbDroneServiceFactory.Start () [0x00037] in C:\projects\radarr-usby1\src\NzbDrone.Host\ApplicationServer.cs:60
at Radarr.Host.Router.Route (Radarr.Host.ApplicationModes applicationModes) [0x00067] in C:\projects\radarr-usby1\src\NzbDrone.Host\Router.cs:38
at Radarr.Host.Bootstrap.Start (Radarr.Host.ApplicationModes applicationModes, NzbDrone.Common.EnvironmentInfo.StartupContext startupContext) [0x0003d] in C:\projects\radarr-usby1\src\NzbDrone.Host\Bootstrap.cs:71
at Radarr.Host.Bootstrap.Start (NzbDrone.Common.EnvironmentInfo.StartupContext startupContext, Radarr.Host.IUserAlert userAlert, System.Action`1[T] startCallback) [0x00075] in C:\projects\radarr-usby1\src\NzbDrone.Host\Bootstrap.cs:39
at NzbDrone.Console.ConsoleApp.Main (System.String[] args) [0x0000e] in C:\projects\radarr-usby1\src\NzbDrone.Console\ConsoleApp.cs:27
Press enter to exit...
ah, ok, that is a 'database fail' on /config, which smells like 9pfs issues to me... let's try...
mkdir /dev/shm/config
cd /opt/radarr; mono --debug Radarr.exe --nobrowser -data=/dev/shm/config
to place the db on a tmpfs (ramfs) in the container - and - voila - we don't get the catastrophic failure, and I can browse the container on 7878.
/cc @eadamsintel - I think there is the root of the issue ;-)
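To compare the two locations without starting Radarr, a small SQLite round trip can be run inside the container against both /config and /dev/shm/config. This is a sketch under assumptions: sqlite_roundtrip and probe.db are hypothetical names, the access pattern is not Radarr's, and the journal_mode parameter is there only because this thread does not establish which journal mode Radarr uses. A clean result here does not prove the mount is healthy, but an error on /config and "ok" on the tmpfs path would point the same way as the observation above.

```python
import os
import sqlite3
import sys


def sqlite_roundtrip(directory: str, journal_mode: str = "DELETE") -> str:
    """Create, write, read back and delete a tiny SQLite database in directory.

    Returns "ok", or a string starting with "error:" describing the failure.
    """
    if journal_mode.upper() not in {"DELETE", "WAL"}:
        raise ValueError("journal_mode must be DELETE or WAL")
    path = os.path.join(directory, "probe.db")  # hypothetical file name
    try:
        conn = sqlite3.connect(path)
        try:
            # The PRAGMA returns the mode actually in effect.
            mode = conn.execute(f"PRAGMA journal_mode={journal_mode}").fetchone()[0]
            if mode.lower() != journal_mode.lower():
                return f"error: journal mode is {mode}, wanted {journal_mode}"
            conn.execute("CREATE TABLE t (v TEXT)")
            conn.execute("INSERT INTO t (v) VALUES (?)", ("hello",))
            conn.commit()
            rows = conn.execute("SELECT v FROM t").fetchall()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return f"error: {exc}"
    finally:
        # Remove the database and any side files it may have left behind.
        for suffix in ("", "-journal", "-wal", "-shm"):
            try:
                os.remove(path + suffix)
            except FileNotFoundError:
                pass
    return "ok" if rows == [("hello",)] else "error: unexpected rows"


if __name__ == "__main__":
    # Example: python3 probe.py /config /dev/shm/config
    for directory in sys.argv[1:]:
        for mode in ("DELETE", "WAL"):
            print(directory, mode, sqlite_roundtrip(directory, mode))
```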
@grahamwhaley oh, nice and quick debug! What's the next step? Because it's a 9p issue, does that mean we cannot expect this to work?
:-( I'd have to take the next step in debug to be decisive - we'd have to know exactly what failed with the 9pfs mounted files - I suspect it will be one of the 'unlink' related issues. Normally I use strace to find that, but for mono, which is a JIT'd VM, I wonder how well that will work? :-)
Short term, at least we know what the problem is. Mid term, we could re-visit the 9p patch sets and also look at what runv is carrying and see if we can improve the situation. Long term, we need a more POSIX compliant fs solution.
@grahamwhaley using devmapper might solve this issue then (unless the file that needs to be accessed is passed through 9p as an extra mount on top of the rootfs).
yeah, I considered that - it is a -v volume mapping, which I think always goes as a 9p mount, doesn't it? (/cc @amshinde ) Which, surprised me a couple of weeks ago, but having seen a recent conversation, I think we don't block mount volumes apart from the (readonly?) rootfs, as then the 'device' would be double mounted - once in host and once on the guest, and there could then be fs write races between the two that [cw]ould then corrupt the FS....
Oh yeah... I hadn't realized this was a -v assignment. In this case, we use 9p because we don't have the ability to package that into a block device that we could hotplug...
@grahamwhaley Yes, the -v bindings are always passed using 9pfs. We haven't implemented checks for verifying if the volume passed with -v is a mount backed by a block device. We do need to implement that, as we just handle this case with --device.
Maybe we can try this out, loopmount an image and pass the loop device as --device /dev/loop#/config and see if that helps.
That's an idea @amshinde - hmm, I wonder if that is viable as an interim 'hack' to mount volumes into the VMs as block devices, by a loopback and device mount. It's worth a try to see if it does work and fixes the issue initially anyhow... I'll add it to my list.
This should work, but don't expect good performance.
When testing a popular docker hub image called linuxserver/radarr (10 million pulls) you can't connect to port 7878 from a browser when using cc-runtime but runc works as expected.
First create a config directory at /config
mkdir /config
Run the container and attempt to go to http://:7878 and it works under runc but won't connect under cc-runtime.
docker run -d --runtime=runc --name=radarr -v /config:/config -p 7878:7878 linuxserver/radarr
This works and you can go to http://localhost:7878
docker run -d --runtime=cc-runtime --name=radarr -v /config:/config -p 7878:7878 linuxserver/radarr
This does not work and http://localhost:7878 times out
Trying the same thing with an nginx container works fine: nginx listens on port 80 inside the container, but passing in 7878 as the host port to use still works.
cc-runtime version 3.0.16
runc version 1.0.0-rc4+dev
docker version 17.09.1
Clear Linux version 20650