Dockerhub image from linuxserver works with runc but not cc-runtime

eadamsintel commented 6 years ago

With the popular Docker Hub image linuxserver/radarr (10 million pulls), you can't connect to port 7878 from a browser when using cc-runtime, but runc works as expected.

First create a config directory at /config

mkdir /config

Run the container and attempt to go to http://localhost:7878. It works under runc but won't connect under cc-runtime.

docker run -d --runtime=runc --name=radarr -v /config:/config -p 7878:7878 linuxserver/radarr

This works, and you can go to http://localhost:7878.

docker run -d --runtime=cc-runtime --name=radarr -v /config:/config -p 7878:7878 linuxserver/radarr

This does not work, and http://localhost:7878 times out.

Trying the same thing with an nginx container works fine. The nginx container listens on port 80, but mapping 7878 as the host port still works.

cc-runtime version 3.0.16
runc version 1.0.0-rc4+dev
docker version 17.09.1
Clear Linux version 20650

grahamwhaley commented 6 years ago

Let's start with an ack - I can re-create the issue here as well. Now, I wonder where the problem is - is it going to be inside the container/VM, or outside... my gut tells me it will be one of:

something funky about mapping high number ports inside the container out to the host - maybe our kernel or agent namespacing is 'blocking' them somehow
or something slightly 'further out' onto the host side, to do with QEMU/KVM maybe.

I'm going to look into how we check out and track the port mappings both in the container (which might mean we have to enable the VM OS debug shell), and on the host side (which might mean digging into docker namespaces).

@sboeuf @amshinde - any ideas from your side around the agent/networking/port mapping side?

sboeuf commented 6 years ago

@grahamwhaley no idea off the top of my head, this needs further investigation.

jodh-intel commented 6 years ago

Hi @eadamsintel - please can you:

Enable debug https://github.com/clearcontainers/runtime#debugging
Run the command (as requested in the issue template), and paste the output here:
```
$ sudo cc-collect-data.sh
```

grahamwhaley commented 6 years ago

I'm having a peek at this btw...

grahamwhaley commented 6 years ago

OK, some more info. I noticed inside the container that with cc we are cycling through pids for

abc 2149 202 0 14:05 ? 00:00:00 mono --debug Radarr.exe -nobrows

whereas we don't with runc.

If you run the docker command with -ti and drop the -d, then you find that for cc we get a repeating

Press enter to exit...

prompt appearing over and over. I suspect therefore that something is upsetting and/or not working for the mono invocation, and it is stuck in a retry loop. Hence, the server is not up, so we cannot connect to the 7878 port. afaict, the port looks mapped on the host side btw - I think this is therefore likely not a portmap issue, but a mono execution issue.

grahamwhaley commented 6 years ago

Not sure how much this is going to help somebody (I have yet to digest it), but...

if you run the container with a bash shell
and go down to /var/run/s6/services/radarr
copy the run file there to a backup, and then make that run benign with something like a tail -f /dev/null to stop the system trying to restart the broken-ness
and then hand run the command:

cd /opt/radarr; mono --debug Radarr.exe --nobrowser -data=/config
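The manual steps above could be sketched as a small shell helper. This is only a sketch under assumptions: the function name `park_s6_service` and the `run.orig` backup name are made up for illustration, and it assumes `/bin/sh` and `tail` exist in the image.

```shell
#!/bin/sh
# Sketch: park an s6-supervised service so it stops being restarted.
# park_s6_service and the run.orig backup name are hypothetical,
# they do not come from the image or from s6 itself.
park_s6_service() {
    svc_dir=$1
    # Refuse to do anything if there is no run script to park.
    [ -f "$svc_dir/run" ] || return 1
    # Keep a copy of the original run script so it can be restored later.
    cp -p "$svc_dir/run" "$svc_dir/run.orig" || return 1
    # Replace it with a script that simply blocks forever.
    printf '%s\n' '#!/bin/sh' 'exec tail -f /dev/null' > "$svc_dir/run" || return 1
    chmod +x "$svc_dir/run"
}
```

Inside the container this would be called as `park_s6_service /var/run/s6/services/radarr`. Since the service is already crash-looping, the supervisor should pick up the benign script on its next restart, after which the hand-run mono command above no longer competes with it.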

Then I end up with:

[Fatal] ConsoleApp: EPIC FAIL!

[v0.2.0.935] NzbDrone.Core.Datastore.CorruptDatabaseException: Database file: /config/nzbdrone.db is corrupt, restore from backup if available. See: https://github.com/Radarr/Radarr/wiki/FAQ#i-am-getting-an-error-database-disk-image-is-malformed ---> System.Data.SQLite.SQLiteException: disk I/O error
disk I/O error
  at System.Data.SQLite.SQLite3.Prepare (System.Data.SQLite.SQLiteConnection cnn, System.String strSql, System.Data.SQLite.SQLiteStatement previous, System.UInt32 timeoutMS, System.String& strRemain) [0x0033c] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at System.Data.SQLite.SQLiteCommand.BuildNextCommand () [0x000f6] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at System.Data.SQLite.SQLiteCommand.GetStatement (System.Int32 index) [0x00008] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at (wrapper remoting-invoke-with-check) System.Data.SQLite.SQLiteCommand.GetStatement(int)
  at System.Data.SQLite.SQLiteDataReader.NextResult () [0x0011e] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at System.Data.SQLite.SQLiteDataReader..ctor (System.Data.SQLite.SQLiteCommand cmd, System.Data.CommandBehavior behave) [0x00090] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at (wrapper remoting-invoke-with-check) System.Data.SQLite.SQLiteDataReader..ctor(System.Data.SQLite.SQLiteCommand,System.Data.CommandBehavior)
  at System.Data.SQLite.SQLiteCommand.ExecuteReader (System.Data.CommandBehavior behavior) [0x0000c] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at System.Data.SQLite.SQLiteCommand.ExecuteNonQuery (System.Data.CommandBehavior behavior) [0x00006] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at System.Data.SQLite.SQLiteCommand.ExecuteNonQuery () [0x00006] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at System.Data.SQLite.SQLiteConnection.Open () [0x00959] in <61a20cde294d4a3eb43b9d9f6284613b>:0
  at FluentMigrator.Runner.Processors.GenericProcessorBase.EnsureConnectionIsOpen () [0x0000e] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\Processors\GenericProcessorBase.cs:54
  at FluentMigrator.Runner.Processors.SQLite.SQLiteProcessor.Exists (System.String template, System.Object[] args) [0x00000] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\Processors\SQLite\SQLiteProcessor.cs:78
  at FluentMigrator.Runner.Processors.SQLite.SQLiteProcessor.TableExists (System.String schemaName, System.String tableName) [0x00000] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\Processors\SQLite\SQLiteProcessor.cs:47
  at FluentMigrator.Runner.VersionLoader.get_AlreadyCreatedVersionTable () [0x00000] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\VersionLoader.cs:124
  at FluentMigrator.Runner.VersionLoader.LoadVersionInfo () [0x00028] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\VersionLoader.cs:160
  at FluentMigrator.Runner.VersionLoader..ctor (FluentMigrator.Runner.IMigrationRunner runner, FluentMigrator.Infrastructure.IAssemblyCollection assemblies, FluentMigrator.IMigrationConventions conventions) [0x00077] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\VersionLoader.cs:50
  at FluentMigrator.Runner.MigrationRunner..ctor (FluentMigrator.Infrastructure.IAssemblyCollection assemblies, FluentMigrator.Runner.Initialization.IRunnerContext runnerContext, FluentMigrator.IMigrationProcessor processor) [0x00167] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\MigrationRunner.cs:102
  at FluentMigrator.Runner.MigrationRunner..ctor (System.Reflection.Assembly assembly, FluentMigrator.Runner.Initialization.IRunnerContext runnerContext, FluentMigrator.IMigrationProcessor processor) [0x00000] in C:\Users\Mark\Source\Repos\fluentmigrator\src\FluentMigrator.Runner\MigrationRunner.cs:72
  at NzbDrone.Core.Datastore.Migration.Framework.MigrationController.Migrate (System.String connectionString, NzbDrone.Core.Datastore.Migration.Framework.MigrationContext migrationContext) [0x000b5] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\Migration\Framework\MigrationController.cs:58
  at NzbDrone.Core.Datastore.DbFactory.Create (NzbDrone.Core.Datastore.Migration.Framework.MigrationContext migrationContext) [0x00048] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\DbFactory.cs:84
   --- End of inner exception stack trace ---
  at NzbDrone.Core.Datastore.DbFactory.Create (NzbDrone.Core.Datastore.Migration.Framework.MigrationContext migrationContext) [0x00121] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\DbFactory.cs:116
  at NzbDrone.Core.Datastore.DbFactory.Create (NzbDrone.Core.Datastore.Migration.Framework.MigrationType migrationType) [0x00000] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\DbFactory.cs:56
  at NzbDrone.Core.Datastore.DbFactory.RegisterDatabase (NzbDrone.Common.Composition.IContainer container) [0x00000] in C:\projects\radarr-usby1\src\NzbDrone.Core\Datastore\DbFactory.cs:36
  at Radarr.Host.NzbDroneServiceFactory.Start () [0x00037] in C:\projects\radarr-usby1\src\NzbDrone.Host\ApplicationServer.cs:60
  at Radarr.Host.Router.Route (Radarr.Host.ApplicationModes applicationModes) [0x00067] in C:\projects\radarr-usby1\src\NzbDrone.Host\Router.cs:38
  at Radarr.Host.Bootstrap.Start (Radarr.Host.ApplicationModes applicationModes, NzbDrone.Common.EnvironmentInfo.StartupContext startupContext) [0x0003d] in C:\projects\radarr-usby1\src\NzbDrone.Host\Bootstrap.cs:71
  at Radarr.Host.Bootstrap.Start (NzbDrone.Common.EnvironmentInfo.StartupContext startupContext, Radarr.Host.IUserAlert userAlert, System.Action`1[T] startCallback) [0x00075] in C:\projects\radarr-usby1\src\NzbDrone.Host\Bootstrap.cs:39
  at NzbDrone.Console.ConsoleApp.Main (System.String[] args) [0x0000e] in C:\projects\radarr-usby1\src\NzbDrone.Console\ConsoleApp.cs:27

Press enter to exit...

ah, ok, that is a 'database fail' on /config, which smells like 9pfs issues to me... let's try...

mkdir /dev/shm/config
cd /opt/radarr; mono --debug Radarr.exe --nobrowser -data=/dev/shm/config

to place the db on a tmpfs (ramfs) in the container - and - voila - we don't get the catastrophic failure, and I can browse the container on 7878.
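To double-check which filesystem actually sits behind each directory, something like the following could be run inside the container. It is a minimal sketch: `fstype_of` is a hypothetical helper name, it only matches exact mount points (not arbitrary paths below them), and it assumes `awk` is available in the image.

```shell
#!/bin/sh
# Sketch: print the filesystem type of an exact mount point.
# fstype_of is a hypothetical helper; the table defaults to /proc/mounts.
fstype_of() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # Fields in /proc/mounts: device, mount point, fs type, options, ...
    # The last matching entry wins, since later mounts shadow earlier ones.
    awk -v mp="$mount_point" '
        $2 == mp { fstype = $3; found = 1 }
        END { if (!found) exit 1; print fstype }
    ' "$mounts_file"
}
```

Calling `fstype_of /config` and `fstype_of /dev/shm` would show whether the two experiments really ran on different filesystem types. If the 9pfs suspicion is right, one would expect the first to report a 9p type and the second tmpfs, but that has not been verified here.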

/cc @eadamsintel - I think there is the root of the issue ;-)

sboeuf commented 6 years ago

@grahamwhaley oh, nice and quick debug! What's the next step? Since it's a 9p issue, does that mean we cannot expect this to work?

grahamwhaley commented 6 years ago

:-( I'd have to take the next step in debug to be decisive - we'd have to know exactly what failed with the 9pfs mounted files - I suspect it will be one of the 'unlink' related issues. Normally I use strace to find that, but for mono, which is a JIT'd VM, I wonder how well that will work? :-)
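For what it's worth, the strace step could look roughly like the sketch below. It assumes strace is installed in the container and that the container is allowed to ptrace, neither of which is confirmed in this thread. The helper names `trace_radarr` and `failed_syscalls` and the log path are made up for illustration. strace observes calls at the kernel boundary, so a JIT'd runtime should not hide them, though the trace is likely to be noisy.

```shell
#!/bin/sh
# Sketch: trace the hand-run mono command, then list the system calls that failed.
# trace_radarr and failed_syscalls are hypothetical helpers; the log path is arbitrary.

trace_radarr() {
    log=${1:-/tmp/radarr.strace}
    # -f follows child processes and threads, -o writes the trace to a file,
    # -e trace=file,desc limits it to path- and descriptor-related calls.
    # The subshell keeps the cd from leaking into the calling shell.
    (
        cd /opt/radarr &&
            strace -f -o "$log" -e trace=file,desc \
                mono --debug Radarr.exe --nobrowser -data=/config
    )
}

failed_syscalls() {
    # strace reports a failed call as "... = -1 ERRNO (description)".
    grep -E ' = -1 E[A-Z0-9]+' "$1"
}
```

After `trace_radarr` has hit the failure, `failed_syscalls /tmp/radarr.strace` lists every failed call. Many of those will probably be harmless lookups, so the lines worth reading first are the ones that mention /config or that fail with an I/O error.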

Short term, at least we know what the problem is. Mid term, we could re-visit the 9p patch sets and also look at what runv is carrying and see if we can improve the situation. Long term, we need a more POSIX compliant fs solution.

sboeuf commented 6 years ago

@grahamwhaley using devmapper might solve this issue then (unless the file that needs to be accessed is passed through 9p as an extra mount on top of the rootfs).

grahamwhaley commented 6 years ago

yeah, I considered that - it is a -v volume mapping, which I think always goes as a 9p mount, doesn't it? (/cc @amshinde) That surprised me a couple of weeks ago, but having seen a recent conversation, I think we don't block-mount volumes apart from the (readonly?) rootfs, because the 'device' would then be double mounted - once on the host and once in the guest - and there could be fs write races between the two that [cw]ould corrupt the FS....

sboeuf commented 6 years ago

Oh yeah... I hadn't realized this was a -v assignment. In this case, we use 9p because we don't have the ability to package that into a block device that we could hotplug...

amshinde commented 6 years ago

@grahamwhaley Yes, the -v bindings are always passed using 9pfs. We haven't implemented checks to verify whether the volume passed with -v is a mount backed by a block device. We do need to implement that, as we currently handle this case only with --device.

Maybe we can try this out: loop-mount an image, pass the loop device as --device /dev/loop#/config, and see if that helps.
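A rough sketch of how that experiment might be set up on the host follows. The function names, image path and size are hypothetical, and it assumes `truncate`, `mkfs.ext4` and `losetup` are available and that sudo is permitted. How the guest then gets the device mounted at /config is not settled in this thread, so the final docker line only passes the device through.

```shell
#!/bin/sh
# Sketch: back the config data with a loop device instead of a plain -v bind mount.
# Function names, the image path and the size are hypothetical.

create_sparse_image() {
    # $1 = image path, $2 = size (for example 1G).
    # Creates (or resizes) a sparse file to hold the filesystem.
    truncate -s "$2" "$1"
}

attach_loop_device() {
    # Format the image, attach it to the first free loop device,
    # and print that device's path.
    mkfs.ext4 -q -F "$1" &&
        sudo losetup --find --show "$1"
}

# Possible usage (not run here):
#   create_sparse_image /var/tmp/radarr-config.img 1G
#   loopdev=$(attach_loop_device /var/tmp/radarr-config.img)
#   docker run -d --runtime=cc-runtime --name=radarr \
#       --device "$loopdev" -p 7878:7878 linuxserver/radarr
```

The image is kept sparse so that trying this costs little disk space. The loop device can be detached afterwards with `sudo losetup -d "$loopdev"`.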

grahamwhaley commented 6 years ago

That's an idea @amshinde - hmm, I wonder if that is viable as an interim 'hack' to mount volumes into the VMs as block devices, by a loopback and device mount. It's worth a try to see if it does work and fixes the issue initially anyhow... I'll add it to my list.

sboeuf commented 6 years ago

This should work, but don't expect good performance.

clearcontainers / runtime

Dockerhub image from linuxserver works with runc but not cc-runtime #986