basho-labs / riak-mesos-tools

CLI and other tools for interacting with the Riak Mesos Framework.
Apache License 2.0

Scale up of version 1.0.0 results in rolling crashes of Riak on all nodes. #26

Closed ChiralMedia closed 8 years ago

ChiralMedia commented 8 years ago

Installing v1.0.0 from

dcos package repo add --index=0 RMF https://github.com/basho-labs/riak-mesos-dcos-repo/archive/dcoscli-v0.4.x.zip
dcos package install --options=/etc/riak-mesos/config.json riak

This results in a stable single instance of Riak, and clicking the link in Marathon shows Riak Explorer. Scaling up to more than one node results in all nodes repeatedly staging Riak, starting it, and then failing.

Config file is attached.

Any required logs can be obtained - let us know what you need and where to find them.

config.json.txt

sanmiguel commented 8 years ago

The task you see in Marathon would be the Scheduler component of the framework. In order to see riak nodes as they start, you'll need to look in the Mesos UI (normally at http://your-dcos-master/mesos).

The files that will help us debug the problem you are having are the contents of the riak_mesos_scheduler/log directory from the task container. You should be able to access this via the mesos UI, by clicking on the "Sandbox" link on the right-hand side of the riak.<uuid> task.
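
If you prefer the command line to the Mesos UI, the Mesos agent also exposes a files API that can fetch the same sandbox contents. A minimal sketch, assuming direct network access to the agent on the default port 5051; the sandbox path and log file name are placeholders you would copy from the Sandbox view of the riak.<uuid> task:

curl -o console.log "http://<agent-host>:5051/files/download?path=<sandbox-path>/riak_mesos_scheduler/log/console.log"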

ChiralMedia commented 8 years ago

Attached are some logs and screenshots of the issue; the logs start with a single instance running. It looks like the Riak service is not being unpacked - there is an entry in the log (attached as Initial_stderr) that states:

I0726 09:51:28.690701 7854 fetcher.cpp:84] Extracting with command: tar -C '/var/lib/mesos/slave/slaves/973f300c-8102-4260-b44b-1583189f2e2d-S2/frameworks/81850ec1-edb2-4feb-ab76-a689a3e11312-0000/executors/riak.130e4b34-530e-11e6-9f00-024252c70cb1/runs/1c2347b0-42b3-49f7-84d8-3af052d7b1c4' -xf '/var/lib/mesos/slave/slaves/973f300c-8102-4260-b44b-1583189f2e2d-S2/frameworks/81850ec1-edb2-4feb-ab76-a689a3e11312-0000/executors/riak.130e4b34-530e-11e6-9f00-024252c70cb1/runs/1c2347b0-42b3-49f7-84d8-3af052d7b1c4/riak-2.1.4-centos-7.tar.gz'
I0726 09:51:29.832343 7854 fetcher.cpp:92] Extracted '/var/lib/mesos/slave/slaves/973f300c-8102-4260-b44b-1583189f2e2d-S2/frameworks/81850ec1-edb2-4feb-ab76-a689a3e11312-0000/executors/riak.130e4b34-530e-11e6-9f00-024252c70cb1/runs/1c2347b0-42b3-49f7-84d8-3af052d7b1c4/riak-2.1.4-centos-7.tar.gz' into '/var/lib/mesos/slave/slaves/973f300c-8102-4260-b44b-1583189f2e2d-S2/frameworks/81850ec1-edb2-4feb-ab76-a689a3e11312-0000/executors/riak.130e4b34-530e-11e6-9f00-024252c70cb1/runs/1c2347b0-42b3-49f7-84d8-3af052d7b1c4'

Initial_stderr.txt Initial_stdout.txt

But none of the expected files are found in that directory - riak-admin, riak, etc. are not present in any subdirectory (screenshot: directorylisting) - so it doesn't appear Riak is running. Riak Explorer is unable to connect. Unpacking this file manually allows the Riak service to be started with ./riak start, but again Riak Explorer is unable to find any nodes.

(screenshot: riakexplorer)

Scaling up results in each of the instances crashing and being repeatedly restarted by Marathon (screenshots: singlehealthyinstance, failures).

After it fails, the scheduler log has the following:

2016-07-26 10:22:05.447 [info] <0.143.0>@rms_scheduler:error:208 Scheduler received error event. Error message: Framework failed over.

There are also a considerable number of entries like:

2016-07-26 10:22:02.943 [warning] <0.143.0>@rms_cluster_manager:apply_unreserved_offer:320 Applying of unreserved resources error. Node key: riak-default-1. Offer id: f92e9a8e-3dc9-4525-96d0-45aa80b4cf93-O948. Error reason: not_enough_resources.

However, each machine has 4 cores and 8GB of memory, and reducing the configured memory requirement for the Riak application in the DCOS GUI has no effect.

Scheduler_console.log.txt Scheduler_crash.log.txt Scheduler_error.log.txt

sanmiguel commented 8 years ago

Thanks for providing all of this data.

I see now at least part of the problem you are having... The task you have running there is the Scheduler component, not Riak itself. This is the part of the framework that takes care of coordinating Riak clusters and nodes. It is not currently possible to run multiple instances of the Scheduler from the same configuration in the same DCOS cluster, which is what Marathon will be attempting to do when you scale up to >1 instance via the Marathon UI. In our setup, Marathon does not manage the Riak nodes themselves, only the Scheduler which manages the nodes directly in Mesos.

The Scheduler does not run Riak itself, but delegates that to the Executors that it manages in Mesos. Once running, the Scheduler would not have Riak inside its own container, except as the package it will install in an Executor when starting a new node (which you can see inside the artifacts/ folder).

In order to interact with it and create a cluster, you need to use the CLI we provide, via the dcos riak sub-command.

For example, to create a 3-node Riak cluster, you would need to run:

dcos riak cluster create --cluster mycluster
dcos riak node add --cluster mycluster --nodes 3
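
If you want to confirm what the scheduler already knows about before changing anything, the CLI should also let you list it. A sketch, assuming the list sub-commands present in riak-mesos-tools; output format may vary by version:

dcos riak cluster list
dcos riak node list --cluster mycluster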

However, judging from your logfiles, you have already done this: this is why the Scheduler is trying to schedule nodes (the default value for the --cluster name argument is "default", which is why your nodes are named e.g. "riak-default-1") and you are seeing those loglines about not_enough_resources.

I'm assuming you're still using the same config.json you linked previously, which specifies that RMF should reserve 8000.0MB of memory for each Riak node. Your mesos-agent machines have only 8GB of memory total and Mesos will use some memory for itself and some will have been allocated to the Scheduler itself. If you lower the .riak.node.mem value in your config.json and re-install the framework, you should be able to start your cluster.
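
Concretely, that change plus the re-install looks something like this (the key path riak.node.mem is the one mentioned above; the example value is just an illustration - pick something that fits inside a single agent's free memory):

# in /etc/riak-mesos/config.json, lower the per-node memory, e.g.
#   "riak": { "node": { "mem": 2048.0, ... }, ... }
# then reinstall so the scheduler picks up the new value:
dcos package uninstall riak
dcos package install --options=/etc/riak-mesos/config.json riak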

Immediately before the loglines for not_enough_resources, you can see the resources Mesos is offering to the Scheduler: [EDIT: reformatted for clarity]

2016-07-26 10:22:04.945 [info] <0.143.0>@rms_scheduler:apply_offer:381
    Scheduler recevied offer. Offer id: f92e9a8e-3dc9-4525-96d0-45aa80b4cf93-O949.
    Resources: [{reserved,[{cpus,0.0},{mem,0.0},{disk,0.0},{num_ports,0},{num_persistence_ids,0}]},
                       {unreserved,[{cpus,3.5},{mem,4677.0},{disk,39993.0},{num_ports,30968}]}].
    Constraints: []. 

Here you can see that Mesos is offering only 4677.0MB of memory in this offer. An offer from Mesos-master would only be for resources from a single mesos-agent at a time, which is why there are multiple offers logged with different amounts of resources in each.
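
If you want to see what each agent currently has free without digging through the scheduler logs, the Mesos master's state endpoint is one way to check. A sketch, assuming the /mesos proxy path provided by DCOS and that jq is available; exact field names can vary between Mesos versions:

curl -s http://your-dcos-master/mesos/master/state-summary | jq '.slaves[] | {hostname, resources, used_resources}'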

There is a change in the 1.3.0 release (I am working on publishing the artifacts for it now) which logs more explicitly which resources are lacking when attempting to start a Riak node.

ChiralMedia commented 8 years ago

Thanks for the response. I'd come to the conclusion that there should be only one running, but it wasn't clear where the Riak DB servers were going to come from. I also altered the memory requirement - I was under the impression that it would be picked up when the application was restarted; I didn't realise a reinstall of the package would be required.

On the upside, Riak now deploys and the tasks appear listed under 'Riak', so that now works - thanks.

On the downside, Riak Explorer now displays '503 Service Unavailable'. Last time, when it was unable to see the cluster, it was working to some extent; now that the cluster is up, it has failed.

(screenshots: riakrunning, riakexplorerissue)

sanmiguel commented 8 years ago

Unfortunately this final problem is a little fiddly to work around, but fixable.

There's a bug in the version of adminrouter that ships with DCOS v1.7.0 (and earlier). We have a patch that's been merged into master, but this is yet to be released fully.

The workaround for now is to patch the file directly on your dcos master machines.

https://github.com/dcos/adminrouter/pull/8/files

We have had success doing this by simply replacing the file at /opt/mesosphere/packages/adminrouter-*/nginx/conf/service.lua with the copy from that PR.
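
For reference, the replacement amounts to something like the following on each master (the adminrouter systemd unit name is an assumption from our own setup; back up the original file first):

cd /opt/mesosphere/packages/adminrouter-*/nginx/conf
sudo cp service.lua service.lua.orig
# copy in the patched service.lua from the PR linked above, then restart adminrouter:
sudo systemctl restart dcos-adminrouter.service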

ChiralMedia commented 8 years ago

The file change has no effect. Adding a node to any cluster still results in Riak Explorer failing with a 503.

I've opened a case on it here https://github.com/basho-labs/riak-explorer-gui/issues/131

Thanks for all your help with this.



drewkerrigan commented 8 years ago

Hello,

The fact that you were able to run dcos riak node add successfully at all suggests to me that adminrouter is actually functioning properly, and that something else is going on here.

There are a few possibilities that may be resulting in the 503 that Riak Explorer is displaying, but I'd like you to try a few things to verify:

  1. Try going to http://dcosmaster.com/services/riak (substituting your master's address) to check that DCOS is properly routing to the scheduler. If successful, this should look the same as when you click on the link for the riak scheduler in Marathon
  2. On the page where you're seeing the 503, open up developer tools in your browser (for Chrome, right-click on the page and choose "inspect"). Reload the page and monitor the network requests for one returning a 503 or other 5XX code. Copy and paste the returned headers, as well as the response text (if any) - there's a curl sketch below if you prefer the command line. If this error is coming from the scheduler, you should see an Erlang stack trace which would give more information.
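
A minimal sketch of capturing that from the command line - the URL is a placeholder, so substitute whichever request the network tab shows returning the 503:

curl -s -D response-headers.txt -o response-body.txt 'http://your-dcos-master/<failing-request-path>'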

Thanks!

ChiralMedia commented 8 years ago

Files attached below.

From what I can see in the code, the 503 response comes from ember-riak-explorer.js and is a result of not being able to find 'cluster' in the riak_explorer.conf file; it can also come from application.js or as a catch-all in router.js. I'd thought that file (riak_explorer.conf) was distributed automatically with the scheduler/explorer etc. and would have been updated when I added new clusters/nodes using dcos riak cluster create etc.

ClustersRequest-1.txt ClustersResponse-1.txt (screenshots from 2016-07-28 20-06-02 and 2016-07-28 20-08-39)

drewkerrigan commented 8 years ago

Ok, this clears things up a bit.

The request flow that you're encountering goes something like this:

  1. The browser asks the scheduler for the list of clusters and information about those clusters
  2. The scheduler finds the cluster you've created and attempts to get the additional information about the Riak installation from one of the running Riak nodes, probably riak-default-1
  3. The request from the scheduler to the riak-default-1 node relies on the riak_explorer OTP application running inside of Riak on that node. You're seeing the 500 error because riak-default-1 is returning 404 Not Found, and the scheduler was not expecting that response.

This indicates to me that at least one of your Riak nodes is running; otherwise you would not have seen that 404 Not Found coming back from the node - it would have been a different error or symptom.

It also tells me that the riak_explorer OTP app was not properly included with your Riak distribution. Usually this would happen because of a problem with the configuration you used when installing the Riak Mesos Framework. Looking back at one of your previous comments, I think I see what the problem is:

[root@a2 ...]# ls
artifacts
...
riak_explorer
riak_mesos_scheduler
stderr
stdout
...

That is the scheduler's directory structure, and the riak_explorer directory should have been cleaned up by the script that starts the scheduler. It doesn't really matter whether it stays or not, but the fact that it wasn't cleaned up suggests that the name of your riak_explorer tarball (riak_explorer-1.1.1-centos-7.tar.gz) wasn't properly passed to the scheduler via riak-mesos-tools.

Could you verify that you have this line in your config.json? https://github.com/basho-labs/riak-mesos-tools/blob/master/config/config.dcos.json#L35

Also verify that the value exactly matches the tarball name (riak_explorer-1.1.1-centos-7.tar.gz in your case).
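
A quick way to check both values at once, assuming jq is available (the key names are the ones used in the default config linked above):

jq '.riak.node["explorer-url"], .riak.node["explorer-package"]' config.json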

You'll need to tear down your cluster and reinstall the dcos riak package again to test this change out.

dcos riak cluster destroy
dcos package uninstall riak
dcos package install riak --options config.json
dcos riak cluster create
...

P.S.: You've pointed out a problem with the scheduler and/or executor - we will need to create an issue on the riak-mesos-scheduler repo to fix the handling of unexpected responses so that we can make it more clear that something was missing from the riak installation, rather than just throwing a 500.

drewkerrigan commented 8 years ago

Actually, now that I look at the scheduler start script (https://github.com/basho-labs/riak-mesos-scheduler/blob/master/rel/files/ermf-scheduler.sh#L43-L44) it looks like the riak_explorer directory does not get cleaned up, so there might still be some other problem.

Could you go into the mesos interface, find riak-default-1 (or one of your other running node tasks), click the sandbox link, and attach the stderr and stdout files from the executor? I'm guessing there was probably an issue when it attempted to either download or expand the riak_explorer-1.1.1-centos-7.tar.gz from the scheduler.

Thanks!

drewkerrigan commented 8 years ago

And I just realized another potential problem you have - apologies for the confusion!

Your riak_explorer tarball name looked a little suspicious - it doesn't contain ".patch". The riak_explorer-1.1.1-centos-7.tar.gz is a full standalone Erlang application, not meant for Mesos - you actually wanted riak_explorer-1.1.1.patch-centos-7.tar.gz. Since you're going to need to update the value anyway, go ahead and upgrade to version 1.1.2. Your config should look like this:

{
    "riak": {
        "node": {
            "explorer-url": "https://github.com/basho-labs/riak_explorer/releases/download/1.1.2/riak_explorer-1.1.2.patch-centos-7.tar.gz",
            "explorer-package": "riak_explorer-1.1.2.patch-centos-7.tar.gz",
            ...
        },
    ...
    }
}

ChiralMedia commented 8 years ago

It looks like it was the .patch issue. After updating the config and reinstalling the framework, the issue is fixed.

Thanks so much for all your help guys.

(screenshot from 2016-07-28 22-38-28)

sanmiguel commented 8 years ago

It looks like your problems were resolved, so I'm going to close this issue.

Please feel free to reopen if you are still experiencing problems.

Thanks!