basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0
3.92k stars 534 forks source link

Mas i1037 riakdiag #1047

Closed martinsumner closed 3 years ago

martinsumner commented 3 years ago

Alternative PR to https://github.com/basho/riak/pull/1040

This includes updates to riaknostic so riak admin diag will work.

sudo riak debug appears to work straight from install then start (tested on ubuntu 18)

Bob-The-Marauder commented 3 years ago

I'm afraid the substitution is failing as when I run riak debug, it again runs riak-debug -s so it only runs a selection of the commands instead of the full suite. It also fails to collect other information such as logs and patches. Finally it dumps to /tmp/riak@127.0.0.1-riak-debug as a folder and finishes there rather than bundling that folder into a tar.gz file in the folder that called riak debug originally and then removing the folder from the /tmp directory. Please see output differences below:

Pull 1047 top level

[root@localhost ~]# ls /tmp/riak@127.0.0.1-riak-debug/
commands

Pull 1040 top level

[root@localhost ~]# ls riak@127.0.0.1-riak-debug
commands  config  logs  patches  ring

Pull 1047 commands folder

[root@localhost ~]# ls /tmp/riak@127.0.0.1-riak-debug/commands/
blockdev._dev_sda1  df    dmesg  hostname      last   ps   sestatus  sysctl  vmstat
date                df_i  free   java_version  mount  rpm  swapon    uname   w

Pull 1040 commands folder

[root@localhost ~]# ls riak@127.0.0.1-riak-debug/commands/
blockdev._dev_sda1  diskstats       java_version  mount               riak_ping               riak_status     sestatus
cpuinfo             dmesg           last          ps                  riak_repl_clusterstats  riak_transfers  swapon
date                free            limits.conf   redhat_release      riak_repl_connections   riak_version    sysctl
df                  hostname        limits.d      riak_diag           riak_repl_modes         rpm             uname
df_i                iptables_nat    meminfo       riak_http_stats     riak_repl_status        rx_crc_errors   vmstat
disk_by_id          iptables_rules  messages      riak_member_status  riak_ring_status        schedulers      w

If I go in to /usr/lib64/riak/bin and execute ./riak-debug in pull 1047, I get almost correct behaviour. It pulls everything it is meant to, deletes its folder in /tmp and creates a .tar.gz in the folder I executed ./riak-debug from. This direct calling is what the symlinks achieved in pull 1040.

Given the similarity, of the two pull requests,, I think we either add riaknostic to 1040 or add the symlinks to 1047.

martinsumner commented 3 years ago

The problem with the symlink, is that I've had them crash an install (when the symlink is already present). This changes the presence of riak debug into something that is potentially harmful for the people that don't use it.

I will try and look deeper later today.

Bob-The-Marauder commented 3 years ago

Could we get around the pre-existing symlink issue by adding a rm -rf /usr/sbin/riak-debug after

preun
# Pre-uninstall script

in rel/pkg/rpm/specfile and to rel/pkg/deb/debian/postrm? By deleting the symlinks on uninstall, they should not be an issue if they are re-created.

Alternatively we could try something like [ -L /usr/sbin/riak-debug ] || (ln -s /usr/lib64/riak/bin/riak-debug /usr/sbin/riak-debug ) so we only attempt to create the symlink if it doesn't exist. Obviously there would be different paths for Ubuntu/Debian vs Rhel/CentOS.

Please note the above are off the top of my head and not tested yet.

martinsumner commented 3 years ago

As I understand it, the link is only resolving the problem in that it now allows you to run:

sudo riak-debug

i.e. this works without the link if you ran:

sudo /usr/lib/riak/bin/riak-debug

The command we want to run:

sudo riak debug

doesn't work as expected with or without the link. It in fact behaves in a strange way, acting as if you passed in config options to produce some stats and not generate a file. If you pass in config switches it behaves in bizarre ways e.g:

sudo riak debug -r

will produce the riak outputs, and will now attempt to create the tar/zip file, but will fail as it tries to dump the tar/zip file in a privileged location.

It would perhaps be understandable if sudo riak debug behaved as if no switches had been passed, but this isn't true.

So something is garbling the call to riak-debug when it goes through the main riak script. We can ignore this and run riak-debug directly (either by calling it in its actual location or making a symbolic link to the location in the path). But there must exist the concern that this is a canary for a more general problem with passing information through the relx generated riak script.

There's nothing obvious in the actual riak script to explain this:

elx_is_extension() {
    EXTENSION=$1
    case "$EXTENSION" in
        admin|debug|repl|chkconfig|undefined)
            echo "1"
        ;;
        *)
            echo "0"
        ;;
    esac
}
relx_get_extension_script() {
    EXTENSION=$1
    # below are the extensions declarations
    # of the form:
    # foo_extension="path/to/foo_script";bar_extension="path/to/bar_script"
    admin_extension="riak-admin";debug_extension="riak-debug";repl_extension="riak-repl";chkconfig_extension="riak-
chkconfig"
    # get the command extension (eg. foo) and
    # obtain the actual script filename that it
    # refers to (eg. "path/to/foo_script"
    eval echo "$"${EXTENSION}_extension""
}
relx_run_extension() {
    # drop the first argument which is the name of the
    # extension script
    EXTENSION_SCRIPT=$1
    shift
    # all extension script locations are expected to be
    # relative to the start script location
    [ "$SCRIPT_DIR/$EXTENSION_SCRIPT" ] && . "$SCRIPT_DIR/$EXTENSION_SCRIPT" "$@"
}
*)
        # check for extension
        IS_EXTENSION=$(relx_is_extension $1)
        if [ "$IS_EXTENSION" = "1" ]; then
            EXTENSION_SCRIPT=$(relx_get_extension_script $1)
            shift
            relx_run_extension $EXTENSION_SCRIPT "$@"
            # all extension scripts are expected to exit
        else
            relx_usage $1
        fi
        exit 1
        ;;
esac

and the riak-debug script is explicit what the defaults should be if nothing is passed (and this isn't happening):

if [ 0 -eq $(( $get_cfgs + $get_logs + $get_patches + $get_riakcmds + $get_yzcmds + $get_syscmds + $get_extracmds )
) ]; then
    # Nothing specific was requested, so get everything except extracmds
    get_cfgs=1
    get_logs=1
    get_patches=1
    get_riakcmds=1
    get_yzcmds=1
    get_syscmds=1
    get_extracmds=0
fi
martinsumner commented 3 years ago

So I don't think this is because the wrong switches are being passed in e.g. "-s"

If we're running riak debug we're doing this as the riak user. I think the script is failing at some point (probably whilst getting the system data, and as the system data is the first thing to run that's why it looks like it is as if just that switch has been passed). This may be because of privileges - whereas if we run sudo riak-debug directly were doing this as root.

Bob-The-Marauder commented 3 years ago

That's exactly where my previous work got me to. I couldn't find for the life of me why riak debug called riak-debug -s and then stopped working at the /tmp directory level. I, possibly foolishly, thought it was only related to riak debug and, given time constraints, opted to find a temporary work around rather than a solution. Hence the symlinks.

What you've expressed here has raised a concern that there is more potential available for riak [something] to garble an argument or pass the wrong one to commands other than riak debug.

martinsumner commented 3 years ago

OK, so it isn't running riak-debug -s, it is running riak-debug - however it is crashing when in that part of the debug script(the collecting of system information), so it looks like it runs -s as that is the only output you see.

This is why it doesn't try to produce the tar file, the script has already exited before it gets to that point.

So running the script direct works, but crashes when run via usr/sbin/riak and the relx generated start script.

Even if we stop the su - riak when we run riak debug, it still crashes. The problem is related to this in relx:

# all extension script locations are expected to be
    # relative to the start script location
    [ "$SCRIPT_DIR/$EXTENSION_SCRIPT" ] && . "$SCRIPT_DIR/$EXTENSION_SCRIPT" "$@"

if we instead use:

    [ "$SCRIPT_DIR/$EXTENSION_SCRIPT" ] && "$SCRIPT_DIR/$EXTENSION_SCRIPT" "$@"

The script will now run through to completion (but with errors). So it is this use of ". "$SCRIPT" to call script rather than simply "$SCRIPT" that causes the script to fail.

Precisely what this notation means is hard to discover via google. But obviously, we don't really want to mess about in the relx extensions stuff - so perhaps it might be better to just try and treat debug differently within usr\sbin\riak

martinsumner commented 3 years ago

The branch has bene updated to bypass the relx generated script, without using symlinks.

mas@riak-ubuntu18:~/dbroot/riak$ sudo dpkg -i rel/pkg/packages/riak_3.0.2dbg1-OTP22.3_amd64.deb 
Selecting previously unselected package riak.
(Reading database ... 184169 files and directories currently installed.)
Preparing to unpack .../riak_3.0.2dbg1-OTP22.3_amd64.deb ...
Unpacking riak (3.0.2dbg1-OTP22.3) ...
Setting up riak (3.0.2dbg1-OTP22.3) ...
Processing triggers for ureadahead (0.100.0-21) ...
Processing triggers for systemd (237-3ubuntu10.42) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
mas@riak-ubuntu18:~/dbroot/riak$ sudo service riak start
mas@riak-ubuntu18:~/dbroot/riak$ sudo riak debug
......................EEEEEE.................................E. ~/dbroot/riak/riak@127.0.0.1-riak-debug.tar.gz
mas@riak-ubuntu18:~/dbroot/riak$

So there are still some errors when running riak-debug, but the script will run through to the end.

martinsumner commented 3 years ago

@martincox @Bob-The-Marauder any objections to proceeding with this version of the PR?