Closed martinsumner closed 3 years ago
I'm afraid the substitution is failing: when I run riak debug, it again runs riak-debug -s, so it only runs a selection of the commands instead of the full suite. It also fails to collect other information such as logs and patches. Finally, it dumps its output to /tmp/riak@127.0.0.1-riak-debug as a folder and finishes there, rather than bundling that folder into a tar.gz file in the folder from which riak debug was originally called and then removing the folder from the /tmp directory. Please see the output differences below:
Pull 1047 top level
[root@localhost ~]# ls /tmp/riak@127.0.0.1-riak-debug/
commands
Pull 1040 top level
[root@localhost ~]# ls riak@127.0.0.1-riak-debug
commands config logs patches ring
Pull 1047 commands folder
[root@localhost ~]# ls /tmp/riak@127.0.0.1-riak-debug/commands/
blockdev._dev_sda1 df dmesg hostname last ps sestatus sysctl vmstat
date df_i free java_version mount rpm swapon uname w
Pull 1040 commands folder
[root@localhost ~]# ls riak@127.0.0.1-riak-debug/commands/
blockdev._dev_sda1 diskstats java_version mount riak_ping riak_status sestatus
cpuinfo dmesg last ps riak_repl_clusterstats riak_transfers swapon
date free limits.conf redhat_release riak_repl_connections riak_version sysctl
df hostname limits.d riak_diag riak_repl_modes rpm uname
df_i iptables_nat meminfo riak_http_stats riak_repl_status rx_crc_errors vmstat
disk_by_id iptables_rules messages riak_member_status riak_ring_status schedulers w
If I go into /usr/lib64/riak/bin and execute ./riak-debug in pull 1047, I get almost correct behaviour. It pulls everything it is meant to, deletes its folder in /tmp and creates a .tar.gz in the folder I executed ./riak-debug from. This direct calling is what the symlinks achieved in pull 1040.
Given the similarity of the two pull requests, I think we either add riaknostic to 1040 or add the symlinks to 1047.
The problem with the symlinks is that I've had them crash an install (when the symlink is already present). This changes the presence of riak debug into something that is potentially harmful for the people that don't use it.
I will try and look deeper later today.
Could we get around the pre-existing symlink issue by adding rm -rf /usr/sbin/riak-debug after the preun section (# Pre-uninstall script) in rel/pkg/rpm/specfile, and to rel/pkg/deb/debian/postrm? If the symlinks are deleted on uninstall, they should not be an issue when they are re-created.
Alternatively, we could try something like
[ -L /usr/sbin/riak-debug ] || (ln -s /usr/lib64/riak/bin/riak-debug /usr/sbin/riak-debug)
so that we only attempt to create the symlink if it doesn't exist. Obviously there would be different paths for Ubuntu/Debian vs RHEL/CentOS.
Please note the above are off the top of my head and not tested yet.
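The two suggestions above (create the link only when it is missing, and remove it on uninstall) could be sketched as follows. This is a minimal, untested-in-packaging sketch: the helper names ensure_symlink and remove_symlink are hypothetical, and the demo runs against a temporary directory rather than /usr/sbin, so the paths are stand-ins only.

```shell
#!/bin/sh
# Hypothetical helpers; not taken from the riak packaging scripts.

# ensure_symlink TARGET LINK
# Create LINK pointing at TARGET unless LINK is already a symlink.
ensure_symlink() {
    [ -L "$2" ] || ln -s "$1" "$2"
}

# remove_symlink LINK
# Remove LINK only if it is a symlink, so a real file is never deleted.
remove_symlink() {
    if [ -L "$1" ]; then
        rm -f "$1"
    fi
}

# Demo in a scratch directory standing in for the install and sbin paths.
demo_dir=$(mktemp -d)
touch "$demo_dir/riak-debug"

ensure_symlink "$demo_dir/riak-debug" "$demo_dir/riak-debug-link"
# A second call must not fail even though the link already exists.
ensure_symlink "$demo_dir/riak-debug" "$demo_dir/riak-debug-link"
second_status=$?

link_after_create=no
[ -L "$demo_dir/riak-debug-link" ] && link_after_create=yes

remove_symlink "$demo_dir/riak-debug-link"
link_after_remove=no
[ -L "$demo_dir/riak-debug-link" ] && link_after_remove=yes

echo "second create status: $second_status"
echo "link present after create: $link_after_create"
echo "link present after remove: $link_after_remove"
rm -rf "$demo_dir"
```

In a real package the create step would presumably sit in the post-install script and the remove step in the pre-uninstall or post-remove script, with the distribution-specific paths substituted.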
As I understand it, the link is only resolving the problem in that it now allows you to run:
sudo riak-debug
i.e. this works without the link if you ran:
sudo /usr/lib/riak/bin/riak-debug
The command we want to run:
sudo riak debug
doesn't work as expected with or without the link. In fact it behaves strangely, acting as if you had passed in config options to produce some stats and not generate a file. If you pass in config switches it behaves in bizarre ways, e.g.:
sudo riak debug -r
will produce the riak outputs, and will now attempt to create the tar/zip file, but will fail as it tries to dump the tar/zip file in a privileged location.
It would perhaps be understandable if sudo riak debug behaved as if no switches had been passed, but this isn't true.
So something is garbling the call to riak-debug when it goes through the main riak script. We can ignore this and run riak-debug directly (either by calling it in its actual location or by making a symbolic link to that location in the path). But there must be a concern that this is a canary for a more general problem with passing information through the relx generated riak script.
There's nothing obvious in the actual riak script to explain this:
relx_is_extension() {
    EXTENSION=$1
    case "$EXTENSION" in
        admin|debug|repl|chkconfig|undefined)
            echo "1"
        ;;
        *)
            echo "0"
        ;;
    esac
}
relx_get_extension_script() {
    EXTENSION=$1
    # below are the extensions declarations
    # of the form:
    # foo_extension="path/to/foo_script";bar_extension="path/to/bar_script"
    admin_extension="riak-admin";debug_extension="riak-debug";repl_extension="riak-repl";chkconfig_extension="riak-chkconfig"
    # get the command extension (eg. foo) and
    # obtain the actual script filename that it
    # refers to (eg. "path/to/foo_script"
    eval echo "$"${EXTENSION}_extension""
}
relx_run_extension() {
    # drop the first argument which is the name of the
    # extension script
    EXTENSION_SCRIPT=$1
    shift
    # all extension script locations are expected to be
    # relative to the start script location
    [ "$SCRIPT_DIR/$EXTENSION_SCRIPT" ] && . "$SCRIPT_DIR/$EXTENSION_SCRIPT" "$@"
}
    *)
        # check for extension
        IS_EXTENSION=$(relx_is_extension $1)
        if [ "$IS_EXTENSION" = "1" ]; then
            EXTENSION_SCRIPT=$(relx_get_extension_script $1)
            shift
            relx_run_extension $EXTENSION_SCRIPT "$@"
            # all extension scripts are expected to exit
        else
            relx_usage $1
        fi
        exit 1
        ;;
esac
and the riak-debug script is explicit about what the defaults should be if nothing is passed (and this isn't happening):
if [ 0 -eq $(( $get_cfgs + $get_logs + $get_patches + $get_riakcmds + $get_yzcmds + $get_syscmds + $get_extracmds )) ]; then
    # Nothing specific was requested, so get everything except extracmds
    get_cfgs=1
    get_logs=1
    get_patches=1
    get_riakcmds=1
    get_yzcmds=1
    get_syscmds=1
    get_extracmds=0
fi
So I don't think this is because the wrong switches are being passed in, e.g. "-s".
If we're running riak debug, we're doing this as the riak user. I think the script is failing at some point (probably whilst getting the system data; as the system data is the first thing to run, that's why it looks as if just that switch has been passed). This may be because of privileges, whereas if we run sudo riak-debug directly we're doing this as root.
That's exactly where my previous work got me to. I couldn't for the life of me find why riak debug called riak-debug -s and then stopped working at the /tmp directory level. I, possibly foolishly, thought it was only related to riak debug and, given time constraints, opted to find a temporary workaround rather than a solution. Hence the symlinks.
What you've expressed here has raised a concern that there is potential for riak [something] to garble an argument, or pass the wrong one, to commands other than riak debug.
OK, so it isn't running riak-debug -s, it is running riak-debug. However, it is crashing when in that part of the debug script (the collecting of system information), so it looks like it runs -s, as that is the only output you see.
This is why it doesn't try to produce the tar file: the script has already exited before it gets to that point.
So running the script directly works, but it crashes when run via /usr/sbin/riak and the relx generated start script.
Even if we stop the su - riak when we run riak debug, it still crashes. The problem is related to this in relx:
# all extension script locations are expected to be
# relative to the start script location
[ "$SCRIPT_DIR/$EXTENSION_SCRIPT" ] && . "$SCRIPT_DIR/$EXTENSION_SCRIPT" "$@"
If we instead use:
[ "$SCRIPT_DIR/$EXTENSION_SCRIPT" ] && "$SCRIPT_DIR/$EXTENSION_SCRIPT" "$@"
The script will now run through to completion (but with errors). So it is the use of . "$SCRIPT" to call the script, rather than simply "$SCRIPT", that causes the script to fail.
Precisely what this notation means is hard to discover via Google. But obviously, we don't really want to mess about in the relx extensions stuff, so perhaps it might be better to just try and treat debug differently within /usr/sbin/riak.
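For reference, a leading . is the POSIX "dot" command: it reads the named file and runs its commands in the current shell, instead of starting a new process as a plain "$SCRIPT" call does. The sketch below illustrates two consequences of that. It does not establish which of them (if either) is behind the riak-debug crash, and the file names and variable values are made up for the demonstration.

```shell
#!/bin/sh
# Illustrative only: these stub scripts are stand-ins, not riak or relx code.
demo_dir=$(mktemp -d)

# Stub that overwrites a variable name the caller also uses.
cat > "$demo_dir/sets_var.sh" <<'EOF'
#!/bin/sh
SCRIPT_DIR=/changed/by/extension
EOF

# Stub that prints one line and then exits with a non-zero status.
cat > "$demo_dir/exits.sh" <<'EOF'
#!/bin/sh
echo "collecting"
exit 3
EOF
chmod +x "$demo_dir/sets_var.sh" "$demo_dir/exits.sh"

# 1. Variables: executing leaves the caller's variable alone,
#    sourcing changes it in place.
SCRIPT_DIR=/original
"$demo_dir/sets_var.sh"
after_execute=$SCRIPT_DIR
. "$demo_dir/sets_var.sh"
after_source=$SCRIPT_DIR

# 2. exit: when a sourced file calls exit, the calling shell itself exits,
#    so the echo after it never runs. A command substitution is used here
#    so that only the inner shell is terminated.
sourced_exit_status=0
sourced_exit_output=$( . "$demo_dir/exits.sh"; echo "not reached" ) || sourced_exit_status=$?

echo "after executing: $after_execute"
echo "after sourcing:  $after_source"
echo "sourced exit output: $sourced_exit_output (status $sourced_exit_status)"
rm -rf "$demo_dir"
```

Because a sourced script shares the caller's variables, functions, shell options and exit behaviour, a script written to be run standalone can behave differently when it is pulled in with the dot command.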
The branch has been updated to bypass the relx generated script, without using symlinks.
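The actual change is in the branch and is not reproduced in this thread. As a rough idea of what bypassing the relx generated script for one subcommand can look like, here is a hypothetical sketch: the dispatch function and both stub scripts are invented for illustration, and the demo uses a temporary directory rather than the installed paths.

```shell
#!/bin/sh
# Hypothetical sketch; not the contents of the real wrapper script.
demo_dir=$(mktemp -d)

# Stub standing in for riak-debug: reports the arguments it received.
cat > "$demo_dir/riak-debug" <<'EOF'
#!/bin/sh
echo "riak-debug called with $# argument(s): $*"
EOF

# Stub standing in for the relx generated start script.
cat > "$demo_dir/riak-relx" <<'EOF'
#!/bin/sh
echo "relx script called with: $*"
EOF
chmod +x "$demo_dir/riak-debug" "$demo_dir/riak-relx"

# Route "debug" straight to the debug script as a separate process,
# and hand every other subcommand to the relx generated script.
dispatch() {
    case "$1" in
        debug)
            shift
            "$demo_dir/riak-debug" "$@"
            ;;
        *)
            "$demo_dir/riak-relx" "$@"
            ;;
    esac
}

debug_out=$(dispatch debug -r)
other_out=$(dispatch ping)
echo "$debug_out"
echo "$other_out"
rm -rf "$demo_dir"
```

The point of the design is that the debug script runs in its own process with its own arguments, so nothing in the relx extension handling can interfere with it.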
mas@riak-ubuntu18:~/dbroot/riak$ sudo dpkg -i rel/pkg/packages/riak_3.0.2dbg1-OTP22.3_amd64.deb
Selecting previously unselected package riak.
(Reading database ... 184169 files and directories currently installed.)
Preparing to unpack .../riak_3.0.2dbg1-OTP22.3_amd64.deb ...
Unpacking riak (3.0.2dbg1-OTP22.3) ...
Setting up riak (3.0.2dbg1-OTP22.3) ...
Processing triggers for ureadahead (0.100.0-21) ...
Processing triggers for systemd (237-3ubuntu10.42) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
mas@riak-ubuntu18:~/dbroot/riak$ sudo service riak start
mas@riak-ubuntu18:~/dbroot/riak$ sudo riak debug
......................EEEEEE.................................E. ~/dbroot/riak/riak@127.0.0.1-riak-debug.tar.gz
mas@riak-ubuntu18:~/dbroot/riak$
So there are still some errors when running riak-debug, but the script will run through to the end.
@martincox @Bob-The-Marauder any objections to proceeding with this version of the PR?
Alternative PR to https://github.com/basho/riak/pull/1040
This includes updates to riaknostic so riak admin diag will work.
sudo riak debug appears to work straight from install then start (tested on Ubuntu 18).