gleventhal opened this issue 1 week ago (open)
Hello,
At work, we simply use rsync every minute from each client (hundreds at the moment, not thousands), with a random sleep to avoid hammering the rsync server. See https://github.com/Atoptool/atop/issues/140 for a small discussion about it. We run diskless machines and keep only one hour of atop logs locally; retention on the rsync destination is handled separately.
We take some precautions with the atop filename to prevent the previous file from being truncated in case of a restart or reboot:
# /etc/default/atop
# [...]
# Note: CURDAY is not configurable (see https://github.com/Atoptool/atop/issues/140)
#
# Add hour (and minute) because atop will be restarted every hour
# Add boot time (as a timestamp) because the node could be rebooted
#CURDAY=`date +%Y%m%d`
CURDAY=$(date +%Y%m%d)_$(date +%H%M)_$(date --date="$(uptime -s)" +%Y%m%d%H%M%S)
# [...]
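With this override, each atop restart writes to a new file instead of truncating the one from the previous run. As a quick sanity check, here is a sketch of the resulting filename, assuming the stock atop.daily naming scheme of /var/log/atop/atop_${CURDAY}:
#!/bin/bash
# Preview the log filename produced by the modified CURDAY definition above.
# Assumes the stock naming scheme /var/log/atop/atop_${CURDAY}.
CURDAY=$(date +%Y%m%d)_$(date +%H%M)_$(date --date="$(uptime -s)" +%Y%m%d%H%M%S)
echo "/var/log/atop/atop_${CURDAY}"
# e.g. /var/log/atop/atop_20240115_1400_20240110083000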
This is what our rsync script looks like (the destination directory is implicitly created, which is nice):
#!/bin/bash
# Ansible managed, please don't edit this file directly
### PREVENT concurrent execution
# This is useful boilerplate code for shell scripts. Put it at the top of the shell script you want to lock and it'll automatically lock itself on the first run.
# If the env var $FLOCKER is not set to the shell script that is being run, then execute flock and grab an exclusive non-blocking lock (using the script itself as the lock file)
# before re-execing itself with the right arguments. It also sets the FLOCKER env var to the right value so it doesn't run again.
[ "${FLOCKER}" != "$0" ] && exec env FLOCKER="$0" flock --close -en "$0" "$0" "$@" || :
SHORTHOSTNAME=$(hostname -s)
RSYNC_DEST=atop@rsync-server::atop/${SHORTHOSTNAME?}
RSYNC_SRC=/var/log/atop/
RSYNC_SLEEP=$((RANDOM % 30))
TIMEOUT=15
sleep ${RSYNC_SLEEP?}
timeout ${TIMEOUT?} /usr/bin/rsync --password-file /etc/atop-rsync.secret -a ${RSYNC_SRC?} ${RSYNC_DEST?}
EXIT_VALUE=$?
if [ "${EXIT_VALUE?}" -ne 0 ]; then
logger -t atop-rsync "atop_rsync.sh from ${RSYNC_SRC?} to ${RSYNC_DEST?} failed (exit value was: ${EXIT_VALUE?})"
exit 1
fi
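To run this every minute, one option is a cron entry along these lines (the script path and user below are assumptions; adjust to wherever the script is actually installed):
# /etc/cron.d/atop-rsync (file name, user and script path are placeholders)
* * * * * root /usr/local/bin/atop_rsync.sh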
It works well enough for us at the moment. See the previously mentioned issue for more details.
@gleventhal not sure if it helps, but there is a modified version of atop, pcp-atop(1), in the Performance Co-Pilot (pcp.io) toolkit which supports distributed operation: either communicating directly with a remote host, or doing centralized recording from remote hosts as you are looking for.
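For what it's worth, here is a minimal, hypothetical sketch of the centralized-recording variant, assuming pmcd is already running on each client and a central host does the recording (the host name, metric selection and paths are placeholders; see pmlogger(1) and pcp-atop(1) for the real options):
#!/bin/bash
# Record metrics from a remote pmcd into a local PCP archive on a central host.
# client01.example.com, the metric list and the archive path are placeholders.
cat > /tmp/central-record.config <<'EOF'
log mandatory on every 60 seconds {
    kernel.all.cpu
    kernel.all.load
    mem.util
    disk.dev
}
EOF
mkdir -p /var/log/pcp/pmlogger/client01
pmlogger -h client01.example.com \
         -c /tmp/central-record.config \
         /var/log/pcp/pmlogger/client01/$(date +%Y%m%d)
The resulting archive can then be replayed interactively, e.g. with pcp atop -r <archive>, if I am reading pcp-atop(1) correctly.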
Actually, I am wondering whether Prometheus metrics would be a good way to export the data; then we could delegate data storage and querying to Prometheus.
A very incomplete PoC here:
https://github.com/Atoptool/atop/commit/43e1124b85c56836c9cdcd68ec9d8118994af055
What I have in mind is either running a local Prometheus to scrape atop in real time, or aggregating the data centrally.
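For the local-scrape variant, a rough sketch of what the Prometheus side could look like, assuming the PoC eventually exposes a /metrics endpoint (the port 9123 and job name below are purely hypothetical placeholders, not something the PoC commit defines):
#!/bin/bash
# Minimal Prometheus config pointing at a hypothetical atop metrics endpoint.
# The port (9123) and job name are placeholders for illustration only.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 60s
scrape_configs:
  - job_name: atop
    static_configs:
      - targets: ['localhost:9123']
EOF
# Then run a local Prometheus against it:
# prometheus --config.file=prometheus.yml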
Anyone interested in collaborating? I'm happy to continue my prototyping regardless, for our data center use cases.
Hi @gleventhal, I wrote atophttpd together with my colleague. It reads atop log files and provides an HTTP/HTTPS service. Hope this helps.
The atop log file is stored in a raw binary format, and different atop versions use different log formats. Storing lots of atop logs (possibly from several versions) in a centralized datastore may be difficult to manage. Instead, atophttpd provides JSON data, which is friendlier.
I love atop; it's the best tool of its kind for wide use, IMHO. I have many thousands of computers and would like to be able to deal with atop logs in a centralized way without requiring that the logs be stored on local disk. I also want to retain at least several weeks of logs for each host.
Is there any recommended procedure, or are there plans to support a centralized datastore, or at least any optimizations for running atop with the data file located on a DFS (Ceph, NFS, S3, etc.)?