UW-Madison-HEP / xrootd-hdfs

HDFS integration for Xrootd.
Apache License 2.0

Store checksum information as xattr on hdfs #25

Open juztas opened 4 years ago

juztas commented 4 years ago

Since the 2.5.0 release, Hadoop has supported extended attributes (xattr), so the checksum values could be stored as xattrs on the files themselves rather than as separate files under the /cksums directory, as is done right now.

PerilousApricot commented 4 years ago

As @bbockelm mentioned -- you can't access xattrs from the libhdfs C library, even in the latest trunk, so it will be difficult to use them from this plugin (see https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/include/hdfs/hdfs.h).

kreczko commented 2 years ago

While extended attributes are not available in the libhdfs C library, XRootD now allows drop-in checksum plugins. I've written such a plugin in Python, which stores the checksum results in the extended attributes. It is currently under test. It borrows heavily from the cephsum plugin.
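For readers unfamiliar with how such a plugin can checksum a large file without loading it into memory, here is a minimal sketch of a chunked Adler-32 calculation. It is not the xrdsum implementation: the function name `adler32_hex`, the `chunk_size` default, and the in-memory stream standing in for an HDFS file are all assumptions made for illustration. Only `zlib.adler32` from the Python standard library is used.

```python
import io
import zlib


def adler32_hex(stream, chunk_size=1024 * 1024):
    """Return the Adler-32 of a binary stream as 8 lowercase hex digits.

    Hypothetical helper, not part of xrdsum. `stream` is any object with a
    binary read(n) method; `chunk_size` is an arbitrary illustrative default.
    """
    checksum = 1  # Adler-32 starts from 1, not 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # zlib.adler32 accepts a running value, so chunks can be chained
        checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    # In-memory stand-in for a file opened from HDFS
    print(adler32_hex(io.BytesIO(b"Wikipedia"), chunk_size=4))  # prints 11e60398
```

Because the running value is threaded through each call, the result does not depend on the chunk size, which is why a read-size option can be tuned for throughput without changing the checksum.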

To try it out, you will need Python >=3.8:

pip install xrdsum[hdfs]

Usage example:

/usr/bin/time -v xrdsum --verbose  --debug get  <HDFS path to file> --read-size 128

xrootd config

# ensure cksum adler32 is included in the tpc directive, in order to calculate by default on transfer
ofs.tpc cksum adler32 fcreds ?gsi =X509_USER_PROXY autorm xfr 40 pgm /etc/xrootd/xrdcp-tpc.sh

# add this line to trigger external checksum calculation. It would be overridden by any other xrootd.chksum line
xrootd.chksum max 50 adler32 /etc/xrootd/xrdsum.sh

with /etc/xrootd/xrdcp-tpc.sh containing:

#!/bin/bash
# bash is required: the ${@:offset:length} slices below are not POSIX sh

# from https://github.com/snafus/cephsum/blob/master/scripts/xrdcp-tpc.sh
#Original code
#/usr/bin/xrdcp --server -f $1 root://$XRDXROOTD_PROXY/$2

# Take the last two arguments as SRC and DST; all others are passed through as additional arguments
OTHERARGS="${@:1:$#-2}"
DSTFILE="${@:$#:1}"
SRCFILE="${@:$#-1:1}"

/usr/bin/xrdcp $OTHERARGS --server -f $SRCFILE root://$XRDXROOTD_PROXY/$DSTFILE

and with /etc/xrootd/xrdsum.sh containing:

#!/usr/bin/env bash

RESULT=$(xrdsum get --store-result --chunk-size 64 --verbose --storage-catalog /etc/xrootd/storage.xml "$1")
ECODE=$?

# XRootD expects return on stdout - checksum followed by a new line
printf "%s\n" "$RESULT"
exit "$ECODE"
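
As an aside on the argument handling in xrdcp-tpc.sh above, the following sketch shows the same "last two arguments are source and destination" split in Python. The function name `build_xrdcp_command`, the example arguments, and the proxy value are hypothetical; in the shell script the proxy comes from the XRDXROOTD_PROXY environment variable.

```python
def build_xrdcp_command(argv, proxy):
    """Split argv into pass-through options, source and destination.

    Hypothetical helper mirroring xrdcp-tpc.sh: the last two items are the
    source and destination, everything before them is passed to xrdcp as-is.
    """
    if len(argv) < 2:
        raise ValueError("expected at least a source and a destination")
    *other_args, src, dst = argv
    # A list keeps each argument intact, even if a path contains spaces
    return ["/usr/bin/xrdcp", *other_args, "--server", "-f", src,
            f"root://{proxy}/{dst}"]


if __name__ == "__main__":
    # Example values only; real arguments are supplied by XRootD
    print(build_xrdcp_command(
        ["--cksum", "adler32", "root://source.example//store/a", "/store/b"],
        proxy="localhost:1094",
    ))
```

Building the command as a list avoids the word splitting that unquoted shell variables are subject to, at the cost of no longer being a drop-in shell script.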