cerndb / hadoop-xrootd

Mirror of CERN db/hadoop-xrootd. Hadoop-XRootD Filesystem Connector
Apache License 2.0

Support readV/writeV #11

Open PerilousApricot opened 5 years ago

PerilousApricot commented 5 years ago

When reading files via XRootD with Spark (https://github.com/spark-root/laurelin), profiling shows that significant round-trip time is being burned because the HadoopFile interface doesn't support vectorized reads/writes: each TTree basket incurs its own RTT penalty, whereas (e.g.) CMSSW issues reads for multiple baskets with a single preadv() call. Moreover, the backing filesystem on the other end typically supports vectorized I/O as well, so it would be a win on that side too.

If hadoop-xrootd were to implement a readv()/writev() interface, I could use it to vastly reduce the number of I/O round-trips for Spark. XrdCl itself supports this via the synchronous XrdCl::File::VectorRead call, so if that C++ function could be exported up through XRootDClFile and then XRootDInputStream, I could issue a single vectorized read instead of potentially hundreds or thousands of individual ones.
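For concreteness, here is a minimal sketch of what the Java surface could look like. The class names XRootDClFile and XRootDInputStream come from this issue; the readv name and signature are invented for illustration, assuming a hypothetical native binding that forwards to XrdCl::File::VectorRead through the existing JNI layer:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical sketch only: hadoop-xrootd does not expose this today.
public class XRootDClFile {
    // Each (offset, length, buffer) triple would map to one XrdCl::ChunkInfo;
    // the native side would then issue a single synchronous
    // XrdCl::File::VectorRead covering all chunks, i.e. one round-trip
    // instead of N. Buffers are assumed to be direct ByteBuffers so the
    // native side can fill them without extra copies.
    public native void readv(long[] offsets, int[] lengths, ByteBuffer[] buffers)
            throws IOException;
}
```

XRootDInputStream could then expose a matching vectorized positioned-read method that simply delegates to this binding, leaving the existing byte-at-a-time read path untouched.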

mrow4a commented 5 years ago

@PerilousApricot let me make sure I understand: some library would call readv()/writev() at the Hadoop interface level (Hadoop itself does not have such an interface, however), and this would be translated into xrootd-client calls. Am I correct?

PerilousApricot commented 5 years ago

Correct. I'd use reflection to get it -- it's dark in this basement.
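A hedged sketch of that caller side, assuming the hypothetical readv(long[], int[], ByteBuffer[]) method from above (none of this exists in the connector yet): the Spark-side reader probes the stream via reflection and falls back to per-basket reads when the method is absent, so nothing breaks against older connector versions.

```java
import java.io.InputStream;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

public final class VectorReadProbe {
    // Probe the stream for a hypothetical readv(long[], int[], ByteBuffer[])
    // method via reflection; return false so the caller can fall back to
    // individual reads when the connector does not provide it.
    public static boolean tryVectorRead(InputStream in, long[] offsets,
                                        int[] lengths, ByteBuffer[] buffers) {
        try {
            Method readv = in.getClass().getMethod(
                    "readv", long[].class, int[].class, ByteBuffer[].class);
            readv.invoke(in, offsets, lengths, buffers);
            return true;
        } catch (ReflectiveOperationException e) {
            return false; // no vectorized read available on this stream
        }
    }
}
```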