PerilousApricot opened this issue 5 years ago
@PerilousApricot let me understand: some library will call readv()/writev() at the Hadoop interface level (Hadoop does not have it, however), and this will be translated to xrootd-client calls, am I correct?
Correct. I'd use reflection to get it -- It's dark in this basement.
When reading files via XRootD with Spark (https://github.com/spark-root/laurelin), profiling shows that significant time is burned on round trips because the Hadoop file interface doesn't support vectored reads/writes: each TTree basket incurs its own RTT penalty, compared with (e.g.) CMSSW, which issues reads for multiple baskets with a single preadv() call. Not to mention, the backing filesystem on the other end typically supports vectored I/O as well, so it would be a win on that side too.
If hadoop-xrootd were to implement a readv()/writev() interface, I could use it to vastly reduce the number of I/O round-trips for Spark. XrdCl itself supports this via the synchronous `XrdCl::File::VectorRead` call, so if that C++ function could be exported up to `XRootDClFile` and then `XRootDInputStream`, I could issue a single vectored read instead of potentially hundreds or thousands of individual reads.