ContinuumIO / libhdfs3-downstream

a native c/c++ hdfs client (downstream fork from apache-hawq)
Apache License 2.0
40 stars 54 forks source link

Allow for unbuffered reading #10

Closed sk1p closed 6 years ago

sk1p commented 6 years ago

In this dask/hdfs3 issue I detailed that for some I/O-bound workloads, buffering and copying can become a bottleneck. For these cases it would be nice if the buffering could be disabled. I found out that setting input.localread.default.buffersize to a small number and disabling CRC verification results zero memory copies, that is, data is read directly into the buffer provided by the user. That should be exactly the case if this condition is hit.

I think the best long-term way would be to compute the CRC on the user buffer somehow, but in the meantime, it would be nice if verification could be disabled from the C API.

I propose adding a new configuration property, let's say input.read.default.verify, to set the value for verification when opening the file via the C API

Now, my question is: does the configuration only apply to this one place in libhdfs3, or others? And: can we just invent our own configuration parameters - is the namespace somehow divided such that libhdfs3 owns input.*?

martindurant commented 6 years ago

Is there any telling how much of the extra time is in copying and how much in the CRC verification? Of course, I agree that the user should be able to disable verification for the sake of performance, but perhaps the copy can be avoided in every case. I see you mention that in the linked issue.

To your specific question, yes we can add any configuration parameter at the time of creating the C hdfs client. Hadoop services generally ignore any parameter not meant for them anyway, but I would not put this into any site.xml. The user need never know what that config exist, there could be a verify=True flag on HDFSFileSystem.

sk1p commented 6 years ago

Is there any telling how much of the extra time is in copying and how much in the CRC verification?

I'm actually not sure. The profile I'm looking at right now is a bit confusing (all over the place), I'll look at it in detail tomorrow. First guess: the buffering destroys cache efficiency and other parts of the program are affected by that. Anyways, here is the profile with verify=False:

selection_016

And this with verify=True:

selection_017

Both profiles of my python hdf3 testcase recorded with perf record -g -F 1000 python bench_buffering.py (with only read_new_style enabled, dataset size increased to 8GB).

To your specific question, yes we can add any configuration parameter at the time of creating the C hdfs client [...] I would not put this into any site.xml. The user need never know what that config exist

Ahh, nice! Hadn't looked into how the lib is initialized, makes sense now!

sk1p commented 6 years ago

So, the really bad performance was actually a combination of setting a very small input.localread.default.buffersize (which was still set to disable buffering together with disabled verification) and having verification enabled. That's why there is so much Python-related code in the profile: the libhdfs3 functions had to be called many times for the same amount of data. Now, the results are less dramatic (see source, with a prototype patch for disabling verification; 'old style' refers to the version with buffering in the Python interface, 'new style' to the read_into variant):

# verification and buffering enabled
config: {'input.localread.default.buffersize': '1048576', 'input.read.default.verify': '1'}
--- starting timer old style ---
--- stopping timer old style, delta=3.32866 ---
--- starting timer new style ---
--- stopping timer new style, delta=2.07793 ---
--- starting timer new style w/ realloc ---
--- stopping timer new style w/ realloc, delta=2.29483 ---

# only buffering, no verification
config: {'input.localread.default.buffersize': '1048576', 'input.read.default.verify': '0'}
--- starting timer old style ---
--- stopping timer old style, delta=2.43243 ---
--- starting timer new style ---
--- stopping timer new style, delta=1.22421 ---
--- starting timer new style w/ realloc ---
--- stopping timer new style w/ realloc, delta=1.43843 ---

# no buffering, no verification
config: {'input.localread.default.buffersize': '1', 'input.read.default.verify': '0'}
--- starting timer old style ---
--- stopping timer old style, delta=2.17175 ---
--- starting timer new style ---
--- stopping timer new style, delta=0.91157 ---
--- starting timer new style w/ realloc ---
--- stopping timer new style w/ realloc, delta=1.10626 ---

Here is a profile with verification and buffering enabled (more detailed as I took it with perf record --call-graph=dwarf to eliminate almost all the [unknown] symbols):

selection_018

The CRC verification takes up ~34% of samples, memmove takes ~15%. Getting rid of the copy will still be a win, though.

martindurant commented 6 years ago

@sk1p , this is now released on condda-forge https://github.com/conda-forge/libhdfs3-feedstock/pull/19

martindurant commented 6 years ago

If you would like to add a kwarg to HDFileSystem, which passes the correct config to turn off CRC, that would be very reasonable.

sk1p commented 6 years ago

Thanks! PR for crc parameter is ready at https://github.com/dask/hdfs3/pull/164