Closed sk1p closed 6 years ago
Is there any telling how much of the extra time is in copying and how much in the CRC verification? Of course, I agree that the user should be able to disable verification for the sake of performance, but perhaps the copy can be avoided in every case. I see you mention that in the linked issue.
To your specific question, yes we can add any configuration parameter at the time of creating the C hdfs client. Hadoop services generally ignore any parameter not meant for them anyway, but I would not put this into any site.xml. The user need never know what that config exist, there could be a verify=True
flag on HDFSFileSystem.
Is there any telling how much of the extra time is in copying and how much in the CRC verification?
I'm actually not sure. The profile I'm looking at right now is a bit confusing (all over the place), I'll look at it in detail tomorrow. First guess: the buffering destroys cache efficiency and other parts of the program are affected by that. Anyways, here is the profile with verify=False
:
And this with verify=True
:
Both profiles of my python hdf3 testcase recorded with perf record -g -F 1000 python bench_buffering.py
(with only read_new_style
enabled, dataset size increased to 8GB).
To your specific question, yes we can add any configuration parameter at the time of creating the C hdfs client [...] I would not put this into any site.xml. The user need never know what that config exist
Ahh, nice! Hadn't looked into how the lib is initialized, makes sense now!
So, the really bad performance was actually a combination of setting a very small input.localread.default.buffersize
(which was still set to disable buffering together with disabled verification) and having verification enabled. That's why there is so much Python-related code in the profile: the libhdfs3 functions had to be called many times for the same amount of data. Now, the results are less dramatic (see source, with a prototype patch for disabling verification; 'old style' refers to the version with buffering in the Python interface, 'new style' to the read_into
variant):
# verification and buffering enabled
config: {'input.localread.default.buffersize': '1048576', 'input.read.default.verify': '1'}
--- starting timer old style ---
--- stopping timer old style, delta=3.32866 ---
--- starting timer new style ---
--- stopping timer new style, delta=2.07793 ---
--- starting timer new style w/ realloc ---
--- stopping timer new style w/ realloc, delta=2.29483 ---
# only buffering, no verification
config: {'input.localread.default.buffersize': '1048576', 'input.read.default.verify': '0'}
--- starting timer old style ---
--- stopping timer old style, delta=2.43243 ---
--- starting timer new style ---
--- stopping timer new style, delta=1.22421 ---
--- starting timer new style w/ realloc ---
--- stopping timer new style w/ realloc, delta=1.43843 ---
# no buffering, no verification
config: {'input.localread.default.buffersize': '1', 'input.read.default.verify': '0'}
--- starting timer old style ---
--- stopping timer old style, delta=2.17175 ---
--- starting timer new style ---
--- stopping timer new style, delta=0.91157 ---
--- starting timer new style w/ realloc ---
--- stopping timer new style w/ realloc, delta=1.10626 ---
Here is a profile with verification and buffering enabled (more detailed as I took it with perf record --call-graph=dwarf
to eliminate almost all the [unknown]
symbols):
The CRC verification takes up ~34% of samples, memmove
takes ~15%. Getting rid of the copy will still be a win, though.
@sk1p , this is now released on condda-forge https://github.com/conda-forge/libhdfs3-feedstock/pull/19
If you would like to add a kwarg to HDFileSystem, which passes the correct config to turn off CRC, that would be very reasonable.
Thanks! PR for crc
parameter is ready at https://github.com/dask/hdfs3/pull/164
In this dask/hdfs3 issue I detailed that for some I/O-bound workloads, buffering and copying can become a bottleneck. For these cases it would be nice if the buffering could be disabled. I found out that setting
input.localread.default.buffersize
to a small number and disabling CRC verification results zero memory copies, that is, data is read directly into the buffer provided by the user. That should be exactly the case if this condition is hit.I think the best long-term way would be to compute the CRC on the user buffer somehow, but in the meantime, it would be nice if verification could be disabled from the C API.
I propose adding a new configuration property, let's say
input.read.default.verify
, to set the value for verification when opening the file via the C APINow, my question is: does the configuration only apply to this one place in libhdfs3, or others? And: can we just invent our own configuration parameters - is the namespace somehow divided such that libhdfs3 owns
input.*
?