dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Add HDFile.readinto(length, out), re-implement read(...) in terms of readinto(...) #162

Closed sk1p closed 6 years ago

sk1p commented 6 years ago

See https://github.com/dask/hdfs3/issues/160 for the in-depth discussion.

martindurant commented 6 years ago

Immediately looks good :) Can you post your benchmarks for this change here, run without the changes in libhdfs3? I expect that you made the default case of returning bytes faster too.

martindurant commented 6 years ago

AttributeError: 'memoryview' object has no attribute 'nbytes'

Perhaps you need to use m.itemsize * reduce(operator.mul, m.shape) (py2 only)?
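
A minimal sketch of that fallback, assuming a helper named buffer_nbytes (a hypothetical name, not part of hdfs3). It uses .nbytes where the memoryview provides it and otherwise computes the size from itemsize and shape:

```python
import operator
from functools import reduce


def buffer_nbytes(m):
    """Return the size in bytes of a memoryview (hypothetical helper)."""
    try:
        # Available on Python 3 memoryviews
        return m.nbytes
    except AttributeError:
        # Fallback for memoryviews without .nbytes (the py2 case above):
        # item size times the product of all dimensions
        return m.itemsize * reduce(operator.mul, m.shape)


view = memoryview(bytearray(16))
size = buffer_nbytes(view)
```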

sk1p commented 6 years ago

Oh yeah, forgot about py2. Turns out py2 also doesn't support creating ctypes arrays from memoryviews, so I now pass the original buffer to .from_buffer(...). Will run benchmarks tomorrow.
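
For readers unfamiliar with .from_buffer(...), here is a rough sketch of the idea using only the standard ctypes module. The function fill_buffer and its payload argument are hypothetical stand-ins for the C read call; this is not the code from the PR:

```python
import ctypes


def fill_buffer(out, payload):
    """Copy payload into the writable buffer `out` in place (hypothetical)."""
    n = min(len(out), len(payload))
    # from_buffer shares memory with the original writable buffer,
    # so no intermediate copy of `out` is created
    c_array = (ctypes.c_char * len(out)).from_buffer(out)
    # Stand-in for the C library writing into the buffer
    ctypes.memmove(c_array, payload, n)
    return n


buf = bytearray(8)
written = fill_buffer(buf, b"hdfs")
```

The bytes written through the ctypes array show up directly in buf, since both refer to the same memory.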

sk1p commented 6 years ago

Updated the benchmark - these are the numbers for the different configurations with unpatched libhdfs3:

old_read(length=READ_SIZE): 3.09601              # ~three copies, crc verification, buffer re-allocation
read(length=READ_SIZE): 2.47153                  # two copies, crc verification, buffer re-allocation
read(length=READ_SIZE, out_buffer=True): 2.15100 # single copy, crc verification, buffer re-allocation
read(length=READ_SIZE, out_buffer=buf): 1.97234  # single copy inside libhdfs3 + crc verification
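
To illustrate where the copies come from, here is a simplified sketch of read(...) built on readinto(length, out). SketchFile is a hypothetical class backed by io.BytesIO, not the HDFile implementation, and it does not reproduce the out_buffer parameter:

```python
import io


class SketchFile(object):
    """Hypothetical file wrapper; not the hdfs3 HDFile class."""

    def __init__(self, raw):
        # Any object with a readinto(buffer) method, assumed for illustration
        self._raw = raw

    def readinto(self, length, out):
        """Read up to `length` bytes into the writable buffer `out`."""
        view = memoryview(out)[:length]
        return self._raw.readinto(view)

    def read(self, length):
        """Allocate a buffer, fill it via readinto, and return bytes."""
        out = bytearray(length)
        n = self.readinto(length, out)
        # Converting to bytes is the extra copy a caller-supplied buffer avoids
        return bytes(out[:n])


f = SketchFile(io.BytesIO(b"0123456789"))
first = f.read(4)
buf = bytearray(4)
count = f.readinto(4, buf)
```

Calling readinto with a reused buffer skips both the allocation and the final conversion to bytes.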

martindurant commented 6 years ago

OK, thank you!