dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Add __next__ (and next) to HDFile #121

Closed gglanzani closed 7 years ago

gglanzani commented 7 years ago

This allows libraries such as pandas to read an HDFile as a buffer.

martindurant commented 7 years ago

Whilst I don't see any reason not to implement these, I am surprised that you find __iter__ is not enough. HDFiles are routinely used in conjunction with pandas - what kind of problem were you having? Furthermore, it seems to me like next could more simply call readline(); could you add some tests, please, to ensure that the functionality works as required?
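
The suggestion to have next call readline() can be sketched as follows. This is only an illustration under stated assumptions: LineFile is a hypothetical stand-in backed by io.BytesIO, not the actual HDFile implementation.

```python
import io


class LineFile:
    """Hypothetical stand-in for a binary file object; not the real HDFile."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def readline(self):
        # Returns b"" once the end of the data is reached.
        return self._buf.readline()

    def __iter__(self):
        return self

    def __next__(self):
        # Delegate to readline() and translate end-of-file into StopIteration.
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    next = __next__  # Python 2 spelling of the same method


f = LineFile(b"a,b\n1,2\n3,4\n")
first = next(f)   # consumes the header line
rest = list(f)    # the remaining lines
```

Because both names point at one method that defers to readline(), the line-splitting logic lives in a single place.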

gglanzani commented 7 years ago

pandas checks for next explicitly, so it fails when you do

with hdfs.open(path_to_csv) as f:
    df = pd.read_csv(f)

I will look into readline and tests!

gglanzani commented 7 years ago

Thanks for the comment.

I've updated next and __next__ for now (@pitrou: thanks for the tip, I was sure there was an easier way).

@martindurant As for wrapping in TextIOWrapper: do you have some pointers on the path to take?

On the side: can this be merged in the meantime? It would be very helpful to have this feature.

martindurant commented 7 years ago

The wrapper should work like this:

import io
with hdfs.open(path_to_csv) as f:
    df = pd.read_csv(io.TextIOWrapper(f))

where Pandas would now see a text-mode file with buffering and correct line-end handling.
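
To illustrate the line-end handling without an HDFS cluster, here is a small sketch that uses io.BytesIO as a stand-in for the binary file returned by hdfs.open (an assumption for demonstration only) and the standard csv module in place of pandas.

```python
import csv
import io

# Stand-in for a binary-mode HDFS file with Windows-style line endings.
raw = io.BytesIO(b"a,b\r\n1,2\r\n")

# Wrap the byte stream so that consumers see decoded text.
text = io.TextIOWrapper(raw, encoding="utf-8")

# With the default newline handling, "\r\n" is translated to "\n".
lines = list(text)

# Rewind and parse the same stream as CSV.
text.seek(0)
rows = list(csv.reader(text))
```

The wrapper yields str rather than bytes, which is what a text-mode parser expects.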

I would merge, but there ought to be some test of the new method(s). I notice, also, that _genline is now essentially repeated, so it could be refactored - but this is not important.

gglanzani commented 7 years ago

@martindurant I've written an additional test.

BTW, pandas is using readline() to read f :)

martindurant commented 7 years ago

Cool, thank you.