dask / hdfs3

A wrapper for libhdfs3 to interact with HDFS from Python
http://hdfs3.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Update to new dask conventions #148

Closed jcrist closed 6 years ago

jcrist commented 6 years ago

This PR does 3 things (in 3 separate commits):

Note that this makes the development version of hdfs3 incompatible with older versions of dask. If this is a problem, the last 2 commits may be removed.

jcrist commented 6 years ago

Alternatively, I'm wondering about moving all of this functionality to live in dask/bytes/hdfs.py, and moving the tests as well.

This code was originally moved out of distributed to consolidate the hdfs testing infrastructure. However, that has meant changes in distributed can break tests here (see #147), which went uncaught because these tests aren't run regularly. Since distributed-specific functionality has been removed from filesystem handling in dask/dask#3079, these tests could be moved into dask/dask.

I propose the following:

The benefit of this is that the tests get run, and get run in a central location with more eyes on it than dask/hdfs3. It also removes anything dask-specific from the hdfs3 repo. In the long run I anticipate the dask wrapper for this filesystem going away completely, which would make the net effect of this just adding integration tests to the dask repo. Another benefit is that it makes it easier to add dask integration tests for other hdfs libraries (e.g. pyarrow's libhdfs wrapper, https://github.com/dask/dask/issues/3046), as testing those in this repo wouldn't make much sense.

An alternative suggestion would be to create an external repo (dask/dask-integration-tests?) that includes tests such as these. This might seem more appealing at first, but has the following drawbacks:

  • Failures are asynchronous: a change to dask/dask, dask/distributed, or any other relevant library only surfaces later when the integration tests run. Ideally your code would fail in the PR, not later in a daily cron job.
  • The tests live in an external repo that would have to be modified as changes in these libraries occur.
  • Fewer eyes would be on that repo than on dask/dask, which has a steady stream of contributors.

As such, I believe an external testing repo would be less beneficial than a conditional build inside dask proper.
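In pytest terms, a conditional build of this sort usually amounts to a skip marker gated on the environment. A minimal sketch (the environment variable and test name here are hypothetical, not taken from dask's actual CI configuration):

```python
import os

import pytest

# Hypothetical guard: HDFS integration tests run only in a dedicated CI
# job that exports TEST_HDFS=1 and has a namenode available. Everywhere
# else the tests are collected but skipped.
requires_hdfs = pytest.mark.skipif(
    os.environ.get("TEST_HDFS") != "1",
    reason="HDFS integration tests need a running namenode",
)


@requires_hdfs
def test_read_bytes_from_hdfs():
    # Placeholder body; the real tests would exercise dask.bytes
    # against a live HDFS instance.
    pass
```

The regular test suite stays green without Hadoop installed, while the conditional job runs the same file with the marker satisfied.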

cc @mrocklin, @martindurant, @TomAugspurger for thoughts.

mrocklin commented 6 years ago

I'm happy to defer to @jcrist's opinion here. I think that he has the most hands-on experience here.


martindurant commented 6 years ago

Yes, it does become much cleaner now that we guarantee that we are passing through only paths.

Do you think there's a way to keep the offsets/blocks code around somewhere in the codebase? It seems like a useful nugget, even if distributed isn't using it at the moment.

I am -1 on a separate integration test repo; it would too easily become moribund as people's attention will be elsewhere, and it would be unclear just how much testing should be pulled out of dask.

On moving these tests, I am uncertain. It is easier to install dask into a hadoop-running container than to bolt hdfs onto the dask tests, but I do appreciate your point that this specific code can be argued to belong more to dask than to hdfs3. Would we, though, want to pull in tests from other backends? We have s3, with its convenient mock server, but other things may prove more troublesome.

jcrist commented 6 years ago

Do you think there's a way to keep the offsets/blocks code around somewhere in the codebase? It seems like a useful nugget, even if distributed isn't using it at the moment.

The offsets/blocks code in the dask wrapper was just a thin wrapper around the per-file version in hdfs3.core, mapping it across multiple files. If you suspect people might want to use it, I'll defer to your judgement on where it should move to. I highly doubt anyone is currently using it, since we don't really expose the dask wrapper class (and moved it from distributed to here only a short time ago). If your worry is more that the code snippet is useful and we might want it back, then I'd suggest that we can always pull the deleted code back in via version control.
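For reference, the logic being discussed boils down to turning HDFS block metadata into (offset, length) pairs for parallel reads, then mapping that across files. A rough illustration, not the deleted code itself; the input dicts mimic the shape hdfs3's `HDFileSystem.get_block_locations()` returns:

```python
def block_offsets(blocks):
    """Turn one file's HDFS block metadata into (offset, length) pairs.

    `blocks` is a list of dicts with 'offset' and 'length' keys, in the
    shape hdfs3's HDFileSystem.get_block_locations() produces.
    """
    return [(b["offset"], b["length"]) for b in blocks]


def file_offsets(files_to_blocks):
    """Map block_offsets across multiple files, keyed by path."""
    return {path: block_offsets(blocks)
            for path, blocks in files_to_blocks.items()}


# Example with a fake two-block file (128 MiB default block size):
blocks = [
    {"offset": 0, "length": 134217728},
    {"offset": 134217728, "length": 1000},
]
pairs = file_offsets({"/data/a.csv": blocks})
```

Each (offset, length) pair then becomes one delayed read task, ideally scheduled on a worker co-located with that block.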

Would we, though, want to pull in any tests from other backends?

HDFS is a bit different from other backends in that there are multiple drivers for it and multiple libraries interacting with it. Having the hdfs tests in dask proper makes more sense than littering individual dask tests through each smaller library. For other backends I think keeping the dask-related testing in the library itself is fine.


I plan to go forward with this move. I'll push a PR to dask, then scale this PR back to just deleting the obsolete tests. I think we can delete all of the dask-related stuff here (since we only really exposed this to dask), but will defer to you if you think a deprecation cycle is needed.

martindurant commented 6 years ago

The only reason for deprecation would be version mismatches, when people update one library but not the other.

jcrist commented 6 years ago

Agreed. In the past we have done breaking changes that required updating everything in sync, but I'm not sure that's required here. Up to you.

jcrist commented 6 years ago

I've pushed a new PR deprecating the dask integration here. This is only useful if you're running old dask and new hdfs3, which I'd say is unlikely.

martindurant commented 6 years ago

I like that you have a test to make sure that the deprecation is happening.