Closed: jcrist closed this 6 years ago.
Alternatively, I'm wondering about moving all of this functionality to live in dask/bytes/hdfs.py, and moving the tests as well.
This code was originally moved out of distributed to consolidate the hdfs testing infrastructure. However, this has led to changes in distributed breaking tests here (see #147), which wasn't caught because these tests aren't run regularly. Since distributed-specific functionality has been removed from filesystem handling in dask/dask#3079, these tests could be moved into dask/dask.
I propose the following:
- The dask-specific filesystem is moved into dask/bytes/hdfs.py (only a couple of lines of code).
- The dask-specific tests are moved to dask/bytes/tests/test_hdfs.py. These are run as a daily cron job in the dask repo, but not on travis PRs. This could be modified to run if a keyword is in the PR name, which could then be used to run the hdfs tests if the change may affect hdfs handling. Travis provides configurable hooks for doing this (https://docs.travis-ci.com/user/conditional-builds-stages-jobs/).

The benefit of this is that the tests get run, and get run in a central location that has more eyes on it than dask/hdfs3. It also removes anything dask-specific from the hdfs3 repo. In the long run I anticipate the dask wrapper for this filesystem going away completely, which would make the effect of this just adding integration tests to the dask repo. Another benefit is that it makes adding dask integration tests with other hdfs libraries (e.g. pyarrow's libhdfs wrapper, https://github.com/dask/dask/issues/3046) easier, as testing that in this repo wouldn't make much sense.
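For illustration, the conditional-build idea could be sketched roughly as follows in .travis.yml. The keyword, stage layout, and test path here are assumptions, not an existing configuration:

```yaml
# Hypothetical sketch: run the hdfs tests only on cron builds, or when
# the commit message opts in with a keyword. All names are illustrative.
jobs:
  include:
    - stage: test
      if: type = cron OR commit_message =~ /test-hdfs/
      script: py.test dask/bytes/tests/test_hdfs.py
```

Travis evaluates the `if:` condition per job, so ordinary PR builds would skip the hdfs job entirely unless the keyword is present.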
An alternative suggestion would be to create an external repo (dask/dask-integration-tests?) that includes tests such as these. This might seem more appealing at first, but has the following drawbacks:
- There is asynchronous behavior between changes to dask/dask, dask/distributed, and any other relevant libraries and the actual integration tests failing. Ideally your code would fail in the PR, not later in the daily cron job.
- The tests live in an external repo that would have to be modified as changes in these libraries occur.
- Fewer eyes would be on that repo than on dask/dask, which has a steady stream of contributors.

As such, I believe that an external testing repo would be less beneficial than having a conditional build inside dask proper.
cc @mrocklin, @martindurant, @TomAugspurger for thoughts.
I'm happy to defer to @jcrist's opinion here. I think that he has the most hands-on experience here.
Yes, it does become much cleaner now that we guarantee that we are passing through only paths.
Do you think there's a way to keep the offsets/blocks code around somewhere in the codebase? It seems like a useful nugget, even if distributed isn't using it at the moment.
I am -1 on a separate integration-test repo; it would too easily become moribund, as people's attention will be elsewhere. Also, it would be unclear just how much testing should be pulled out of dask.
On moving these tests, I am uncertain. It is easier to install dask into a hadoop-running container than to bolt hdfs onto dask's tests - but I do appreciate your point that this specific code can be argued to belong more to dask than to hdfs3. Would we, though, want to pull in any tests from other backends? We have s3, with its convenient mock server, but other things may prove more troublesome.
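One lightweight way to keep hdfs tests from breaking environments that lack a cluster is to probe for one before running them. A minimal sketch, assuming a namenode on the default port (the host, port, and function name are all illustrative, not dask's actual test setup):

```python
import socket

def hdfs_reachable(host="localhost", port=8020, timeout=0.5):
    """Return True if something is listening on the (assumed) namenode port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A test module could then skip itself when no cluster is available,
# e.g. via pytest.mark.skipif(not hdfs_reachable(), reason="no HDFS").
```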
Do you think there's a way to keep the offsets/blocks code around somewhere in the codebase? It seems like a useful nugget, even if distributed isn't using it at the moment.
The offsets/blocks code in the dask wrapper was just a small wrapper around the per-file one in hdfs.core, mapping across multiple files. If you suspect people might want to use it, I'll defer to your judgement on where it should move to. I highly doubt anyone is currently using it, since we don't really expose the dask wrapper class (and moved it from distributed to here only a short time ago). If your worry is more that the code snippet is useful and we might want it back, then I'd suggest that we can always access the deleted code via version control to pull it back in.
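For reference, the general shape of that offsets/blocks idea is small enough to sketch. This is not hdfs3's actual API - the function name and signature are illustrative:

```python
def block_offsets(file_sizes, block_size):
    """For each file, return the byte offsets at which blocks begin.

    This mirrors mapping a per-file block computation across multiple
    files; block_size stands in for what HDFS would report per block.
    """
    return [list(range(0, size, block_size)) for size in file_sizes]

# Two files of 300 and 120 bytes with a 128-byte block size:
offsets = block_offsets([300, 120], 128)
# offsets == [[0, 128, 256], [0]]
```

The real code additionally asks the filesystem for the actual block locations rather than assuming a uniform block size.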
Would we, though, want to pull in any tests from other backends?
HDFS is a bit different than other backends in that there are multiple drivers for it, and multiple libraries interacting with it. Having hdfs tests in dask proper makes sense rather than having individual dask tests littered through each smaller library. For other backends I think having the dask-related testing in the library is fine.
I plan to go forward with this move. I'll push a PR to dask, then move this PR here back to just deleting the obsolete tests. I think we can just delete all of the dask-related stuff here (since we really only exposed this to dask), but I will defer to you if you think a deprecation will be needed.
The only reason for a deprecation would be version mismatches, when people update one library but not the other.
Agreed. In the past we have done breaking changes that required updating everything in sync, but I'm not sure that's required here. Up to you.
I've pushed a new PR deprecating the dask integration here. This is only useful if you're running old dask and new hdfs3, which I'd say is unlikely.
I like that you have a test to make sure that the deprecation is happening.
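The pattern being tested is roughly the following. This is a minimal sketch - the class name and warning text are assumptions, not hdfs3's actual code:

```python
import warnings

class DaskHdfsWrapper:
    """Stand-in for the deprecated dask wrapper class."""
    def __init__(self):
        warnings.warn(
            "The dask integration in hdfs3 is deprecated; "
            "use dask's own hdfs support instead.",
            FutureWarning, stacklevel=2,
        )

# The test asserts that instantiating the wrapper emits the warning:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    DaskHdfsWrapper()
assert any(issubclass(w.category, FutureWarning) for w in caught)
```

Recording the warnings and asserting on them keeps the test robust even if other warnings are emitted during import.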
This PR does 3 things (in 3 separate commits):
- Deprecates the dask integration here.
- Stops using get_block_locations (removed from dask in https://github.com/dask/dask/pull/3079).
- Removes the get_block_locations method on the dask filesystem interface, as it's no longer used.

Note that this makes the development version of hdfs3 incompatible with older versions of dask. If this is a problem, the last 2 commits may be removed.