Closed: superbobry closed this issue 6 years ago
Hey @martindurant, I'm thinking to fix this as follows:

- stop prepending `"hdfs://"` at line 353;
- require `hdfs_home` to have a non-empty schema in the `Knit` constructor (see the sketch below);
- drop `triple_slash`, because it assumes the schema is always `hdfs://`, which is not the case when using ViewFS;
- replace the `startswith("hdfs://")` check with a "schema is not empty" check.

What do you think?
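A rough sketch of the constructor check, assuming Python 3's `urllib.parse` (class body trimmed, names hypothetical):

```python
from urllib.parse import urlparse  # urlparse.urlparse on Python 2

class Knit:
    def __init__(self, hdfs_home, **kwargs):
        # Hypothetical validation: insist on an explicit schema so that
        # viewfs:// homes are accepted alongside hdfs://.
        if not urlparse(hdfs_home).scheme:
            raise ValueError(
                "hdfs_home must carry a schema, e.g. 'hdfs://...' or "
                "'viewfs://...', got %r" % (hdfs_home,))
        self.hdfs_home = hdfs_home
```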
I am uncertain. Here are some thoughts:

- `hdfs_home` is implicitly a path in HDFS, so I would say that a path without a protocol should be acceptable; since `'hdfs://'` will likely still be dominant, I would prepend it to a path not having any protocol (which would replace the current use of `join`; sketched below); but an error would be OK too.
- `triple_slash` is required because paths in HDFS must always be absolute (the concept of `cwd` is ambiguous), but users often don't know about this. I don't know if that's the case for other protocols, but probably we need to provide the same function for whichever protocol is in use.
- Can the FS protocol be inferred from the config (or inferred from the value of `hdfs_home` passed), and assumed to be the same everywhere? Should it be a parameter that is applied to all paths?
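For the first bullet, the prepend-if-missing behaviour might look roughly like this (helper name is made up):

```python
from urllib.parse import urlparse

def with_default_scheme(path, default="hdfs://"):
    """Prepend `default` only when `path` has no schema of its own."""
    # "viewfs://cluster/user/me" passes through untouched;
    # "/user/me" becomes "hdfs:///user/me" (an empty authority means
    # the default namenode from the Hadoop configuration).
    if urlparse(path).scheme:
        return path
    return default + path
```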
cc @jcrist
PS: I expect that all the code where `hdfs3` tests for the existence of files is useless in the viewFS case. That would remain true if swapped for arrow/libhdfs (native); but presumably we can use Scala code to check for files?
Thanks for the in-depth reply @martindurant, and apologies for the delay on my side.
> in a number of places, the existence of the protocol is used to differentiate between a local file (needing uploading) and a file on HDFS. As you say, we would need to check for any schema instead. Does that mean we should be doing URI parsing?
Yes, the solution I was planning is based on `urlparse`, which treats the file as local if it has no schema and remote otherwise (even if the schema is `file://`).
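A minimal sketch of that check (helper name is made up; knit's actual code may differ):

```python
from urllib.parse import urlparse

def is_local(path):
    # No schema ("/tmp/env.zip") -> a local file that needs uploading;
    # any schema ("hdfs://...", "viewfs://...", even "file://...") ->
    # treated as remote.
    return urlparse(path).scheme == ""
```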
> `hdfs_home` is implicitly a path in HDFS, so I would say that a path without a protocol should be acceptable; since `'hdfs://'` will likely still be dominant, I would prepend it to a path not having any protocol (which would replace the current use of `join`); but an error would be OK too.
Accepting a path without a schema is fine with me, as long as the code does not prepend `hdfs://` when a schema is already present, and this behaviour is documented.
> `triple_slash` is required because paths in HDFS must always be absolute (the concept of `cwd` is ambiguous), but users often don't know about this. I don't know if that's the case for other protocols, but probably we need to provide the same function for whichever protocol is in use.
Well, I see two possible options here: either force the users to use `///` in all HDFS paths, or implement the same semantics as other HDFS client libraries and tools, i.e. support `//` and relative paths. The latter is probably best done on the application master side, since it has access to the HDFS libraries.
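If the check stays on the Python side, a schema-agnostic variant of `triple_slash` could look roughly like this (a sketch only; it does not resolve relative paths against the user's home directory the way real HDFS clients do):

```python
from urllib.parse import urlparse, urlunparse

def ensure_absolute(uri):
    # Force the path component to be absolute for whichever schema is
    # in use (hdfs://, viewfs://, ...) rather than special-casing
    # hdfs://. Note that in "hdfs://user/me" the "user" part is parsed
    # as the authority, which is exactly why /// matters.
    parts = urlparse(uri)
    if parts.path and not parts.path.startswith("/"):
        parts = parts._replace(path="/" + parts.path)
    return urlunparse(parts)
```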
> Can the FS protocol be inferred from the config (or inferred from the value of `hdfs_home` passed), and assumed to be the same everywhere? Should it be a parameter that is applied to all paths?
I think it is specified in `core-site.xml`, see this SO discussion.
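For reference, the relevant property is `fs.defaultFS` (`fs.default.name` in older Hadoop versions); the cluster name below is made up:

```xml
<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://myCluster</value>
</property>
```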
> I expect that all the code where hdfs3 tests for the existence of files is useless in the viewFS case. [...]

Do you mean that `hdfs3` does not support federation?
I'll just answer the one question which is the easiest for now.
> I expect that all the code where hdfs3 tests for the existence of files is useless in the viewFS case. [...]
>
> Do you mean that hdfs3 does not support federation?
`hdfs3` should support viewFS, if it follows the same protocol - I don't know, it's not been tested (your help appreciated!). However, within `knit`, `hdfs3` is used to check for the existence of files in HDFS, to avoid unnecessary uploads. I guess this will break for the alternative URI format.
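Conceptually the check is just the following (helper name is made up, not knit's actual code):

```python
from hdfs3 import HDFileSystem

def maybe_upload(hdfs, local_path, remote_path):
    # Skip the upload when the file is already in HDFS: this is the
    # optimization that an unexpected URI format could break.
    if not hdfs.exists(remote_path):
        hdfs.put(local_path, remote_path)

# hdfs = HDFileSystem(host="namenode", port=8020)
# maybe_upload(hdfs, "env.zip", "/user/me/env.zip")
```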
I wanted to try `hdfs3` on our cluster but probably hit a `libhdfs3` limitation (see dask/hdfs3#151).
@jcrist, did you have any plan to work on this kind of thing in the context of your YARN work? From the above, we could replace `hdfs3` with `pyarrow.hdfs` for `knit`.
> did you have any plan to work on this kind of thing in the context of your YARN work?
Skein uses the Java HDFS libraries directly, and supports the `hdfs://` prefix for all paths. No parsing of the paths or Hadoop configuration is done by our code; we just rely on the Hadoop libraries to do this.
Knit is being superseded by Skein (https://github.com/jcrist/skein). The new library is much more resilient to different Hadoop configurations, and more flexible for deploying custom applications. If your intent is to deploy Dask on YARN, dask-yarn (http://dask-yarn.readthedocs.io/) has been rewritten to use Skein instead. Closing.
https://github.com/dask/knit/blob/24a5e33ab5d6b4235cadf79f7e4244eb9935e9b4/knit/core.py#L353
This fragment looks like a bug to me.
`hdfs_home` is supposed to already carry the `hdfs://` prefix, therefore here it would be added twice.
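An illustration of the suspected behaviour (values made up):

```python
hdfs_home = "hdfs://nn:8020/user/me"   # already carries the prefix
path = "hdfs://" + hdfs_home           # "hdfs://hdfs://nn:8020/user/me"
```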