Closed: superbobry closed this issue 6 years ago
Hey @martindurant, I'm thinking to fix this as follows:

- stop prepending `"hdfs://"` at line 353;
- require `hdfs_home` to have a non-empty schema in the `Knit` constructor (see the sketch below);
- drop `triple_slash`, because it assumes the schema is always `hdfs://`, which is not the case when using ViewFS;
- replace the `startswith("hdfs://")` check with a "schema is not empty" check.

What do you think?
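A rough sketch of the constructor check, assuming Python 3's `urllib.parse` (class body trimmed, names hypothetical):

```python
from urllib.parse import urlparse  # urlparse.urlparse on Python 2

class Knit:
    def __init__(self, hdfs_home, **kwargs):
        # Hypothetical validation: insist on an explicit schema so that
        # viewfs:// homes are accepted alongside hdfs://.
        if not urlparse(hdfs_home).scheme:
            raise ValueError(
                "hdfs_home must carry a schema, e.g. 'hdfs://...' or "
                "'viewfs://...', got %r" % (hdfs_home,))
        self.hdfs_home = hdfs_home
```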
I am uncertain. Here are some thoughts:

- `hdfs_home` is implicitly a path in HDFS, so I would say that a path without a protocol should be acceptable; since `'hdfs://'` will likely still be dominant, I would prepend it to a path not having any protocol (which would replace the current use of `join`; sketched below); but an error would be OK too.
- `triple_slash` is required because paths in HDFS must always be absolute (the concept of `cwd` is ambiguous), but users often don't know about this. I don't know if that's the case for other protocols, but probably we need to provide the same function for whichever protocol is in use.
- Can the FS protocol be inferred from the config (or inferred from the value of `hdfs_home` passed), and assumed to be the same everywhere? Should it be a parameter that is applied to all paths?
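For the first bullet, the prepend-if-missing behaviour might look roughly like this (helper name is made up):

```python
from urllib.parse import urlparse

def with_default_scheme(path, default="hdfs://"):
    """Prepend `default` only when `path` has no schema of its own."""
    # "viewfs://cluster/user/me" passes through untouched;
    # "/user/me" becomes "hdfs:///user/me" (an empty authority means
    # the default namenode from the Hadoop configuration).
    if urlparse(path).scheme:
        return path
    return default + path
```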
cc @jcrist
PS: I expect that all the code where `hdfs3` tests for the existence of files is useless in the viewFS case. That would remain true if swapped for arrow/libhdfs (native); but presumably we can use Scala code to check for files?
Thanks for the in-depth reply @martindurant, and apologies for the delay on my side.
> in a number of places, the existence of the protocol is used to differentiate between a local file (needing uploading) and a file on HDFS. As you say, we would need to check for any schema instead. Does that mean we should be doing URI parsing?
Yes, the solution I was planning is based on `urlparse`, which treats the file as local if it has no schema and remote otherwise (even if the schema is `file://`).
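A minimal sketch of that check (helper name is made up; knit's actual code may differ):

```python
from urllib.parse import urlparse

def is_local(path):
    # No schema ("/tmp/env.zip") -> a local file that needs uploading;
    # any schema ("hdfs://...", "viewfs://...", even "file://...") ->
    # treated as remote.
    return urlparse(path).scheme == ""
```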
> `hdfs_home` is implicitly a path in HDFS, so I would say that a path without a protocol should be acceptable; since `'hdfs://'` will likely still be dominant, I would prepend it to a path not having any protocol (which would replace the current use of `join`); but an error would be OK too.
Accepting a path without a schema is fine with me, as long as the code does not prepend `hdfs://` when a schema is already present, and this behaviour is documented.
> `triple_slash` is required because paths in HDFS must always be absolute (the concept of `cwd` is ambiguous), but users often don't know about this. I don't know if that's the case for other protocols, but probably we need to provide the same function for whichever protocol is in use.
Well, I see two possible options here: either force the users to use `///` in all HDFS paths, or implement the same semantics as other HDFS client libraries and tools, i.e. support `//` and relative paths. The latter is probably best done on the application master side, since it has access to the HDFS libraries.
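If the check stays on the Python side, a schema-agnostic variant of `triple_slash` could look roughly like this (a sketch only; it does not resolve relative paths against the user's home directory the way real HDFS clients do):

```python
from urllib.parse import urlparse, urlunparse

def ensure_absolute(uri):
    # Force the path component to be absolute for whichever schema is
    # in use (hdfs://, viewfs://, ...) rather than special-casing
    # hdfs://. Note that in "hdfs://user/me" the "user" part is parsed
    # as the authority, which is exactly why /// matters.
    parts = urlparse(uri)
    if parts.path and not parts.path.startswith("/"):
        parts = parts._replace(path="/" + parts.path)
    return urlunparse(parts)
```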
> Can the FS protocol be inferred from the config (or inferred from the value of `hdfs_home` passed), and assumed to be the same everywhere? Should it be a parameter that is applied to all paths?
I think it is specified in `core-site.xml`, see this SO discussion.
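For reference, the relevant property is `fs.defaultFS` (`fs.default.name` in older Hadoop versions); the cluster name below is made up:

```xml
<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://myCluster</value>
</property>
```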
> I expect that all the code where hdfs3 tests for the existence of files is useless in the viewFS case. [...]

Do you mean that `hdfs3` does not support federation?
I'll just answer the one question which is the easiest for now.
> I expect that all the code where hdfs3 tests for the existence of files is useless in the viewFS case. [...]
>
> Do you mean that hdfs3 does not support federation?
`hdfs3` should support viewFS, if it follows the same protocol - I don't know, it's not been tested (your help appreciated!). However, within `knit`, `hdfs3` is used to check for the existence of files in HDFS, to avoid unnecessary uploads. I guess this will break for the alternative URI format.
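Conceptually the check is just the following (helper name is made up, not knit's actual code):

```python
from hdfs3 import HDFileSystem

def maybe_upload(hdfs, local_path, remote_path):
    # Skip the upload when the file is already in HDFS: this is the
    # optimization that an unexpected URI format could break.
    if not hdfs.exists(remote_path):
        hdfs.put(local_path, remote_path)

# hdfs = HDFileSystem(host="namenode", port=8020)
# maybe_upload(hdfs, "env.zip", "/user/me/env.zip")
```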
I wanted to try `hdfs3` on our cluster but probably hit a `libhdfs3` limitation (see dask/hdfs3#151).
@jcrist, did you have any plan to work on this kind of thing in the context of your YARN work? From the above, we could replace `hdfs3` with `pyarrow.hdfs` for `knit`.
> did you have any plan to work on this kind of thing in the context of your YARN work?
Skein uses the Java HDFS libraries directly, and supports the `hdfs://` prefix for all paths. No parsing of the paths or Hadoop configuration is done by our code; we just rely on the Hadoop libraries to do this.
Knit is being superseded by Skein (https://github.com/jcrist/skein). The new library is much more resilient to different Hadoop configurations, and more flexible for deploying custom applications. If your intent is to deploy Dask on YARN, dask-yarn (http://dask-yarn.readthedocs.io/) has been rewritten to use Skein instead. Closing.
https://github.com/dask/knit/blob/24a5e33ab5d6b4235cadf79f7e4244eb9935e9b4/knit/core.py#L353
This fragment looks like a bug to me.
`hdfs_home` is supposed to already carry the `hdfs://` prefix, therefore here it would be added twice.
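An illustration of the suspected behaviour (values made up):

```python
hdfs_home = "hdfs://nn:8020/user/me"   # already carries the prefix
path = "hdfs://" + hdfs_home           # "hdfs://hdfs://nn:8020/user/me"
```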