RevolutionAnalytics / RHadoop

RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki

better URI normalization #101

Closed piccolbo closed 12 years ago

piccolbo commented 12 years ago

In equijoin we need the ability to compare the current input with the arguments to equijoin. The way it's implemented now is brittle and makes assumptions that won't always be correct. The problem is that different URIs can be used to identify the same resource, which makes string comparison inadequate for identifying resources.

[<PROTOCOL>://[<HOST>[:<PORT>]]]/<PATH>

where [] stands for optional, and the nesting means that if the outer element is missing, the inner one is too
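To make the nesting rule concrete, here is a minimal sketch of a parser for the form above. It is written in Python for illustration only and is not the rmr code; the names `URI_PATTERN` and `parse_uri` and the exact character classes are assumptions.

```python
import re

# Hypothetical pattern for [<PROTOCOL>://[<HOST>[:<PORT>]]]/<PATH>.
# The nested optional groups mirror the nested brackets: a port can only
# appear if a host does, and a host only if a protocol does.
URI_PATTERN = re.compile(
    r"^(?:(?P<protocol>[A-Za-z][A-Za-z0-9+.-]*)://"   # optional protocol
    r"(?:(?P<host>[^/:]+)(?::(?P<port>\d+))?)?)?"     # optional host[:port]
    r"(?P<path>/.*)$"                                 # mandatory absolute path
)


def parse_uri(uri):
    """Split a URI into protocol, host, port and path (missing parts are None)."""
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError("not of the expected form: %r" % uri)
    return match.groupdict()


if __name__ == "__main__":
    for example in ("hdfs://namenode:8020/user/a", "maprfs:///user/a", "/user/a"):
        print(example, parse_uri(example))
```

All three example strings yield the same `path`, which is the part a comparison would have to rely on once protocol, host and port are filled in with defaults.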

This captures the current limitations and is an improvement only in terms of clarity, with the exception of including the maprfs protocol. Consider this a strawman. Comments are welcome.

piccolbo commented 12 years ago

Given that a) URIs don't seem to be particularly useful with MR code, as people always refer to the default FS, and b) the environment variables of interest are not consistent between distros and don't even follow standard URI syntax, we implemented a simple hack that strips each URI down to its path part and uses that with the default protocol, host and port. See 17d1af818b9bd8cb76ac19fd1430c821e92d9aa8.
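The idea of stripping a URI to its path can be sketched as follows. This is an illustration in Python, not the code from the commit above; `path_only` and `same_resource` are hypothetical names, and the sketch assumes that every URI refers to the default file system, so protocol, host and port can be dropped safely.

```python
from urllib.parse import urlsplit


def path_only(uri):
    """Drop protocol, host and port, keeping only the path part."""
    return urlsplit(uri).path


def same_resource(uri_a, uri_b):
    """Compare two URIs by path alone (assumes both live on the default FS)."""
    return path_only(uri_a) == path_only(uri_b)


if __name__ == "__main__":
    print(same_resource("hdfs://namenode:8020/user/a", "/user/a"))
    print(same_resource("/user/a", "/user/b"))
```

The trade-off is that two files with the same path on different file systems would compare as equal, which is acceptable only under the default-FS assumption stated above.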