crs4 / pydoop

A Python MapReduce and HDFS API for Hadoop
Apache License 2.0

No JAVA_HOME at run time makes Pydoop very slow #344

Open simleo opened 5 years ago

simleo commented 5 years ago

#338 added JAVA_HOME auto detection. That's convenient, especially at compile time, since it makes installation easier. It also lets Pydoop work with no JAVA_HOME set at run time, which is convenient too, but it turns out that things can be much slower in that case: running the entire unit test suite (minus the avro tests) with no JAVA_HOME is almost 5 times slower. HADOOP_HOME also has an effect, though not nearly as big. A quick comparison on my laptop gave 344s with both unset, 75s with JAVA_HOME set and 70s with both set.
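Until this is documented, a script can at least check its own environment before doing any Pydoop work. The following is a minimal sketch using only the standard library. The helper names `missing_runtime_vars` and `warn_if_unset` are hypothetical and not part of Pydoop.

```python
import os
import warnings

# Variables discussed in this issue; the helper names below are hypothetical.
RUNTIME_VARS = ("JAVA_HOME", "HADOOP_HOME")


def missing_runtime_vars(environ=None):
    """Return the names from RUNTIME_VARS that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in RUNTIME_VARS if not env.get(name)]


def warn_if_unset(environ=None):
    """Emit one warning listing the unset variables, and return them."""
    missing = missing_runtime_vars(environ)
    if missing:
        warnings.warn(
            "%s not set: falling back to auto detection, which can be "
            "much slower" % ", ".join(missing),
            RuntimeWarning,
            stacklevel=2,
        )
    return missing
```

Calling `warn_if_unset()` once at the top of a job script would flag the slow setup early, without changing how Pydoop itself behaves.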

Reviewing how we cache these variables (or fail to) might help, although not when several separate Python processes use Pydoop, since auto detection has to run at least once in each. We do need to document this properly though, so that users can make sure they have the most efficient run time setup.
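To illustrate the kind of caching meant here, this is a rough sketch and not Pydoop's actual code. `_slow_detect_java_home` is a hypothetical stand-in for the expensive detection step, and the `/usr/lib/jvm/default-java` path is only an assumed placeholder. The result is memoized per process and exported to `os.environ`, so child processes inherit it.

```python
import functools
import os


def _slow_detect_java_home():
    """Hypothetical stand-in for the expensive auto detection step."""
    # Assumed placeholder location; real detection would do much more work.
    return os.path.realpath("/usr/lib/jvm/default-java")


@functools.lru_cache(maxsize=None)
def java_home(detect=_slow_detect_java_home):
    """Return JAVA_HOME, running auto detection at most once per process."""
    value = os.environ.get("JAVA_HOME")
    if not value:
        value = detect()
        # Export the result so child processes started from here inherit it
        # and can skip detection themselves.
        os.environ["JAVA_HOME"] = value
    return value
```

Processes that are not descendants of the one that ran detection still pay the cost once each, which is why setting JAVA_HOME explicitly remains the most efficient setup.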