crs4 / pydoop

A Python MapReduce and HDFS API for Hadoop
Apache License 2.0

Issue while importing pydoop inside pySpark map function #273

Closed snalanagula closed 6 years ago

snalanagula commented 6 years ago

I have a requirement to write to HDFS inside a map function, so I am shipping the pydoop.zip dependency module to all worker nodes using sc.addPyFile. However, when I try to import pydoop.hdfs, I get the error below.

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, PPHDPWORKR046XX.global.tesco.org): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.5.5.0-157/spark/python/pyspark/worker.py", line 111, in main
    process()
  File "/usr/hdp/2.5.5.0-157/spark/python/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.5.5.0-157/spark/python/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/hdp/2.5.5.0-157/spark/python/pyspark/rdd.py", line 1293, in takeUpToNumLeft
    yield next(iterator)
  File "<ipython-input-13-06c2725f43c3>", line 2, in func
  File "/tmp/pip-build-zvn9Um/pydoop/pydoop-1.2.0.data/platlib/pydoop/__init__.py", line 194, in <module>
  File "/tmp/pip-build-zvn9Um/pydoop/pydoop-1.2.0.data/platlib/pydoop/__init__.py", line 179, in read_properties
IOError: [Errno 20] Not a directory: '/data/1/hadoop/yarn/local/usercache/ex63/appcache/application_1516896450292_51969/container_e310_1516896450292_51969_01_000002/pydoop.zip/pydoop/pydoop.properties'

Steps followed to create pydoop.zip

pip install pydoop -t ./pydoop
cd pydoop
zip -r pydoop.zip pydoop 

Sample pyspark code in which I am trying to use pydoop inside map:

from pyspark import SparkContext, SparkConf
SparkContext.setSystemProperty('spark.executor.memory', '4g')
conf = SparkConf().setAppName("pydoop test")
sc=SparkContext(conf=conf)
sc.addPyFile("/home/ex63/pydoop/pydoop.zip")
rdd=sc.parallelize([(12,34,56,67),(34,56,87,354),(345,74,33,77), (453,56,73,56)],2 )
def func(rec):
    from pydoop import hdfs
    print (rec[0])
    print (hdfs.__file__)
    #hdfs.dump("hello", "/user/ex63/temp_{}.txt".format(rec[0]))

rdd.map(func).take(10)

Please help me to resolve this.

simleo commented 6 years ago

Hi, and thanks for reporting this.

Pydoop uses dynamic extension modules, so it's not importable from a zip archive. It should be importable from an egg (also supported by Spark), but this leads to the same error as above. I have just opened issue #276 for this and hope to get to it soon. In the meantime, since you most likely don't need properties anyway, you should be able to work around the problem as follows:

  1. Change pydoop/__init__.py so it does not break when properties are not found:
--- a/pydoop/__init__.py
+++ b/pydoop/__init__.py
@@ -179,9 +179,7 @@ def read_properties(fname):
         with open(fname) as f:
             parser.readfp(AddSectionWrapper(f))
     except IOError as e:
-        if e.errno != errno.ENOENT:
-            raise
-        return None  # compile time, prop file is not there
+        return {}
     return dict(parser.items(AddSectionWrapper.SEC_NAME))
  2. Try the following (on a worker node, or a machine with the same configuration):
git clone --branch 1.2.0 https://github.com/crs4/pydoop
cd pydoop
export HADOOP_HOME=/your/hadoop/home
export JAVA_HOME=/your/java/home
python setup.py build
python setup.py bdist_egg

You should end up with a pydoop-1.2.0-py2.7.egg (or similar) under dist/. Try passing this file to sc.addPyFile instead of the zip one.

snalanagula commented 6 years ago

Hi Simone,

Thanks for the reply and the workaround. I followed the steps provided and the import issue is resolved, but when I try to do operations with HDFS I run into the following:

>>> hdfs.ls('/insight_labs/rdf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.linux-x86_64/egg/pydoop/hdfs/__init__.py", line 312, in ls
  File "build/bdist.linux-x86_64/egg/pydoop/hdfs/__init__.py", line 291, in lsl
  File "build/bdist.linux-x86_64/egg/pydoop/hdfs/fs.py", line 150, in __init__
  File "build/bdist.linux-x86_64/egg/pydoop/hdfs/fs.py", line 64, in _get_connection_info
  File "build/bdist.linux-x86_64/egg/pydoop/hdfs/core/__init__.py", line 55, in core_hdfs_fs
RuntimeError: module not initialized, check that Pydoop is correctly installed

I have tried looking at other open issues, but no one seems to have built pydoop this way. Could you please help me resolve this?

[ex63@xxxxx pydoop]$ python -V
Python 2.7.12 :: Continuum Analytics, Inc.

It does print the Hadoop version and the Hadoop classpath:

>>> import pydoop
>>> pydoop.hadoop_version()
'2.7.3.2.5.5.0-157'
>>> os.environ['JAVA_HOME']
'/usr/java/jdk1.8.0_65'
>>> pydoop.hadoop_classpath()
'/usr/hdp/2.5.5.0-157/hadoop/hadoop-auth-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-common.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-common-2.7.3.2.5.5.0-157-tests.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-aws-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-azure-datalake-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-nfs-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-annotations-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-azure-datalake.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-aws.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-auth.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-common-tests.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-azure.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-annotations.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-azure-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-nfs.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop-common-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-cli-1.2.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jackson-core-2.2.3.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jettison-1.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/avro-1.7.4.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jets3t-0.9.0.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/hamcrest-core-1.3.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jersey-json-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/protobuf-java-2.5.0.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/guava-11.0.2.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-beanutils-core-1.8.0.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jetty-6.1.26.hwx.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/azure-storage-4.2.0.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jsr305-3.0.0.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-lang3-3.4.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jcip-annotations-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jaxb-impl-2.2.3-1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-io-2.4.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-compress-1.4.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/snappy-java-1.0.4.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/azure-keyvault-core-0.8.0.jar:/usr/hdp/2.5.5.0-157
/hadoop/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/nimbus-jose-jwt-3.9.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/aws-java-sdk-kms-1.10.6.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jsp-api-2.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-logging-1.1.3.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/junit-4.11.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-configuration-1.6.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/api-asn1-api-1.0.0-M20.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/slf4j-log4j12-1.7.10.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/ranger-plugin-classloader-0.6.0.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/curator-client-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/paranamer-2.3.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/curator-framework-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/ojdbc6.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/ranger-hdfs-plugin-shim-0.6.0.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-lang-2.6.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jersey-core-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jetty-util-6.1.26.hwx.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/httpcore-4.4.4.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-digester-1.8.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/curator-recipes-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/java-xmlbuilder-0.4.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-math3-3.1.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/activation-1.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/netty-3.6.2.Final.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/xmlenc-0.52.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/stax-api-1.0-2.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/zookeeper-3.4.6.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/asm-3.2.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jackson-databind-2.2.3.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/api-util-1.0.0-M20.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jsch-0.1.54.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-collections-3.2.2.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jackson-xc-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/servlet-api-2.5.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/json-smart-1.1.1.jar
:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-codec-1.4.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jackson-annotations-2.2.3.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/log4j-1.2.17.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/slf4j-api-1.7.10.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jersey-server-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/httpclient-4.5.2.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/aws-java-sdk-s3-1.10.6.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/apacheds-i18n-2.0.0-M15.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-beanutils-1.7.0.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/xz-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/commons-net-3.1.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jackson-core-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/ranger-yarn-plugin-shim-0.6.0.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jaxb-api-2.2.2.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/gson-2.2.4.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/mockito-all-1.8.5.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/aws-java-sdk-core-1.10.6.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jackson-jaxrs-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/jackson-mapper-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop/lib/joda-time-2.8.1.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/hadoop-hdfs-nfs-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/hadoop-hdfs-tests.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/hadoop-hdfs.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/hadoop-hdfs-2.7.3.2.5.5.0-157-tests.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/hadoop-hdfs-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/hadoop-hdfs-nfs.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/commons-cli-1.2.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/protobuf-java-2.5.0.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/guava-11.0.2.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/jetty-6.1.26.hwx.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/leveldbjni-all-1.8.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/jsr305-3.0.0.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/xercesImpl-2.9.1.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/okhttp
-2.4.0.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/commons-io-2.4.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/commons-logging-1.1.3.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/commons-lang-2.6.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/commons-daemon-1.0.13.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/jersey-core-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/jetty-util-6.1.26.hwx.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/netty-3.6.2.Final.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/xmlenc-0.52.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/asm-3.2.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/netty-all-4.0.23.Final.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/servlet-api-2.5.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/commons-codec-1.4.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/log4j-1.2.17.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/jersey-server-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/xml-apis-1.3.04.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/jackson-core-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/jackson-mapper-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-hdfs/lib/okio-1.4.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-timeline-pluginstorage.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-applications-distributedshell-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-tests.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-common.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-common-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-resourcemanager.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-client-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-applicationhistoryservice.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-client.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-sharedcachemanager-2.7.3.2.5.5.0-157.jar:/usr/hdp/2
.5.5.0-157/hadoop-yarn/hadoop-yarn-api.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-applicationhistoryservice-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-common-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-timeline-pluginstorage-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-web-proxy.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-applications-unmanaged-am-launcher.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-sharedcachemanager.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-common.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-registry.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-tests-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-registry-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-nodemanager-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-nodemanager.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-api-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-web-proxy-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/hadoop-yarn-server-resourcemanager-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-cli-1.2.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jackson-core-2.2.3.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jettison-1.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/avro-1.7.4.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jets3t-0.9.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jersey-json-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/protobuf-java-2.5.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/guava-11.0.2.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jersey-client-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-beanutils-core-1.8.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jetty-6.1.26.hwx.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/azure-storage-4.2.0.jar:/usr/hdp/
2.5.5.0-157/hadoop-yarn/lib/leveldbjni-all-1.8.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jsr305-3.0.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/zookeeper-3.4.6.2.5.5.0-157-tests.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-lang3-3.4.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jcip-annotations-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jaxb-impl-2.2.3-1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-io-2.4.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-compress-1.4.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/snappy-java-1.0.4.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/azure-keyvault-core-0.8.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/nimbus-jose-jwt-3.9.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/metrics-core-3.0.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jsp-api-2.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-logging-1.1.3.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-configuration-1.6.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/api-asn1-api-1.0.0-M20.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/curator-client-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/paranamer-2.3.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/curator-framework-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/aopalliance-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-lang-2.6.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jersey-core-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jetty-util-6.1.26.hwx.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/httpcore-4.4.4.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-digester-1.8.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/curator-recipes-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/java-xmlbuilder-0.4.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-math3-3.1.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/activation-1.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/netty-3.6.2.Final.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/objenesis-2.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/xmlenc-0.52.jar:/usr/h
dp/2.5.5.0-157/hadoop-yarn/lib/stax-api-1.0-2.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/zookeeper-3.4.6.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/fst-2.24.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/asm-3.2.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jackson-databind-2.2.3.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/api-util-1.0.0-M20.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jsch-0.1.54.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-collections-3.2.2.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/guice-servlet-3.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jackson-xc-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/servlet-api-2.5.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/json-smart-1.1.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-codec-1.4.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jackson-annotations-2.2.3.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/log4j-1.2.17.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jersey-server-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/httpclient-4.5.2.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/guice-3.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/apacheds-i18n-2.0.0-M15.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-beanutils-1.7.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/javax.inject-1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/xz-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/commons-net-3.1.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jackson-core-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jersey-guice-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jaxb-api-2.2.2.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/gson-2.2.4.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/javassist-3.18.1-GA.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jackson-jaxrs-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-yarn/lib/jackson-mapper-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-cli-1.2.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-auth-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jettiso
n-1.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/avro-1.7.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jets3t-0.9.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hamcrest-core-1.3.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-ant.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jersey-json-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/protobuf-java-2.5.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-extras.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-gridmix.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/guava-11.0.2.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-hs.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-openstack-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-beanutils-core-1.8.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jetty-6.1.26.hwx.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jsr305-3.0.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/azure-data-lake-store-sdk-2.1.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-lang3-3.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-sls.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jcip-annotations-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.5.5.0-157-tests.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/okhttp-2.4.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-app-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-openstack.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jaxb-impl-2.2.3-1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-hs-plugins.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-io-2.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-compress-1.4.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-datajoin.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/snap
py-java-1.0.4.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-rumen.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/azure-keyvault-core-0.8.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/nimbus-jose-jwt-3.9.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/metrics-core-3.0.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jsp-api-2.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-rumen-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-app.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-logging-1.1.3.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/junit-4.11.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-jobclient.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-shuffle-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-hs-plugins-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-configuration-1.6.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-auth.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/api-asn1-api-1.0.0-M20.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-streaming.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-core.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/curator-client-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/paranamer-2.3.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-common-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/curator-framework-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-gridmix-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-distcp-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-lang-2.6.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jersey-core-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-extras-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jetty-util-6.1.26.hwx.jar:/u
sr/hdp/2.5.5.0-157/hadoop-mapreduce/httpcore-4.4.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-digester-1.8.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/curator-recipes-2.7.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/java-xmlbuilder-0.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-math3-3.1.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-sls-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/activation-1.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/netty-3.6.2.Final.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/xmlenc-0.52.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/stax-api-1.0-2.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/zookeeper-3.4.6.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/asm-3.2.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/api-util-1.0.0-M20.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-archives-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jsch-0.1.54.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-collections-3.2.2.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-datajoin-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-shuffle.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-distcp.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-ant-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jackson-xc-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/servlet-api-2.5.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/json-smart-1.1.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-codec-1.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-archives.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/log4j-1.2.17.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jersey-server-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/httpclient-4.5.2.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/apacheds-i18n-2.0.0-M15.jar:/usr/hdp/2.5.5.0-157/had
oop-mapreduce/hadoop-mapreduce-client-hs-2.7.3.2.5.5.0-157.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-beanutils-1.7.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/xz-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-net-3.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jackson-core-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jaxb-api-2.2.2.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/commons-httpclient-3.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/gson-2.2.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/mockito-all-1.8.5.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jackson-jaxrs-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/jackson-mapper-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/okio-1.4.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/hadoop-mapreduce-examples.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/avro-1.7.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/hamcrest-core-1.3.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/protobuf-java-2.5.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/leveldbjni-all-1.8.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/commons-io-2.4.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/commons-compress-1.4.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/snappy-java-1.0.4.1.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/junit-4.11.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/paranamer-2.3.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/aopalliance-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/jersey-core-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/netty-3.6.2.Final.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/asm-3.2.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/guice-servlet-3.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/log4j-1.2.17.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/jersey-server-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/guice-3.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/javax.inject-1.jar:/usr/hdp/2.5.5.0-157/hadoop-m
apreduce/lib/xz-1.0.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/jackson-core-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/jersey-guice-1.9.jar:/usr/hdp/2.5.5.0-157/hadoop-mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/usr/hdp/2.5.5.0-157/hadoop/hadoop/lib/native:/usr/hdp/2.5.5.0-157/hadoop/etc/hadoop'
simleo commented 6 years ago

Hi,

It looks like it's not at all straightforward to make a Python package that includes native extensions importable from an egg or other archive. However, I've just tried your pyspark sample code and it works for me if I pass the unzipped installation dir to addFile with recursive set to True. For instance:

cd /tmp
pip install pydoop -t .

And in the pyspark code:

from pyspark import SparkContext, SparkConf

SparkContext.setSystemProperty('spark.executor.memory', '4g')
conf = SparkConf().setAppName("pydoop test")
sc = SparkContext(conf=conf)
sc.addFile("/tmp/pydoop", recursive=True)
rdd = sc.parallelize([
    (12, 34, 56, 67),
    (34, 56, 87, 354),
    (345, 74, 33, 77),
    (453, 56, 73, 56)
], 2)

def func(rec):
    import sys
    from pyspark import SparkFiles
    sys.path.insert(0, SparkFiles.get("pydoop"))
    from pydoop import hdfs
    hdfs.dump("hello", "/user/root/temp_{}.txt".format(rec[0]))

rdd.map(func).take(10)

Note that you need to manually alter sys.path, since addFile does not take care of that. I'm using pyspark 2.2.1.
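
The sys.path mechanics can be checked independently of Spark. This minimal sketch simulates what the worker function has to do: the shipped directory contains the package, and importing only works after that directory is prepended to sys.path (mypkg and all paths here are illustrative stand-ins, not part of the original code):

```python
import os
import sys
import tempfile

# Stand-in for the directory returned by SparkFiles.get("pydoop"):
# it contains the package directory itself.
dest = tempfile.mkdtemp()
pkg = os.path.join(dest, "mypkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

# This is the step addFile does NOT do for you: the worker must add the
# shipped directory to sys.path before importing.
sys.path.insert(0, dest)

import mypkg
print(mypkg.GREETING)  # hello
```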

snalanagula commented 6 years ago

Thanks Simone. The sc.addFile option with recursive is working; I have tested this code in my local VM. Unfortunately, the cluster where I need this runs Spark 1.6.3, whose addFile method does not have the recursive parameter.

Regards, Srinivas

simleo commented 6 years ago

Hi,

I believe you can still make it work with the older Spark version. Build pydoop.zip as in the original post, add it with sc.addFile("/your/path/to/pydoop.zip"), then you can unpack it on the fly in the worker's code with something like this:

def func(rec):
    import sys
    import zipfile
    import tempfile
    from pyspark import SparkFiles
    zip_fn = SparkFiles.get("pydoop.zip")
    d = tempfile.mkdtemp()
    with zipfile.ZipFile(zip_fn, 'r') as zipf:
        zipf.extractall(d)
    sys.path.insert(0, d)
    from pydoop import hdfs
    hdfs.dump("hello", "/user/root/temp_{}.txt".format(rec[0]))
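
Since func runs once per record, extracting the archive on every call is wasteful. One possible refinement (a sketch, not part of the original suggestion) is a module-level guard so each Python worker process extracts the zip at most once; ensure_on_path and the toy package are illustrative names:

```python
import os
import sys
import tempfile
import zipfile

_extracted = {}  # zip path -> extraction dir, shared within the worker process


def ensure_on_path(zip_fn):
    """Extract zip_fn once per process and put the result on sys.path."""
    d = _extracted.get(zip_fn)
    if d is None:
        d = tempfile.mkdtemp()
        with zipfile.ZipFile(zip_fn) as zf:
            zf.extractall(d)
        _extracted[zip_fn] = d
    if d not in sys.path:
        sys.path.insert(0, d)
    return d


# Toy demonstration: build a small zip, call twice, observe a single
# extraction dir (in Spark, zip_fn would come from SparkFiles.get).
fd, zip_fn = tempfile.mkstemp(suffix=".zip")
os.close(fd)
with zipfile.ZipFile(zip_fn, "w") as zf:
    zf.writestr("toypkg/__init__.py", "X = 42\n")

d1 = ensure_on_path(zip_fn)
d2 = ensure_on_path(zip_fn)
print(d1 == d2)  # True

import toypkg
print(toypkg.X)  # 42
```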
snalanagula commented 6 years ago

Hi, thanks for this solution, it is working perfectly fine. The package on PyPI seems to be a very old version (py3compat and other recent fixes are not there). Any plan to upload the latest package to PyPI? I tried cloning pydoop from git and building a package with python setup.py bdist --format=zip, which created pydoop-2.0a0.linux-x86_64.zip. When I unzip it, I see the pydoop folder under the path opt/anaconda2/lib/python2.7/site-packages/. To make this available within map I have to manually zip the pydoop folder from that path. Is there an alternative to python setup.py for creating a pydoop.zip (importable after unzipping)?

Regards, Srinivas

simleo commented 6 years ago

I'm going to make an alpha release pretty soon. You should be able to avoid the zip-unzip round trip by simply zipping the contents of build/lib/pydoop after running python setup.py bdist.
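
The zipping step can be scripted so the archive root contains the package directly (no site-packages prefix), which is what the extract-then-import workaround expects. This sketch simulates the layout with a toy build tree; in practice the source directory would be build/lib after running python setup.py build, and all paths here are illustrative:

```python
import os
import tempfile
import zipfile

# Toy stand-in for the real build tree: build/lib/pydoop/...
root = tempfile.mkdtemp()
lib_dir = os.path.join(root, "build", "lib")
pkg = os.path.join(lib_dir, "pydoop")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("VERSION = '2.0a0'\n")

# Zip everything under build/lib, storing paths relative to it so the
# archive contains "pydoop/__init__.py" rather than "build/lib/pydoop/...".
archive = os.path.join(root, "pydoop.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for dirpath, _, filenames in os.walk(lib_dir):
        for name in filenames:
            full = os.path.join(dirpath, name)
            zf.write(full, os.path.relpath(full, lib_dir))

with zipfile.ZipFile(archive) as zf:
    print(zf.namelist())  # ['pydoop/__init__.py']
```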