commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License

Can not run server_count example on Windows locally #28

Closed: brand17 closed this issue 3 years ago

brand17 commented 3 years ago

I tried to call:

$SPARK_HOME/bin/spark-submit ./server_count.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt servernames

But I am getting the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o55.saveAsTable

I installed Hadoop 3.0.0 from here https://github.com/steveloughran/winutils.

I am running this on Windows 7, Python 3.6.6 (64-bit), Java 8.

Full log is:

21/03/16 14:49:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/16 14:49:08 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
Traceback (most recent call last):
  File "server_count.py", line 46, in <module>
    job.run()
  File "C:\Users\FA.PROJECTOR-MSK\Google Диск\Colab Notebooks\Finance\cc-pyspark\sparkcc.py", line 152, in run
    self.run_job(sc, sqlc)
  File "C:\Users\FA.PROJECTOR-MSK\Google Диск\Colab Notebooks\Finance\cc-pyspark\sparkcc.py", line 187, in run_job
    .saveAsTable(self.args.output)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\pyspark\sql\readwriter.py", line 1158, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\py4j\java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
    return f(*a, **kw)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o56.saveAsTable.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
    at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:678)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:356)
    at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:753)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:731)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:626)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

sebastian-nagel commented 3 years ago

Hi @brand17, the reason seems to be the "UnsatisfiedLinkError":

py4j.protocol.Py4JJavaError: An error occurred while calling o56.saveAsTable.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

This means that Hadoop isn't properly set up and the native libraries (*.dll) are not found or fail to load. Are the required environment variables (SPARK_HOME, HADOOP_HOME, JAVA_HOME) set, and are the paths to the executables and binaries on PATH? It takes quite a few steps to get Hadoop and Spark running on Windows, but there are many tutorials available on the web which describe the setup. Which one were you following?
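For a quick sanity check, a minimal diagnostic sketch (not part of cc-pyspark; the file names assume the standard winutils layout, i.e. winutils.exe and hadoop.dll under %HADOOP_HOME%\bin):

import os
from pathlib import Path

# Print the variables a Windows Spark/Hadoop setup usually needs.
for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))

# Check for the native files the UnsatisfiedLinkError points to
# (assumption: standard winutils layout under HADOOP_HOME\bin).
bin_dir = Path(os.environ.get("HADOOP_HOME", "")) / "bin"
for name in ("winutils.exe", "hadoop.dll"):
    path = bin_dir / name
    print(path, "->", "found" if path.is_file() else "NOT FOUND")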

brand17 commented 3 years ago

All these variables are correct. Do I need to add these folders to PATH? I followed the steps from https://github.com/commoncrawl/cc-pyspark

sebastian-nagel commented 3 years ago

The cc-pyspark README gives no instructions on how to install Spark. Spark itself is distributed in several packages, either bundling Hadoop or relying on an already installed Hadoop. Which one did you take? Also note that Steve Loughran recommends using the more regularly maintained https://github.com/cdarlint/winutils.

add these folders to PATH?

Yes, or more precisely: these folders have a subfolder "bin/" which needs to be on PATH. The winutils.exe must also be in one of the folders on PATH. But as said, there are detailed tutorials, e.g. https://phoenixnap.com/kb/install-spark-on-windows-10 (of course, you may want to upgrade to more recent versions of Spark and/or Hadoop).
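To verify the PATH lookup itself, a small sketch using Python's standard library (the executable names are assumptions for a typical Windows Spark setup):

import shutil

# shutil.which performs the same PATH search the OS does when
# the JVM tries to locate winutils.exe.
for exe in ("winutils.exe", "java.exe", "spark-submit.cmd"):
    print(exe, "->", shutil.which(exe) or "not found on PATH")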

brand17 commented 3 years ago

Thanks for your support. I followed https://phoenixnap.com/kb/install-spark-on-windows-10 and then called:

& $Env:SPARK_HOME/bin/spark-submit ./server_count.py --num_output_partitions 1 --log_level WARN ./input/test_warc.txt servernames

Now I am getting the error:

pyspark.sql.utils.AnalysisException: Can not create the managed table('servernames'). The associated location('file:/C:/Users/FA/Google%20Drive/Colab%20Notebooks/Finance/Stock%20prediction/cc-pyspark/spark-warehouse/servernames') already exists.

Actually, there is no such folder. When I run:

& $Env:SPARK_HOME/bin/pyspark
>>> df = sqlContext.read.parquet("spark-warehouse/servernames")

I am getting the error:

pyspark.sql.utils.AnalysisException: Path does not exist: file:/C:/Users/FA/Google Drive/Colab Notebooks/Finance/Stock prediction/cc-pyspark/spark-warehouse/servernames

I think it could be caused by incorrect input. It was not clear to me how to prepare the input on Windows, so I called get-data.sh from Cygwin. As a result I've got the following test_warc.txt:

file:/cygdrive/c/Users/FA.PROJECTOR-MSK/Google Диск/Colab Notebooks/Finance/stock prediction/cc-pyspark/crawl-data/CC-MAIN-2017-13/segments/1490218186353.38/warc/CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz

Is it correct?
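In case the Cygwin-style /cygdrive prefix is the problem, a minimal sketch to rewrite such lines into Windows-style file URIs (assuming Spark on Windows resolves file:/C:/... paths; the example path is shortened, and the helper is hypothetical, not part of cc-pyspark):

import re

def cygwin_to_file_uri(line):
    # Hypothetical helper: rewrite /cygdrive/c/... into file:/C:/...
    # Assumes plain drive-letter mounts; other lines are returned unchanged.
    m = re.match(r"^(?:file:)?/cygdrive/([A-Za-z])(/.*)$", line.strip())
    if m:
        return "file:/{}:{}".format(m.group(1).upper(), m.group(2))
    return line.strip()

print(cygwin_to_file_uri("file:/cygdrive/c/Users/FA/cc-pyspark/crawl-data/example.warc.gz"))
# -> file:/C:/Users/FA/cc-pyspark/crawl-data/example.warc.gz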

sebastian-nagel commented 3 years ago

The paths look slightly different because the white space is URL-encoded in one error message ("%20") and literal in the other. I cannot try which path variant will work (I'm on Linux). To eliminate potential errors I would make the base path as simple as possible: short, without any white space, and not a sub-folder of a directory with localized names.
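A minimal illustration of why white space in the base path is risky: pathlib URL-encodes it the same way as the %20 form in the error message above (the shortened path here is made up for the example):

from pathlib import PureWindowsPath

# Locations are recorded as URIs, so spaces become %20 and no longer
# match the literal path on disk. Illustrative, shortened path.
p = PureWindowsPath(r"C:\Users\FA\Google Drive\cc-pyspark\spark-warehouse\servernames")
print(p.as_uri())
# file:///C:/Users/FA/Google%20Drive/cc-pyspark/spark-warehouse/servernames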

brand17 commented 3 years ago

It works, thanks