Hi @brand17, the reason seems to be the "UnsatisfiedLinkError":
py4j.protocol.Py4JJavaError: An error occurred while calling o56.saveAsTable.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
This means that Hadoop isn't properly set up and the native libraries (*.dll) are not found or fail to load. Are the required environment variables (SPARK_HOME, HADOOP_HOME, JAVA_HOME) and the paths to search for executables and binaries properly set up? It's quite a few steps to get Hadoop and Spark running on Windows, but there are many tutorials on the web which describe the setup. Which one were you following?
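For example, setting the variables from PowerShell could look like this (a sketch only; the install paths below are assumptions, so adjust them to wherever you actually unpacked Spark, Hadoop and the JDK):
[Environment]::SetEnvironmentVariable('SPARK_HOME', 'C:\spark\spark-2.4.0-bin-hadoop2.7', 'User')  # assumed Spark location
[Environment]::SetEnvironmentVariable('HADOOP_HOME', 'C:\hadoop', 'User')  # assumed Hadoop/winutils location
[Environment]::SetEnvironmentVariable('JAVA_HOME', 'C:\Program Files\Java\jdk1.8.0_201', 'User')  # assumed JDK 8 location
Open a new shell afterwards so the persisted variables are picked up.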
All these variables are correct. Do I need to add these folders to PATH? I followed the steps from https://github.com/commoncrawl/cc-pyspark
There are no instructions in the cc-pyspark README on how to install Spark. Spark itself is distributed in different packages, either bundled with Hadoop or built for an already installed Hadoop (e.g. spark-2.4.0-bin-hadoop2.7.tgz vs. spark-2.4.0-bin-without-hadoop.tgz). Which one did you take? Also note that Steve Loughran recommends using the more regularly maintained https://github.com/cdarlint/winutils.
add these folders to PATH?
Yes; more precisely, these folders have a subfolder "bin/" which needs to be on PATH. Also the winutils.exe must be in one of the folders on PATH. But as said, there are detailed tutorials, e.g. https://phoenixnap.com/kb/install-spark-on-windows-10 (of course, you may want to upgrade to more recent versions of Spark and/or Hadoop).
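A minimal PowerShell sketch of that step (assuming the three variables from above are already set and winutils.exe sits in %HADOOP_HOME%\bin):
# Append the bin/ subfolders to the user PATH; takes effect in newly opened shells
$path = [Environment]::GetEnvironmentVariable('Path', 'User')
$path += ";$Env:SPARK_HOME\bin;$Env:HADOOP_HOME\bin;$Env:JAVA_HOME\bin"
[Environment]::SetEnvironmentVariable('Path', $path, 'User')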
Thanks for your support. I followed https://phoenixnap.com/kb/install-spark-on-windows-10 and then called:
& $Env:SPARK_HOME/bin/spark-submit ./server_count.py --num_output_partitions 1 --log_level WARN ./input/test_warc.txt servernames
Now I am getting the error:
pyspark.sql.utils.AnalysisException: Can not create the managed table('servernames'). The associated location('file:/C:/Users/FA/Google%20Drive/Colab%20Notebooks/Finance/Stock%20prediction/cc-pyspark/spark-warehouse/servernames') already exists.
Actually there is no such folder. When I run:
& $Env:SPARK_HOME/bin/pyspark
>>> df = sqlContext.read.parquet("spark-warehouse/servernames")
I get the error:
pyspark.sql.utils.AnalysisException: Path does not exist: file:/C:/Users/FA/Google Drive/Colab Notebooks/Finance/Stock prediction/cc-pyspark/spark-warehouse/servernames
I think it could be caused by incorrect input. It was not clear to me how to prepare the input on Windows, so I called get-data.sh from Cygwin. As a result I got the following test_warc.txt:
file:/cygdrive/c/Users/FA.PROJECTOR-MSK/Google Диск/Colab Notebooks/Finance/stock prediction/cc-pyspark/crawl-data/CC-MAIN-2017-13/segments/1490218186353.38/warc/CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz
Is it correct?
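As for the earlier "already exists" error: note that the two messages spell the same location differently, with and without %20 URL-encoding of the white space. A common workaround, assuming any leftover table data is disposable, is to delete the warehouse folder before re-running (a sketch, run from the cc-pyspark folder):
Remove-Item -Recurse -Force '.\spark-warehouse' -ErrorAction SilentlyContinue  # removes leftover managed-table output
Regarding the input file: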
The paths look slightly different because of:
- %20
- C:/ -> /cygdrive/c
- FA -> FA.PROJECTOR-MSK
I cannot test which path variant will work (I'm on Linux). To eliminate potential errors I would make the base path as simple as possible: short, without any white space, and not a sub-folder of a directory with localized names.
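For example, a sketch of what that could look like (C:\cc-pyspark is just an arbitrary short location, and the file:/C:/... URI in test_warc.txt is an assumption I cannot verify):
# Copy the project to a short path without spaces or localized folder names
Copy-Item -Recurse '.\cc-pyspark' 'C:\cc-pyspark'  # run from the folder containing the checkout
cd C:\cc-pyspark
# Point test_warc.txt at the Windows location of the already downloaded WARC file
'file:/C:/cc-pyspark/crawl-data/CC-MAIN-2017-13/segments/1490218186353.38/warc/CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz' | Set-Content .\input\test_warc.txt
# Re-run the job from the new location
& $Env:SPARK_HOME/bin/spark-submit ./server_count.py --num_output_partitions 1 --log_level WARN ./input/test_warc.txt servernames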
It works, thanks
I tried to call:
But I am getting an error:
I installed Hadoop 3.0.0 from https://github.com/steveloughran/winutils.
I am running on Windows 7 with Python 3.6.6 (64-bit) and Java 8.
Full log: