Closed kuopching closed 1 year ago
Hi @kuopching
Could you please share a screenshot of the Spark UI Executors tab? I have a feeling you launch your Spark application from a local machine to a remote cluster, but the Driver becomes the local machine. All the training happens in the Driver, so when it goes to copy/move data it does so locally, while the graph is actually remote.
This is certainly an edge case, but your Spark UI screenshots will help us set up the env to reproduce this.
PS: I forgot to ask, we also need a complete description of your env; we see the IP address, localhost, etc. Is this a YARN cluster in Docker while you are launching your Spark app from local Windows? It would be great if we could have all the info to reproduce it, as this is a very unique case.
Hi @maziyarpanahi, thank you for the quick response.
Spark UI screenshot:
I'm starting the Spark cluster manually without YARN. I'm launching Spark/Hadoop from a local Windows machine with cmd:
C:\spark\bin>spark-class org.apache.spark.deploy.master.Master --host 192.168.2.151
C:\spark\bin>spark-class org.apache.spark.deploy.worker.Worker 192.168.2.151
C:\hadoop>sbin\start-dfs.cmd
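A small aside on the worker command above: the `Worker` class expects a full `spark://` master URL rather than a bare IP, so if the worker ever fails to register, the usual form (assuming the Master stayed on its default standalone port 7077) would be:

```
REM hypothetical variant: connect the worker to the master started above,
REM assuming the default standalone master port 7077
C:\spark\bin>spark-class org.apache.spark.deploy.worker.Worker spark://192.168.2.151:7077
```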
Spark conf: default from clean installation
Hadoop conf: core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:19000</value>
  </property>
</configuration>
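One note on this config: as raised later in the thread, fs.defaultFS has to be reachable from every node, so once more workers join, localhost will no longer resolve to the NameNode for them. A hypothetical multi-node-friendly variant would bind it to a routable address:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- routable address so every worker resolves the same NameNode -->
    <value>hdfs://192.168.2.151:19000</value>
  </property>
</configuration>
```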
Thanks for the extra information. So this is not really a cluster; it's more of a master/slave setup. I am not sure if this is possible in an unmanaged cluster (if it were managed like YARN or K8s, this would work, as it has all the info).
Since it's a standalone cluster and the training (Approach) happens in the Driver, I think it confuses the Driver/App being your local machine when it tries to do some file-related ops. Is it possible to have your App/Driver also be part of the cluster (like YARN cluster mode)?
PS: I forgot again to ask, what is 192.168.2.151? Is this a VM? And how does a worker on 192.168.2.151 have access to hdfs://localhost:19000? (Shouldn't fs.defaultFS be accessible from all the nodes?)
192.168.2.151 is the IP of my local machine. I replaced localhost with 192.168.2.151 in the code and config as well, and the behavior is the same.
> I am not sure if this is possible in a not managed cluster

Is there a difference between loading the graph and loading the dataset/word embeddings from HDFS?

> Is it possible to have your App/Driver also part of the cluster (like YARN cluster-mode)?

Can you please elaborate on this? Right now I'm launching the Spark NLP Java project from an IDE.
> 192.168.2.151 is the IP of my local machine. I replaced localhost with 192.168.2.151 in the code and config as well, and the behavior is the same.

So this is just standalone Spark, like local mode (local[*]), since there is only one machine. Why have master and worker set, plus .config("spark.submit.deployMode", "cluster")? It's not really a cluster on a single machine.

> I am not sure if this is possible in a not managed cluster
> Is there a difference between loading the graph and loading the dataset/word embeddings from HDFS?
Yes, in those cases you are using native Apache Spark .read and .load, which we have extended natively. For the graph file, we have to list the directory, copy it locally, and then read it.
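To make the distinction concrete: the scheme on the URI is what separates the two cases. Spark's .read/.load receive the hdfs:// scheme intact, whereas a local copy step that ends up with a bare filesystem path resolves against the local filesystem. A minimal sketch in plain Java (the helper name is hypothetical, not Spark NLP's actual code):

```java
import java.net.URI;

public class GraphPathDemo {
    // Hypothetical helper: report which filesystem a path string points at.
    // A URI with an explicit scheme ("hdfs") targets that filesystem; a bare
    // path has no scheme and is treated as local ("file").
    static String schemeOf(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null ? "file" : scheme;
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("hdfs://localhost:19000/graph"));             // hdfs
        System.out.println(schemeOf("/C:/Users/USER1/AppData/Local/Temp/graph")); // file
    }
}
```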
It should still be possible if we tweak a few things, but I want to be sure I understand the use case: why set master/worker and cluster as deployMode? What's the advantage if everything is on a single machine?
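For reference, the two setups being contrasted here could be written with a standard SparkSession builder. This is a configuration sketch under stated assumptions (hypothetical app names, the default standalone master port 7077); it needs Spark on the classpath, so it is not a standalone runnable snippet:

```java
import org.apache.spark.sql.SparkSession;

public class SparkSetups {
    public static void main(String[] args) {
        // Single-machine setup: an in-process master, no master/worker
        // daemons and no deploy mode needed
        SparkSession local = SparkSession.builder()
            .appName("nerdl-local")               // hypothetical name
            .master("local[*]")
            .getOrCreate();

        // Standalone master/worker setup: the app connects as a client.
        // Launching from an IDE means the driver is this JVM, so "cluster"
        // deploy mode does not apply here.
        SparkSession standalone = SparkSession.builder()
            .appName("nerdl-standalone")          // hypothetical name
            .master("spark://192.168.2.151:7077") // assumes default port 7077
            .config("spark.submit.deployMode", "client")
            .getOrCreate();
    }
}
```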
Yes, you are right. It is not a real cluster so far; by cluster I meant a cluster of Spark workers.
The goal of this exercise is only a POC/testing for now. I'm trying to get it up and running and then spread it to other machines and add more workers.
Anyway, I changed .config("spark.submit.deployMode", "cluster") to .config("spark.submit.deployMode", "client").
Hi, do you need more information from me?
> Anyway I changed .config("spark.submit.deployMode", "cluster") to .config("spark.submit.deployMode", "client")
Sorry, I thought this was a positive response on your part. But I am assuming deployMode is only available for YARN and K8s clusters and not for a standalone master/worker setup.
Would it be possible to use something more realistic for a cluster, like https://github.com/maziyarpanahi/docker-spark-yarn-cluster? I use this for cluster development from my local machine (not exactly how you launch the app, though).
It would be great if you could test this and see if you encounter a similar issue; that way we can easily reproduce it.
Hi @maziyarpanahi, I can confirm that your setup works like a charm. Loading the graph from HDFS works. Obviously the error was a setup issue on my side.
Thank you
@kuopching I am glad that worked out and thanks for confirming this, I appreciate it.
Description
I am training a NerDL model using spark-nlp 4.2.4 in Spark standalone mode with 1 worker. I am not able to access a custom graph in HDFS storage. When I try:

NerDLApproach nerTagger = new NerDLApproach();
nerTagger.setGraphFolder("hdfs://localhost:19000/graph");

then I get the error:

ERROR Instrumentation: java.lang.IllegalArgumentException: Pathname /C:/Users/USER1/AppData/Local/Temp/sparknlp_tmp_11909547575733209894/graph from C:/Users/USER1/AppData/Local/Temp/sparknlp_tmp_11909547575733209894/graph is not a valid DFS filename.
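For background on the message itself: HDFS rejects pathnames whose components contain a ':', which is exactly what the Windows drive letter introduces once the hdfs:// scheme has been stripped and the graph is staged under a local temp directory. A simplified, hypothetical re-implementation of that check:

```java
public class DfsNameCheck {
    // Simplified, hypothetical version of the pathname check HDFS performs:
    // a DFS filename must be absolute and no path component may contain ':'.
    static boolean isValidDfsName(String src) {
        if (!src.startsWith("/")) {
            return false;
        }
        for (String component : src.split("/")) {
            if (component.contains(":")) {
                return false; // e.g. the Windows drive letter "C:"
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The staged temp path from the stack trace fails on the "C:" component
        System.out.println(isValidDfsName("/C:/Users/USER1/AppData/Local/Temp/graph")); // false
        // The intended HDFS folder is fine
        System.out.println(isValidDfsName("/graph"));                                   // true
    }
}
```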
Expected Behavior
NerDLApproach().setGraphFolder should be able to access custom graphs in the HDFS file system
Current Behavior
Loading the graph from the local filesystem works with "embedded Spark" (com.johnsnowlabs.nlp.SparkNLP.start()). Loading the training dataset and WordEmbeddings from HDFS works.
Loading the custom graph from HDFS doesn't work as expected. The folder /graph contains the file blstm_17_300_128_237.pb.
Exception is thrown:
Possible Solution
Steps to Reproduce
Context
Unable to train an NerDL model with custom graph
Your Environment
spark-nlp_2.12 4.2.4
Apache Spark version: 3.3.0