The same problem happens in Databricks when using the setGraphFolder() property of NerDLApproach.
Below is the code snippet:

```python
from sparknlp.annotator import NerDLApproach

nerDLTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setGraphFolder("dbfs:/data/tensorflow/blstm_6_50_128_103.pb") \
    .setMaxEpochs(1) \
    .setRandomSeed(0) \
    .setVerbose(0)

ner_dl_model = nerDLTagger.fit(training_data)
```
Error:

```
java.io.FileNotFoundException: file or folder: dbfs:/data/tensorflow/blstm_6_50_128_103.pb not found
```
We are about to release 2.5.2, which has fixed this in Databricks; however, we would like it to be tested on HDFS and DBFS by you once it is out.
https://github.com/JohnSnowLabs/spark-nlp/pull/925
The fix has been released in 2.5.2.
Hello, the issue persists on my HDFS installation when I set a graph folder path via setGraphFolder on NerDLApproach:

```
Wrong FS: hdfs://[...].pb, expected: file:///
```
Are you using 2.5.2? Could you please share the code and the versions of all components in your environment?
I am currently using:

- spark-nlp 2.5.2
- spark 2.4.3
- hadoop 2.7.3
- pyspark with python3

What else do you need?
Thanks, also:
What kind of Cluster setup is it? (Cloudera/Hortonworks,etc.)
How did you install/use Spark NLP in PySpark?
Have you checked sparknlp.version() (e.g., the sketch below) to be sure you correctly updated to 2.5.2?
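For reference, a minimal sketch of that check (assuming a PySpark session started through Spark NLP; the expected version numbers are the ones reported in this thread):

```python
import sparknlp

# Start (or reuse) a Spark session with Spark NLP on the classpath.
spark = sparknlp.start()

print(sparknlp.version())  # expected: 2.5.2
print(spark.version)       # expected: 2.4.3
```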
I have tested it myself, but I will test again to be sure.
So, I just asked my big data engineer; he told me it is Hortonworks HDP-2.6.5.0.
I installed Spark NLP:
I've run sparknlp.version(), which returns 2.5.2.
Great, thank you. I have a similar setup on Cloudera and will try to reproduce it via PySpark.
This should be possible via the Spark configuration for S3; NerDLApproach supports S3, HDFS, and DBFS in addition to the local file system.
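As an illustration only, a minimal sketch of what such a Spark configuration for S3 might look like (the hadoop-aws coordinates and credential keys below are assumptions about a typical s3a setup, not something confirmed in this thread):

```python
from pyspark.sql import SparkSession

# Hypothetical session config for reading s3a:// paths; the package
# versions must match your Spark/Hadoop build (Spark 2.4.x / Hadoop 2.7.x here).
spark = SparkSession.builder \
    .appName("spark-nlp-s3") \
    .config("spark.jars.packages",
            "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.2,"
            "org.apache.hadoop:hadoop-aws:2.7.3") \
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY>") \
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_KEY>") \
    .getOrCreate()
```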
If people are still struggling with setting the graph folder in Databricks: for me, it worked to write the graph folder into the /mnt/ directory of Databricks, as it is stored outside of the DBFS root. Also make sure to supply the folder in which the graphs are available, not a graph file itself; TensorFlow will find the most suitable graph in that folder.
.setGraphFolder("dbfs:/mnt/your_custom_graph_folder/")
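Putting that together, a minimal sketch of the annotator configured against a mounted graph folder (the folder name is the placeholder from above; note that it is a directory, unlike the .pb file path in the first comment):

```python
from sparknlp.annotator import NerDLApproach

# Point setGraphFolder at the folder containing the graph .pb files,
# not at a single graph file; Spark NLP selects a matching graph itself.
nerDLTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setGraphFolder("dbfs:/mnt/your_custom_graph_folder/") \
    .setMaxEpochs(1)
```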
Description
A Zeppelin or Jupyter notebook connected to a Spark cluster (Spark on YARN in client mode) was created to train a model for NER. The Spark on YARN cluster is a service provided by Qubole (https://www.qubole.com) that we are evaluating. We also created a simple Spark on YARN cluster with one node ourselves solely for the purpose of writing this ticket.
Our simple pipeline consists of only `WordEmbeddings`, `NerDLApproach`, `NerConverter`, and `Finisher`. For `WordEmbeddings`, we use the public GloVe file `glove.6B.100d.txt`. For `NerDLApproach`, we defined our custom tags, so we have a combination of tag count, embedding dimension, character count, and LSTM size different from that of the graphs included in the package. We followed the instructions (https://nlp.johnsnowlabs.com/docs/en/graph) to generate a graph with our customized sizes.

In order for any executor to be able to access the embeddings file for `WordEmbeddings` and the TensorFlow graph folder for `NerDLApproach` at training time, we put both the embeddings file and the graph on Hadoop. When `fit` is called by the pipeline, `WordEmbeddings` has no problem reading the embeddings file on HDFS. `NerDLApproach` finds the graph file on HDFS but throws an exception, asking for a local path instead. Naively, I would assume either both of them would accept HDFS paths or both would reject them. When one accepts HDFS and the other rejects it, either there is a bug or we missed something in our pipeline definition.
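A minimal sketch of the pipeline described above (the paths, dimension, and column names are illustrative placeholders, not our exact notebook code):

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import WordEmbeddings, NerDLApproach, NerConverter
from sparknlp.base import Finisher

# Both the embeddings file and the custom graph folder live on HDFS so
# that every executor can reach them; the paths are placeholders.
embeddings = WordEmbeddings() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setStoragePath("hdfs:///data/glove.6B.100d.txt", "TEXT") \
    .setDimension(100)

ner = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setGraphFolder("hdfs:///data/tensorflow_graphs/")  # this is what fails

converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

finisher = Finisher().setInputCols(["ner_chunk"])

pipeline = Pipeline(stages=[embeddings, ner, converter, finisher])
# model = pipeline.fit(training_data)  # WordEmbeddings reads HDFS fine;
#                                      # NerDLApproach rejects the HDFS path
```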
Expected Behavior

When a Jupyter notebook is connected to a Spark cluster on YARN, `NerDLApproach().setGraphFolder(…)` should accept an HDFS path as the location of the graph directory.

Current Behavior
The Jupyter notebook shows an error that looks like the following:
We see the exact same behavior both with Qubole’s Spark cluster and with our own Spark cluster.
Possible Solution
Steps to Reproduce
1. As the user `hadoop`, activate the Python 3 virtualenv.
2. Start the Hadoop cluster.
3. Start Jupyter Enterprise Gateway using the script from the attachment.
4. Start Jupyter Notebook.
5. Establish an SSH tunnel from the local machine to the server running Jupyter Notebook. In our case, we do something like `gcloud compute --project "<GCP project ID>" ssh --zone "us-east1-b" --ssh-flag="-L 8889:localhost:8889 -C -N" "hadoop@<GCP VM name>"`. TCP ports `8888` and/or `8889` may also need to be open for Jupyter Enterprise Gateway.
6. Open the Jupyter notebook from the attachment in a web browser.
Context
We need to establish an NER system in our production environment in Q1 2020. We have been evaluating Qubole's Spark cluster service for this purpose. However, we are blocked from making a decision because of this issue. Qubole's engineers have very generously been trying to figure out whether there is a workaround for us, even though we are already past our free-trial period. We should not keep dragging this on.
Your Environment
This describes our own Spark cluster, not Qubole’s Spark cluster. We are able to reproduce the same issue using our own Spark cluster.
- Downloaded http://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz and unzipped it to `/opt/spark`.
- Downloaded https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz and unzipped it to `/home/hadoop/hadoop-2.7.3`. We followed this instruction (https://tecadmin.net/setup-hadoop-on-ubuntu/) to configure Hadoop. The `hadoop` user account and `/home/hadoop` need to exist.
- `pip install tensorflow==1.12.0`. We followed this instruction (https://tecadmin.net/install-python-3-6-ubuntu-linuxmint/) to install Python 3.6.6 instead of 3.6.9.
- `python3.6 -m venv .jupyter_enterprise_gateway_env`
- `pip install -r requirements.txt`. `requirements.txt` is attached to the ticket.
- Copied the `spark_python_yarn_client` folder from the attachment to `.jupyter_enterprise_gateway_env/share/jupyter/kernels/`. Our version is a modified copy of the example from https://github.com/jupyter/enterprise_gateway/releases/download/v2.0.0/jupyter_enterprise_gateway_kernelspecs-2.0.0.tar.gz
- Downloaded `glove.6B.100d.txt` and copied it to Hadoop. We cannot attach it to the ticket because it is too large for GitHub.

Attachments: `requirements.txt`, `jupyter_enterprise_gateway_customization.tar.gz`, `supplement_data.tar.gz`, `Example_Notebook_with_Spark_on_YARN_Client_Mode.ipynb.zip`