JohnSnowLabs / spark-nlp

NerDLApproach does not accept HDFS path for graph folder #739

Closed dr-yhfan closed 2 years ago

dr-yhfan commented 4 years ago

Description

A Zeppelin or Jupyter notebook connected to a Spark cluster (Spark on YARN in client mode) was created to train an NER model. The Spark on YARN cluster is a service provided by Qubole (https://www.qubole.com) that we are evaluating. We also set up a simple one-node Spark on YARN cluster ourselves, solely for the purpose of writing this ticket.

Our simple pipeline consists of only WordEmbeddings, NerDLApproach, NerConverter, and Finisher. For WordEmbeddings, we use the public GloVe file glove.6B.100d.txt. For NerDLApproach, we defined custom tags, so our combination of tag count, embedding dimension, character count, and LSTM size differs from that of the graphs included in the package. We followed the instructions (https://nlp.johnsnowlabs.com/docs/en/graph) to generate a graph with our custom sizes.
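For reference, a sketch of the graph-generation step, assuming the tf_graph helper documented in later Spark NLP releases; the module name, build parameters, and size values here are our assumption, not the exact commands we ran (at the time we used the scripts from the instructions linked above):

    # Assumed API from later Spark NLP releases (3.x+); not what we ran at the time.
    from sparknlp.training import tf_graph

    # List the build parameters the "ner_dl" graph accepts (per the docs).
    tf_graph.print_model_params("ner_dl")

    # Build a graph for custom sizes; the values below are placeholders.
    tf_graph.build(
        "ner_dl",
        build_params={"ntags": 13, "embeddings_dim": 100, "nchars": 100, "lstm_size": 128},
        model_location="./tf_graphs",
    )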

In order for any executor to be able to access the embeddings file for WordEmbeddings and the TensorFlow graph folder for NerDLApproach at training time, we put both the embeddings file and the graph on HDFS. When fit is called on the pipeline, WordEmbeddings has no problem reading the embeddings file from HDFS. NerDLApproach finds the graph file on HDFS but throws an exception asking for a local path instead. Naively, I would expect both of them to accept HDFS paths or both of them to reject HDFS paths. Since one accepts HDFS and the other rejects it, either there is a bug or we missed something in our pipeline definition.
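A minimal sketch of the pipeline definition (the embeddings path and column names are placeholders; the graph folder path follows the layout visible in the error below):

    from pyspark.ml import Pipeline
    from sparknlp.base import Finisher
    from sparknlp.annotator import WordEmbeddings, NerDLApproach, NerConverter

    # Embeddings file read from HDFS: this step works.
    embeddings = WordEmbeddings() \
        .setStoragePath("hdfs://localhost:9000/user/hadoop/glove.6B.100d.txt", "TEXT") \
        .setDimension(100) \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    # Graph folder on HDFS: this is what fails with "Wrong FS".
    ner = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setGraphFolder("hdfs://localhost:9000/user/hadoop/app_supplement_data/tf_graphs")

    converter = NerConverter() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    finisher = Finisher().setInputCols(["ner_chunk"])

    pipeline = Pipeline(stages=[embeddings, ner, converter, finisher])
    model = pipeline.fit(training_data)  # NerDLApproach raises IllegalArgumentException here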

Expected Behavior

When a Jupyter notebook is connected to a Spark cluster on YARN, NerDLApproach().setGraphFolder(…) should accept an HDFS path as the location of the graph directory.

Current Behavior

The Jupyter notebook shows an error that looks like the following:

    Py4JJavaError: An error occurred while calling o38.fit.
    : java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/hadoop/app_supplement_data/tf_graphs/blstm-noncontrib_13_100_128_100.pb, expected: file:///

We see the exact same behavior both with Qubole’s Spark cluster and with our own Spark cluster.

Possible Solution

Steps to Reproduce

  1. As the user hadoop, activate the Python 3 virtualenv

    source ~/.jupyter_enterprise_gateway_env/bin/activate
  2. Start Hadoop cluster

    start-dfs.sh
    start-yarn.sh
  3. Start Jupyter Enterprise Gateway using the script from the attachment

    ./start_jupyter_enterprise_gateway.sh
  4. Start Jupyter Notebook

    jupyter notebook --gateway-url=http://127.0.0.1:8888 --GatewayClient.http_user=guest --GatewayClient.http_pwd=guest-password
  5. Establish an SSH tunnel from the local machine to the server running Jupyter Notebook. TCP ports 8888 and/or 8889 may also need to be open for Jupyter Enterprise Gateway. In our case, we do something like:

    gcloud compute --project "<GCP project ID>" ssh --zone "us-east1-b" --ssh-flag="-L 8889:localhost:8889 -C -N" "hadoop@<GCP VM name>"

  6. Open the Jupyter notebook from the attachment in a web browser.

Context

We need to have an NER system in production in Q1 2020. We have been evaluating Qubole's Spark cluster service for this purpose, but this issue is blocking us from making a decision. Qubole's engineers have been generously trying to find a workaround for us even though our free-trial period has already ended; we should not keep dragging this on.

Your Environment

This describes our own Spark cluster, not Qubole’s Spark cluster. We are able to reproduce the same issue using our own Spark cluster.

Attachments:

  requirements.txt
  jupyter_enterprise_gateway_customization.tar.gz
  supplement_data.tar.gz
  Example_Notebook_with_Spark_on_YARN_Client_Mode.ipynb.zip

dagarkatyal commented 4 years ago

The same problem happens in Databricks when using the setGraphFolder() property of NerDLApproach.

Below is the code snippet:

    nerDLTagger = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setGraphFolder("dbfs:/data/tensorflow/blstm_6_50_128_103.pb") \
        .setMaxEpochs(1) \
        .setRandomSeed(0) \
        .setVerbose(0)

    ner_dl_model = nerDLTagger.fit(training_data)

Error:

    java.io.FileNotFoundException: file or folder: dbfs:/data/tensorflow/blstm_6_50_128_103.pb not found

maziyarpanahi commented 4 years ago

We are about to release 2.5.2, which fixes this in Databricks; however, we would like it to be tested on HDFS and DBFS by you guys once it is out. https://github.com/JohnSnowLabs/spark-nlp/pull/925

maziyarpanahi commented 4 years ago

The fix has been released in 2.5.2.

alibell commented 4 years ago

We are about to release 2.5.2, which fixes this in Databricks; however, we would like it to be tested on HDFS and DBFS by you guys once it is out. (#925)

Hello, the issue persists on my HDFS installation when I set a setGraphFolder path on NerDLApproach:

    Wrong FS: hdfs://[...].pb, expected: file:///

maziyarpanahi commented 4 years ago

Are you using 2.5.2? Could you please share the code and the versions of all components in your environment?

alibell commented 4 years ago

I am currently using:

  spark-nlp 2.5.2
  Spark 2.4.3
  Hadoop 2.7.3
  PySpark with Python 3

What else do you need?

maziyarpanahi commented 4 years ago

Thanks. Could you also share which Hadoop distribution you are on and how you installed Spark NLP? I have tested this myself, but I will test again to be sure.

alibell commented 4 years ago

So, I just asked my big data engineer; he told me it is Hortonworks HDP-2.6.5.0.

I installed Spark NLP, and running sparknlp.version() returns 2.5.2.

maziyarpanahi commented 4 years ago

Great, thank you. I have a similar setup on Cloudera and will try to reproduce it via PySpark.

maziyarpanahi commented 2 years ago

This should now be possible: S3 can be set up via the Spark configuration, and NerDLApproach supports S3, HDFS, and DBFS in addition to the local file system.
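A minimal sketch of such a setup for S3, assuming the standard hadoop-aws (s3a) Spark settings; the package versions, bucket name, and credentials below are placeholders, not values confirmed in this thread:

    from pyspark.sql import SparkSession
    from sparknlp.annotator import NerDLApproach

    # Placeholder credentials and illustrative package versions.
    spark = SparkSession.builder \
        .appName("ner-training") \
        .config("spark.jars.packages",
                "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2,"
                "org.apache.hadoop:hadoop-aws:3.3.1") \
        .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>") \
        .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>") \
        .getOrCreate()

    # With the filesystem configured, the graph folder can point at S3
    # (or an HDFS/DBFS/local path).
    nerTagger = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setGraphFolder("s3a://<your-bucket>/tf_graphs")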

FreekBoeldersEntis commented 2 years ago

If people are still struggling with setting the graph folder in Databricks: for me it worked to write the graph folder into the /mnt/ directory of Databricks, as it is stored outside of the DBFS root. Also, make sure to supply the folder in which the graphs are available, not a graph file itself; TensorFlow will find the most suitable graph in that folder.

    .setGraphFolder("dbfs:/mnt/your_custom_graph_folder/")