JohnSnowLabs / spark-nlp

NerDLApproach does not accept HDFS path for graph folder #739

Closed dr-yhfan closed 2 years ago

dr-yhfan commented 4 years ago

Description

A Zeppelin or Jupyter notebook connected to a Spark cluster (Spark on YARN in client mode) was created to train an NER model. The Spark on YARN cluster is a service provided by Qubole (https://www.qubole.com) that we are evaluating. We also set up a simple one-node Spark on YARN cluster ourselves, solely for the purpose of writing this ticket.

Our simple pipeline consists of only WordEmbeddings, NerDLApproach, NerConverter, and Finisher. For WordEmbeddings, we use the public GloVe file glove.6B.100d.txt. For NerDLApproach, we defined custom tags, so our combination of tag count, embedding dimension, character count, and LSTM size differs from that of the graphs included in the package. We followed the instructions (https://nlp.johnsnowlabs.com/docs/en/graph) to generate a graph with our custom sizes.
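For reference, a sketch of the graph-generation step, assuming the tf_graph helper documented in later Spark NLP releases; the module name, build parameters, and size values here are our assumption, not the exact commands we ran (at the time we used the scripts from the instructions linked above):

    # Assumed API from later Spark NLP releases (3.x+); not what we ran at the time.
    from sparknlp.training import tf_graph

    # List the build parameters the "ner_dl" graph accepts (per the docs).
    tf_graph.print_model_params("ner_dl")

    # Build a graph for custom sizes; the values below are placeholders.
    tf_graph.build(
        "ner_dl",
        build_params={"ntags": 13, "embeddings_dim": 100, "nchars": 100, "lstm_size": 128},
        model_location="./tf_graphs",
    )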

In order for any executor to be able to access the embeddings file for WordEmbeddings and the TensorFlow graph folder for NerDLApproach at training time, we put both the embeddings file and the graph on HDFS. When fit is called on the pipeline, WordEmbeddings has no problem reading the embeddings file from HDFS. NerDLApproach finds the graph file on HDFS but throws an exception asking for a local path instead. Naively, I would expect both of them to accept HDFS paths or both of them to reject HDFS paths. Since one accepts HDFS and the other rejects it, either there is a bug or we missed something in our pipeline definition.
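A minimal sketch of the pipeline definition (the embeddings path and column names are placeholders; the graph folder path follows the layout visible in the error below):

    from pyspark.ml import Pipeline
    from sparknlp.base import Finisher
    from sparknlp.annotator import WordEmbeddings, NerDLApproach, NerConverter

    # Embeddings file read from HDFS: this step works.
    embeddings = WordEmbeddings() \
        .setStoragePath("hdfs://localhost:9000/user/hadoop/glove.6B.100d.txt", "TEXT") \
        .setDimension(100) \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    # Graph folder on HDFS: this is what fails with "Wrong FS".
    ner = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setGraphFolder("hdfs://localhost:9000/user/hadoop/app_supplement_data/tf_graphs")

    converter = NerConverter() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    finisher = Finisher().setInputCols(["ner_chunk"])

    pipeline = Pipeline(stages=[embeddings, ner, converter, finisher])
    model = pipeline.fit(training_data)  # NerDLApproach raises IllegalArgumentException here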

Expected Behavior

When a Jupyter notebook is connected to a Spark cluster on YARN, NerDLApproach().setGraphFolder(…) should accept an HDFS path as the location of the graph directory.

Current Behavior

The Jupyter notebook shows an error that looks like the following:

    Py4JJavaError: An error occurred while calling o38.fit.
    : java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/hadoop/app_supplement_data/tf_graphs/blstm-noncontrib_13_100_128_100.pb, expected: file:///

We see the exact same behavior both with Qubole’s Spark cluster and with our own Spark cluster.

Possible Solution

Steps to Reproduce

  1. As the user hadoop, activate the Python 3 virtualenv

    source ~/.jupyter_enterprise_gateway_env/bin/activate
  2. Start Hadoop cluster

    start-dfs.sh
    start-yarn.sh
  3. Start Jupyter Enterprise Gateway using the script from the attachment

    ./start_jupyter_enterprise_gateway.sh
  4. Start Jupyter Notebook

    jupyter notebook --gateway-url=http://127.0.0.1:8888 --GatewayClient.http_user=guest --GatewayClient.http_pwd=guest-password
  5. Establish an SSH tunnel from the local machine to the server running Jupyter Notebook. TCP ports 8888 and/or 8889 may also need to be open for Jupyter Enterprise Gateway. In our case, we do something like:

    gcloud compute --project "<GCP project ID>" ssh --zone "us-east1-b" --ssh-flag="-L 8889:localhost:8889 -C -N" "hadoop@<GCP VM name>"

  6. Open the Jupyter notebook from the attachment in a web browser.

Context

We need to have an NER system in production in Q1 2020. We have been evaluating Qubole's Spark cluster service for this purpose, but this issue is blocking us from making a decision. Qubole's engineers have been generously trying to find a workaround for us even though our free-trial period has already ended; we should not keep dragging this on.

Your Environment

This describes our own Spark cluster, not Qubole’s Spark cluster. We are able to reproduce the same issue using our own Spark cluster.

Attachments:

  requirements.txt
  jupyter_enterprise_gateway_customization.tar.gz
  supplement_data.tar.gz
  Example_Notebook_with_Spark_on_YARN_Client_Mode.ipynb.zip

dagarkatyal commented 4 years ago

The same problem happens in Databricks when using the setGraphFolder() property of NerDLApproach.

Below is the code snippet:

    nerDLTagger = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setGraphFolder("dbfs:/data/tensorflow/blstm_6_50_128_103.pb") \
        .setMaxEpochs(1) \
        .setRandomSeed(0) \
        .setVerbose(0)

    ner_dl_model = nerDLTagger.fit(training_data)

Error:

    java.io.FileNotFoundException: file or folder: dbfs:/data/tensorflow/blstm_6_50_128_103.pb not found

maziyarpanahi commented 4 years ago

We are about to release 2.5.2, which fixes this in Databricks; however, we would like it to be tested on HDFS and DBFS by you guys once it is out. https://github.com/JohnSnowLabs/spark-nlp/pull/925

maziyarpanahi commented 4 years ago

The fix has been released in 2.5.2.

alibell commented 4 years ago

We are about to release 2.5.2, which fixes this in Databricks; however, we would like it to be tested on HDFS and DBFS by you guys once it is out. (#925)

Hello, the issue persists on my HDFS installation when I set a setGraphFolder path on NerDLApproach:

    Wrong FS: hdfs://[...].pb, expected: file:///

maziyarpanahi commented 4 years ago

Are you using 2.5.2? Could you please share the code and the versions of all components in your environment?

alibell commented 4 years ago

I am currently using:

  spark-nlp 2.5.2
  Spark 2.4.3
  Hadoop 2.7.3
  PySpark with Python 3

What else do you need?

maziyarpanahi commented 4 years ago

Thanks. Could you also share which Hadoop distribution you are on and how you installed Spark NLP? I have tested this myself, but I will test again to be sure.

alibell commented 4 years ago

So, I just asked my big data engineer; he told me it is Hortonworks HDP-2.6.5.0.

I installed Spark NLP, and running sparknlp.version() returns 2.5.2.

maziyarpanahi commented 4 years ago

Great, thank you. I have a similar setup on Cloudera and will try to reproduce it via PySpark.

maziyarpanahi commented 2 years ago

This should now be possible: S3 can be set up via the Spark configuration, and NerDLApproach supports S3, HDFS, and DBFS in addition to the local file system.
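A minimal sketch of such a setup for S3, assuming the standard hadoop-aws (s3a) Spark settings; the package versions, bucket name, and credentials below are placeholders, not values confirmed in this thread:

    from pyspark.sql import SparkSession
    from sparknlp.annotator import NerDLApproach

    # Placeholder credentials and illustrative package versions.
    spark = SparkSession.builder \
        .appName("ner-training") \
        .config("spark.jars.packages",
                "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2,"
                "org.apache.hadoop:hadoop-aws:3.3.1") \
        .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>") \
        .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>") \
        .getOrCreate()

    # With the filesystem configured, the graph folder can point at S3
    # (or an HDFS/DBFS/local path).
    nerTagger = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setGraphFolder("s3a://<your-bucket>/tf_graphs")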

FreekBoeldersEntis commented 2 years ago

If people are still struggling with setting the graph folder in Databricks: for me it worked to write the graph folder into the /mnt/ directory of Databricks, as it is stored outside of the DBFS root. Also, make sure to supply the folder in which the graphs are available, not a graph file itself; TensorFlow will find the most suitable graph in that folder.

    .setGraphFolder("dbfs:/mnt/your_custom_graph_folder/")