JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

The summarization model(s) are not giving any result #13898

Closed AayushSameerShah closed 1 year ago

AayushSameerShah commented 1 year ago

Hello, sorry for posting this under the "documentation" label, but I thought it was more appropriate than "bug". The problem I am facing is of two kinds.

1️⃣ The model is not downloading at all
2️⃣ The model can be downloaded but cannot run inference

Let me brief you about them.


Before going through either problem, here is the starter code I use, to provide some context:

# Using Colab for these tests
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

# Start the Spark session (with GPU)
import sparknlp
spark = sparknlp.start(gpu=True)

print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))
>>> Spark NLP version: 5.0.1
>>> Apache Spark version: 3.2.3

from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *
from sparknlp.annotator import *

Now, let me show the error-prone code.

1️⃣ Model is not downloading at all

📝 The model page here: Page

Model Name: bart_large_cnn

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

bart = BartTransformer.pretrained("bart_large_cnn") \
            .setTask("summarize:") \
            .setMaxOutputLength(200) \
            .setInputCols(["documents"]) \
            .setOutputCol("summaries")

pipeline = Pipeline().setStages([documentAssembler, bart])

Throws this error:

(error in short)

raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
[OK!]

(error in long)

bart_large_cnn download started this may take some time.
Approximate size to download 1 GB
[ — ]----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 45808)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/local/lib/python3.10/dist-packages/pyspark/accumulators.py", line 262, in handle
    poll(accum_updates)
  File "/usr/local/lib/python3.10/dist-packages/pyspark/accumulators.py", line 235, in poll
    if func():
  File "/usr/local/lib/python3.10/dist-packages/pyspark/accumulators.py", line 239, in accum_updates
    num_updates = read_int(self.rfile)
  File "/usr/local/lib/python3.10/dist-packages/pyspark/serializers.py", line 564, in read_int
    raise EOFError
EOFError
----------------------------------------
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
[OK!]
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
[<ipython-input-12-6cb90d39d472>](https://localhost:8080/#) in <cell line: 1>()
----> 1 bart = BartTransformer.pretrained("bart_large_cnn") \
      2             .setTask("summarize:") \
      3             .setMaxOutputLength(200) \
      4             .setInputCols(["documents"]) \
      5             .setOutputCol("summaries")

8 frames
[/usr/local/lib/python3.10/dist-packages/py4j/protocol.py](https://localhost:8080/#) in get_return_value(answer, gateway_client, target_id, name)
    332                     format(target_id, ".", name, value))
    333         else:
--> 334             raise Py4JError(
    335                 "An error occurred while calling {0}{1}{2}".
    336                 format(target_id, ".", name))

Py4JError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel

2️⃣ Model is downloaded but can't run inference

I have seen this problem with these two models.
📝 Model-1 page here: Page-1
📝 Model-2 page here: Page-2

The code:

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

bart = BartTransformer.pretrained("distilbart_cnn_6_6") \
            .setTask("summarize:") \
            .setInputCols(["documents"]) \
            .setOutputCol("summaries") \
            .setMaxOutputLength(128) \
            .setTemperature(.2) \
            .setDoSample(True) 

pipeline = Pipeline().setStages([documentAssembler, bart])

After the successful download, I run the inference as below:

data = spark.createDataFrame([["A LONG PARAGRAPH"]]).toDF("text")
result = pipeline.fit(data).transform(data)

summary = []
for row in result.select("summaries").collect():
    summary.append(row["summaries"][0]["result"])

And it gives this error:

Py4JJavaError                             Traceback (most recent call last)
[<ipython-input-11-c4bd6196dafa>](https://localhost:8080/#) in <cell line: 5>()
      3 result = pipeline.fit(data).transform(data)
      4 summary = []
----> 5 for row in result.select("summaries").collect():
      6     summary.append(row["summaries"][0]["result"])
      7 

3 frames
[/usr/local/lib/python3.10/dist-packages/py4j/protocol.py](https://localhost:8080/#) in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o313.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 8) (6fce83427708 executor driver): org.tensorflow.exceptions.TFInvalidArgumentException: 2 root error(s) found.
  (0) INVALID_ARGUMENT: required broadcastable shapes
     [[{{function_node __inference_decoder_cached_serving_808693}}{{node decoder/layers.0/encoder_attn/add}}]]
     [[StatefulPartitionedCall/_2389]]
  (1) INVALID_ARGUMENT: required broadcastable shapes
     [[{{function_node __inference_decoder_cached_serving_808693}}{{node decoder/layers.0/encoder_attn/add}}]]
0 successful operations.
0 derived errors ignored.

It's a long error... but it should give the context. It seems like a problem in .collect(); I also tried .select("summaries").first() and it gives another, related error.
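As a side note, the extraction loop above just indexes into the collected annotation rows; here is a minimal, Spark-free sketch of the row structure it expects (the `rows` data below is mocked for illustration, not real model output):

```python
# Mocked stand-in for result.select("summaries").collect():
# each row carries a "summaries" list of annotations,
# and each annotation has a "result" field with the text.
rows = [
    {"summaries": [{"result": "First document summary."}]},
    {"summaries": [{"result": "Second document summary."}]},
]

# Same extraction as the loop above: take the first annotation's result per row.
summary = [row["summaries"][0]["result"] for row in rows]
print(summary)  # ['First document summary.', 'Second document summary.']
```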


Thank you 🙏🏻

DevinTDHa commented 1 year ago

Hi, thanks for reporting! @prabod I think this might be related to your feature?

AayushSameerShah commented 1 year ago

Hi @prabod! Any update on this? I have found other models that cause this issue too 😢

Thanks!

maziyarpanahi commented 1 year ago

@AayushSameerShah I tried to reproduce this issue, but I cannot reproduce it either on Colab or locally with a GPU.

You can check this notebook:

I am afraid we might need the actual issue template filled in with all the details and required versions so we can try to reproduce it. (You chose a documentation template, which does not ask for that extra information.)

issue template: https://github.com/JohnSnowLabs/spark-nlp/issues/new?assignees=maziyarpanahi&labels=question&projects=&template=bug_report.yml

AayushSameerShah commented 1 year ago

Hi @maziyarpanahi 👋🏻 Thanks for the code and solution. I have tried the notebook you provided. There are still some problems: on the first run the inference works, but when we change a parameter like setDoSample, it throws the same error on .collect().

I know I should have used the "issue" template; I apologize for this. I have created a Colab that runs linearly and reproduces the error. It is annotated, so it should be easy to follow.

Link: https://colab.research.google.com/drive/1xsYGjHcPxnh4e5UZxYj-hLT7BCFYP_0H?usp=sharing

Please let me know if it doesn't work. Thank you so much! 🤗

AayushSameerShah commented 1 year ago

@maziyarpanahi Hope you were able to catch that error 🤗

maziyarpanahi commented 1 year ago

Hi @AayushSameerShah yes, your notebook was very helpful and we are debugging it currently.

AayushSameerShah commented 1 year ago

Hello! Thank you for addressing this issue... but I am still not able to run this properly. For example, in the same Colab linked above, trying to download the bart_cnn model still shows the error.

Thanks.

maziyarpanahi commented 1 year ago

The code and model that were previously failing now work in 5.0.2:

A few things to keep in mind:

AayushSameerShah commented 1 year ago

Hi, I am not really sure if this is because of the "large model". This time I tried to use the https://sparknlp.org/2023/01/30/t5_efficient_large_dl2_en.html model, which is around 700 MB and... I suppose that should fit in the 12 GB RAM instance.

[screenshot]

In the image above, I download the model first and then try to load it, expecting that to work as a workaround...

😕

maziyarpanahi commented 1 year ago

Hi @AayushSameerShah The initial bug report was for the BART annotator; this seems to be T5. (I also see "Answer from Java side is empty", which means there is an issue either with starting your SparkSession, or it was started and then killed due to OOM.)

In any case, if you have a notebook that can reproduce it (without loading an offline model), please create a new issue for the T5 annotator.
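If OOM is the culprit, one thing worth trying (a sketch; `memory` is a parameter that recent `sparknlp.start` releases accept, and the "16G" value here is only illustrative) is giving the driver more heap when starting the session:

```python
import sparknlp

# Request a larger driver heap so the model fits;
# adjust "16G" to what the machine actually has available.
spark = sparknlp.start(gpu=True, memory="16G")
```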

AayushSameerShah commented 1 year ago

Yeah, it's true the original issue was about the BART model... but the general error seems to be the same; in both cases the same error was thrown. Sure, I will open an issue for the T5 model with reproducible code.

Thanks.