JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Crash when trying to translate using Marian #6432

Closed daveaitel closed 2 years ago

daveaitel commented 2 years ago

An exception results from trying to translate a French string through Spark NLP and the Marian models.

Description

127.0.0.1 - - [06/Nov/2021 04:05:26] "POST /langid HTTP/1.1" 200 -
Translate [fr] data: Main Account: @Zorto_ | Since 2009. | Producer | Programmer | zorto@mail.com Xbox: infirmary
[2021-11-06 04:39:47,037] ERROR in app: Exception on /translate [POST]
Traceback (most recent call last):
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/minecraft/CODE/DARPA/sentiment_service.py", line 169, in get_translate
    result += "".join(translatepipeline[langid].annotate(chunk)["translation"])
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 181, in annotate
    return pipeline.annotate(target)
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/sparknlp/base.py", line 165, in annotate
    annotations = self._lightPipeline.annotateJava(target)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o950.annotateJava.
: org.tensorflow.exceptions.TFInvalidArgumentException: indices[512] = 512 is not in [0, 512)
         [[{{node decoder/embed_positions/Gather}}]]
        at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:87)
        at org.tensorflow.Session.run(Session.java:691)
        at org.tensorflow.Session.access$100(Session.java:72)
        at org.tensorflow.Session$Runner.runHelper(Session.java:381)
        at org.tensorflow.Session$Runner.run(Session.java:329)
        at com.johnsnowlabs.ml.tensorflow.TensorflowMarian.process(TensorflowMarian.scala:146)
        at com.johnsnowlabs.ml.tensorflow.TensorflowMarian.$anonfun$generateSeq2Seq$2(TensorflowMarian.scala:254)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
        at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
        at com.johnsnowlabs.ml.tensorflow.TensorflowMarian.generateSeq2Seq(TensorflowMarian.scala:251)
        at com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer.$anonfun$batchAnnotate$3(MarianTransformer.scala:289)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike.map(TraversableLike.scala:238)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
        at scala.collection.immutable.List.map(List.scala:298)
        at com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer.batchAnnotate(MarianTransformer.scala:283)
        at com.johnsnowlabs.nlp.LightPipeline.$anonfun$fullAnnotate$1(LightPipeline.scala:49)
        at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
        at com.johnsnowlabs.nlp.LightPipeline.fullAnnotate(LightPipeline.scala:37)
        at com.johnsnowlabs.nlp.LightPipeline.annotate(LightPipeline.scala:113)
        at com.johnsnowlabs.nlp.LightPipeline.annotateJava(LightPipeline.scala:129)
        at jdk.internal.reflect.GeneratedMethodAccessor94.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:829)

From a pyspark session:

dave_aitel@minecraft-1:/minecraft/CODE/DARPA$ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/minecraft/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
:: loading settings :: url = jar:file:/minecraft/spark-3.1.2-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/dave_aitel/.ivy2/cache
The jars for the packages stored in: /home/dave_aitel/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f85b28b3-0bbb-4ddd-88f8-a7c84b47bed3;1.0
        confs: [default]
        found com.johnsnowlabs.nlp#spark-nlp_2.12;3.2.3 in central
        found com.typesafe#config;1.4.1 in central
        found org.rocksdb#rocksdbjni;6.5.3 in central
        found com.amazonaws#aws-java-sdk-bundle;1.11.603 in central
        found com.github.universal-automata#liblevenshtein;3.0.0 in central
        found com.google.code.findbugs#annotations;3.0.1 in central
        found net.jcip#jcip-annotations;1.0 in central
        found com.google.code.findbugs#jsr305;3.0.1 in central
        found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
        found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
        found com.google.code.gson#gson;2.3 in central
        found it.unimi.dsi#fastutil;7.0.12 in central
        found org.projectlombok#lombok;1.16.8 in central
        found org.slf4j#slf4j-api;1.7.21 in central
        found com.navigamez#greex;1.0 in central
        found dk.brics.automaton#automaton;1.11-8 in central
        found org.json4s#json4s-ext_2.12;3.5.3 in central
        found joda-time#joda-time;2.9.5 in central
        found org.joda#joda-convert;1.8.1 in central
        found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 in central
        found net.sf.trove4j#trove4j;3.0.3 in central
:: resolution report :: resolve 644ms :: artifacts dl 16ms
        (the same 21 modules then listed again as "modules in use" from central in [default])
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.7.3 (default, Jan 22 2021 20:04:44)

Expected Behavior

Fairly good and quick translation of the French text.

Current Behavior

Translation time seems to grow as O(N^2) or so with the length of the string! (this is very weird!)

But also, sometimes they just HANG FOREVER CHEWING CPU and require a manual kill.

And then also I got the exception listed above on one of my strings.

Possible Solution
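The index in `indices[512] = 512 is not in [0, 512)` suggests the exported Marian graph only has 512 slots in its position-embedding table, so any input that tokenizes to 512 or more pieces overflows it. One possible client-side stopgap is to chunk long text before calling `annotate`. This is a sketch: `chunk_text` and the 1000-character budget are my own rough proxy for staying under 512 sentence pieces, not anything documented by Spark NLP.

```python
import re

def chunk_text(text, max_chars=1000):
    """Split text into chunks no longer than max_chars, preferring
    sentence boundaries, so the translator never sees more positions
    than the model graph was exported with."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    for s in sentences:
        # Flush, then hard-split any single sentence that is itself too long.
        if len(s) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            while len(s) > max_chars:
                chunks.append(s[:max_chars])
                s = s[max_chars:]
        if len(current) + len(s) + 1 > max_chars:
            if current:
                chunks.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be translated separately and the results joined, at the cost of losing some cross-sentence context.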

Steps to Reproduce

I have a simple client-server setup for testing. I run lots of text through it. I apologize for my horrible style, etc.

""" sentiment_service.py

TO RUN:

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3

Then cut and paste the following:

import sentiment_service
sentiment_service.run_api(spark)

Possible returns:
    toxicornot: toxic, severe_toxic, identity_hate, insult, obscene, threat
    sentiment: positive, negative, neutral
    langid: string of langid
    translate: translated string

Note: WHY DOES THIS FREEZE OUR TRANSLATOR? "CVE-2019-19781\n... ............."

"""


#
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline 
from sparknlp.annotator import MultiClassifierDLModel, UniversalSentenceEncoder
from pyspark.ml import Pipeline
from flask import Flask
from flask import request
import json 

localport = 18082

import emoji
banned_chars = "▮◂▸│━↴●ᑌᘚᗩяɑᕮƨD"
def remove_emojis(text):
    "remove all emojis from a string"
    clean_text = emoji.get_emoji_regexp().sub("",text)  
    for char in banned_chars:
        clean_text=clean_text.replace(char,"")
    return clean_text

def print_annotations(annotations):

    for a in annotations[0]["category"]:
        print(a.result)

def get_annotations(annotations):
    ret = []
    for a in annotations[0]["category"]:
        ret+=[a.result]
    return ret 

toxicpipeline = None
sentimentpipeline = None 
langidpipeline = None
translatepipeline = {}

def get_pipelines(spark):
    global toxicpipeline
    global sentimentpipeline
    global langidpipeline
    global translatepipeline 

    document = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    use = UniversalSentenceEncoder.pretrained() \
        .setInputCols(["document"])\
        .setOutputCol("use_embeddings")

    docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_toxic") \
      .setInputCols(["use_embeddings"])\
      .setOutputCol("category")\
      .setThreshold(0.5)

    pipeline = Pipeline(
        stages = [
            document,
            use,
            docClassifier
        ])

    lightPipeline = LightPipeline(pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
    toxicpipeline=lightPipeline 
    sentimentpipeline = PretrainedPipeline('analyze_sentimentdl_use_twitter', lang = 'en')   

    #our language ID pipeline 
    langidpipeline = PretrainedPipeline("detect_language_375", lang = "xx")  

    #our translation pipeline
    for langid in ["ru","zh","ar","fr","ko", "es"]:
        translatepipeline[langid] = PretrainedPipeline("translate_%s_en"%langid, lang = "xx") 

    return lightPipeline

api = Flask(__name__)

@api.route('/toxicornot', methods=['POST'])
def get_toxic():    
    data = request.json.get("data")
    print("toxicornot data: %s"%data)
    annotator = toxicpipeline.fullAnnotate([data])
    try:        
        result = get_annotations(annotator)
    except IndexError:
        result = ""
        print("*****Exception!")
    print("Toxicornot Result: %s"%result)
    return json.dumps(result)

@api.route('/sentiment', methods=['POST'])
def get_sentiment():
    data = request.json.get("data")
    print("Sentiment data: %s"%data)
    try:        
        annotator = sentimentpipeline.fullAnnotate([data])          
        result = annotator[0]["sentiment"][0].result
    except IndexError:
        print("********Exception!")
        #did not get any results (probably empty string?)
        result = "" 
    print("Sentiment Result: %s"%result)
    return json.dumps(result)    

@api.route('/langid', methods=['POST'])
def get_langid():
    data = request.json.get("data")
    data = remove_emojis(data)
    print("Langid data: %s"%data)
    try:        
        result = langidpipeline.annotate(data)["language"]
    except IndexError:
        #did not get any results (probably empty string?)
        result = "" 
        print("********Exception!")
    print("Langid Result: %s"%result)
    return json.dumps(result)    

@api.route('/translate', methods=['POST'])
def get_translate():
    data = request.json.get("data")
    #emojis do bad bad things
    data = remove_emojis(data)

    if not data:
        return json.dumps("")

    langid = request.json.get("langid")

    #right now we maintain this little bad-chars list - which is terrible
    if langid in ["ko", "zh"]:
        data = data.replace(".","")

    #TODO return error if langid not supported     
    print("Translate [%s] data: %s"%(langid,data))
    try:
        result = "" 
        # annotate() returns a dict of lists, so join the translated pieces
        result += "".join(translatepipeline[langid].annotate(data)["translation"])
    except IndexError:
        #did not get any results (probably empty string?)
        result = "" 
        print("********Exception!")
    print("Translate Result: %s"%result)
    return json.dumps(result)    

def run_api(spark):
    if not toxicpipeline:
        get_pipelines(spark)
    print("Starting Flask Server on Port %d"%localport)
    api.run(host="0.0.0.0", port=localport)    

if __name__ == "__main__":
    # normally run from inside pyspark (see docstring); build a session if run directly
    from pyspark.sql import SparkSession
    run_api(SparkSession.builder.getOrCreate())


"""
ToxicOrNot Test Client
"""

import requests

host = "localhost"
port = 18082

supported_translations = ["ru","zh","ar","fr","ko", "es"]

def get_toxicity(data):    
    response = requests.post("http://%s:%d/toxicornot"%(host,port),json={"data": data })
    r = response.json()
    retstring = ",".join(r)
    return retstring 

def get_sentiment(data):
    response = requests.post("http://%s:%d/sentiment"%(host,port),json={"data": data })
    r = response.json()
    retstring = r
    print("Sentiment: %s"%retstring)
    return retstring 

def get_langid(data):
    response = requests.post("http://%s:%d/langid"%(host,port),json={"data": data })
    r = response.json()
    if len(r):
        retstring = r[0]
    else:
        retstring = "Unknown"
    #print("Langid: %s"%retstring)
    return retstring     

def get_translation(data, langid):
    response = requests.post("http://%s:%d/translate"%(host,port),json={"data": data, "langid": langid })
    r = response.json()
    if len(r):
        retstring = r[0]
    else:
        retstring = ""
    #print("Translation: %s"%retstring)
    return retstring     

def main():
    data = "I will eat you alive!"
    print("asking about: %s"%data)
    response = get_toxicity(data)
    print("Got back: %s"%response)

    data = "这是个好日子。"
    print("IDing and Translating %s"%data)
    response = get_langid(data)
    print("LangId: %s"%response)
    langid = response
    response = get_translation(data, langid)
    print("Got translation: %s"%(response))
    return 

if __name__ == "__main__":
    main()

I'm unsure why the string "Main Account: @Zorto_ | Since 2009. | Producer | Programmer | zorto@mail.com Xbox: infirmary" caused the error.



daveaitel commented 2 years ago

Here is another crash, translating an empty string from German to English using Marian (which is weird, because I thought I had filtered empty strings):

ret = translatepipeline[langid].annotate(data)["translation"]

  File "/home/dave_aitel/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 181, in annotate
    return pipeline.annotate(target)
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/sparknlp/base.py", line 165, in annotate
    annotations = self._lightPipeline.annotateJava(target)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o1482.annotateJava.
: java.util.NoSuchElementException: next on empty iterator
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
        at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
        at scala.collection.IterableLike.head(IterableLike.scala:109)
        at scala.collection.IterableLike.head$(IterableLike.scala:108)
        at scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:38)
        at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
        at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
        at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:38)
        at com.johnsnowlabs.ml.tensorflow.TensorflowMarian.generateSeq2Seq(TensorflowMarian.scala:247)
        at com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer.$anonfun$batchAnnotate$3(MarianTransformer.scala:289)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike.map(TraversableLike.scala:238)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
        at scala.collection.immutable.List.map(List.scala:298)
        at com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer.batchAnnotate(MarianTransformer.scala:283)
        at com.johnsnowlabs.nlp.LightPipeline.$anonfun$fullAnnotate$1(LightPipeline.scala:49)
        at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
        at com.johnsnowlabs.nlp.LightPipeline.fullAnnotate(LightPipeline.scala:37)
        at com.johnsnowlabs.nlp.LightPipeline.annotate(LightPipeline.scala:113)
        at com.johnsnowlabs.nlp.LightPipeline.annotateJava(LightPipeline.scala:129)
        at jdk.internal.reflect.GeneratedMethodAccessor101.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:829)

maziyarpanahi commented 2 years ago

Thank you for reporting. For the first issue, we are trying to reproduce and see what can cause this in the input and how to avoid it.

at com.johnsnowlabs.ml.tensorflow.TensorflowMarian.generateSeq2Seq(TensorflowMarian.scala:247)

Your second issue is different though. It seems the pipeline was not saved with a default langId and it cannot find it in the input.

Could you please tell me exactly the name of the pipeline that resulted in this error so I can check?

daveaitel commented 2 years ago
127.0.0.1 - - [09/Nov/2021 04:48:11] "POST /translate HTTP/1.1" 200 -
Translate [da] data:
[2021-11-09 04:48:11,282] ERROR in app: Exception on /translate [POST]

from


    #our translation pipeline

    for langid in supported_translations:
        translatepipeline[langid] = PretrainedPipeline("translate_%s_en"%langid, lang = "xx") 

:)

Thanks!

daveaitel commented 2 years ago

(I'm doing it this way because the multi-lang marian just doesn't work as far as I can tell)

maziyarpanahi commented 2 years ago

OK, I'll test all these pipelines to make sure that not having a default langId in the MarianTransformer doesn't result in an exception: "ru", "zh", "ar", "fr", "ko", "es".

daveaitel commented 2 years ago

Don't forget "da" !


daveaitel commented 2 years ago

Ok, I replicated it cleanly by translating a single space under the Italian to English pipeline.
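For what it's worth, the `if not data:` guard in the service only rejects the empty string, so a single space (or any whitespace-only input) still reaches Marian. A stricter check is a one-liner; this is a sketch, and `is_translatable` is my own hypothetical helper, not part of Spark NLP:

```python
def is_translatable(data):
    """Guard for the /translate endpoint: the existing `if not data:`
    check lets a single space through to Marian, which then dies with
    'next on empty iterator'.  Trim before testing."""
    return bool(data) and bool(data.strip())
```

In `get_translate`, replacing `if not data:` with `if not is_translatable(data):` should keep whitespace-only inputs away from the pipeline until a fix lands.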

daveaitel commented 2 years ago

This string gets a langid of Italian, then freezes the Italian translation pipeline, and eventually generates the first traceback: "RT RadioRedAzione\nBASTA CON I SOLITI CAPOANNI\n\nINIE è BELLO Capodanno"

maziyarpanahi commented 2 years ago

Hi @daveaitel, I have found the issue and it will be fixed in the next release. (This covers your second issue; for the first issue, the org.tensorflow.exceptions.TFInvalidArgumentException, @wolliq is working on it.)

daveaitel commented 2 years ago

That's great news! If there's any easy way for me to update to the fixed version so I can help you test please let me know!

I have a clean replication for the first traceback in Arabic if @wolliq needs it - Translate [ar] data: [متوفر كميه من كابل شحن وداتا معدن مصنوع من مادة الالومنيوم قوي جدا و مش بيتقطع وسريع الشحن وهيعيش معاك طول العمر بيدعم الشحن السريع fast charge بيدعم نقل الداتا بسرعه مصنوع منخامه قوية جدا ضد القطع او الشد تماما شكل شيك جدا جدا ]

daveaitel commented 2 years ago

Also I'd love to know if there are any workarounds I can apply - I will be happy to do so!
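Regarding the hangs specifically: I don't know of a way to cancel a running py4j call from the Python side, so a timeout can only abandon the wait, not stop the JVM work. With that caveat, here is a sketch of a watchdog wrapper; the names are my own and nothing here is Spark NLP API:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# A small pool: a hung annotate() call keeps its worker thread busy
# forever, so give yourself a few spares.
_executor = ThreadPoolExecutor(max_workers=4)

def annotate_with_timeout(pipeline, text, seconds=30):
    """Run pipeline.annotate(text), giving up after `seconds`.
    NOTE: the underlying JVM computation is NOT cancelled; this only
    stops the Flask worker from blocking forever on a stuck translate."""
    future = _executor.submit(pipeline.annotate, text)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        return None
```

The endpoint can then treat `None` as "translation failed" and return an empty result instead of hanging the request.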

daveaitel commented 2 years ago

(Note that my brackets are NOT part of the replication string; they come from the print format below:)


    #TODO return error if langid not supported     
    print("Translate [%s] data: [%s]"%(langid,data))
    try:
        result = "" 
        ret = translatepipeline[langid].annotate(data)["translation"]
        print("Ret = %s"%repr(ret))
        result += "".join(ret)
    except IndexError:
        #did not get any results (probably empty string?)
        result = "" 
        print("********Exception!")
    print("Translate Result: %s"%result)

daveaitel commented 2 years ago

FWIW I can cause the first issue with lots of different languages (ZH included). These threads spin the CPU for a long time before they cause that exception.

maziyarpanahi commented 2 years ago

Thanks @daveaitel

@wolliq will look into this, and once we have a candidate next week we'll ask you to test it for us before we release it.

daveaitel commented 2 years ago

If anyone ever figures out how to get the multi-translator working that would be GREAT news to my memory usage , btw :)

maziyarpanahi commented 2 years ago

I'll see if we can revive opus-mt-mul-en models in Spark NLP as well in the next release.

maziyarpanahi commented 2 years ago

Fixed in Spark NLP 3.3.3 release