Closed daveaitel closed 2 years ago
Here is another crash, from translating an empty string from German to English using Marian (which is weird, because I thought I filtered out empty strings):
    ret = translatepipeline[langid].annotate(data)["translation"]
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/sparknlp/pretrained.py", line 181, in annotate
    return pipeline.annotate(target)
  File "/home/dave_aitel/.local/lib/python3.7/site-packages/sparknlp/base.py", line 165, in annotate
    annotations = self._lightPipeline.annotateJava(target)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o1482.annotateJava.
: java.util.NoSuchElementException: next on empty iterator
    at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
    at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
    at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
    at scala.collection.IterableLike.head(IterableLike.scala:109)
    at scala.collection.IterableLike.head$(IterableLike.scala:108)
    at scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:38)
    at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
    at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
    at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:38)
    at com.johnsnowlabs.ml.tensorflow.TensorflowMarian.generateSeq2Seq(TensorflowMarian.scala:247)
    at com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer.$anonfun$batchAnnotate$3(MarianTransformer.scala:289)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.immutable.List.map(List.scala:298)
    at com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer.batchAnnotate(MarianTransformer.scala:283)
    at com.johnsnowlabs.nlp.LightPipeline.$anonfun$fullAnnotate$1(LightPipeline.scala:49)
    at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
    at com.johnsnowlabs.nlp.LightPipeline.fullAnnotate(LightPipeline.scala:37)
    at com.johnsnowlabs.nlp.LightPipeline.annotate(LightPipeline.scala:113)
    at com.johnsnowlabs.nlp.LightPipeline.annotateJava(LightPipeline.scala:129)
    at jdk.internal.reflect.GeneratedMethodAccessor101.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:829)
Thank you for reporting. For the first issue, we are trying to reproduce and see what can cause this in the input and how to avoid it.
at com.johnsnowlabs.ml.tensorflow.TensorflowMarian.generateSeq2Seq(TensorflowMarian.scala:247)
Your second issue is different though. It seems the pipeline was not saved with a default langId and it cannot find it in the input.
Could you please tell me exactly the name of the pipeline that resulted in this error so I can check?
127.0.0.1 - - [09/Nov/2021 04:48:11] "POST /translate HTTP/1.1" 200 -
Translate [da] data:
[2021-11-09 04:48:11,282] ERROR in app: Exception on /translate [POST]
from
#our translation pipeline
for langid in supported_translations:
    translatepipeline[langid] = PretrainedPipeline("translate_%s_en"%langid, lang = "xx")
:)
Thanks!
(I'm doing it this way because the multi-lang marian just doesn't work as far as I can tell)
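Loading one pipeline per language up front, as the loop above does, keeps every model resident at once. A minimal sketch of loading each pipeline on first use instead, keeping only the most recently used ones cached. Here `make_pipeline_getter`, `loader`, and `maxsize` are hypothetical names for illustration, and whether dropping a pipeline from the cache actually releases its memory on the JVM side is an assumption that has not been verified against Spark NLP:

```python
from functools import lru_cache


def make_pipeline_getter(loader, maxsize=4):
    """Return a function that loads a translation pipeline on first use.

    `loader` is a hypothetical callable that takes a pipeline name. With
    Spark NLP it could be: lambda name: PretrainedPipeline(name, lang="xx")
    """

    @lru_cache(maxsize=maxsize)
    def get_pipeline(langid):
        # Same naming scheme as the loop above: translate_<langid>_en
        return loader("translate_%s_en" % langid)

    return get_pipeline
```

With this, `get_pipeline("fr")` loads `translate_fr_en` only the first time French text arrives, and later calls reuse the cached object until it is evicted.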
OK, I'll test all these pipelines to make sure not having a default langId in the MarianTransformer is not resulting in an exception.
"ru","zh","ar","fr","ko", "es"
Don't forget "da" !
On Tue, Nov 9, 2021, 12:55 PM Maziyar Panahi @.***> wrote:
OK, I'll test all these pipelines to make sure not having a default langId in the MarianTransformer is not resulting in an exception. "ru","zh","ar","fr","ko", "es"
Ok, I replicated it cleanly by translating a single space under the Italian to English pipeline.
This string is detected as Italian, freezes the Italian translation pipeline, and eventually generates the first traceback: "RT RadioRedAzione\nBASTA CON I SOLITI CAPOANNI\n\nINIE è BELLO Capodanno"
Hi @daveaitel
I have found the issue and it will be fixed in the next release. (That is your second issue; for the first issue, with org.tensorflow.exceptions.TFInvalidArgumentException, @wolliq is working on it.)
That's great news! If there's any easy way for me to update to the fixed version so I can help you test please let me know!
I have a clean replication for the first traceback in Arabic if @wolliq needs it - Translate [ar] data: [متوفر كميه من كابل شحن وداتا معدن مصنوع من مادة الالومنيوم قوي جدا و مش بيتقطع وسريع الشحن وهيعيش معاك طول العمر بيدعم الشحن السريع fast charge بيدعم نقل الداتا بسرعه مصنوع منخامه قوية جدا ضد القطع او الشد تماما شكل شيك جدا جدا ]
Also I'd love to know if there are any workarounds I can apply - I will be happy to do so!
Note that my brackets are NOT part of the replication string:
#TODO return error if langid not supported
print("Translate [%s] data: [%s]"%(langid,data))
try:
    result = ""
    ret = translatepipeline[langid].annotate(data)["translation"]
    print("Ret = %s"%repr(ret))
    result += "".join(ret)
except IndexError:
    #did not get any results (probably empty string?)
    result = ""
    print("********Exception!")
print("Translate Result: %s"%result)
FWIW I can cause the first issue with lots of different languages (ZH included). These threads spin the CPU for a long time before they cause that exception.
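Since a single space was enough to reproduce the `NoSuchElementException`, one possible stopgap is to skip blank input before it reaches the pipeline. The traceback also shows the failure arriving as a `Py4JJavaError`, which `except IndexError` would not catch. A minimal sketch under those assumptions: `safe_translate` is a hypothetical helper, not part of Spark NLP, and it only relies on the `annotate(...)["translation"]` call already used above:

```python
def safe_translate(pipeline, text):
    """Translate `text`, returning "" for blank input or a failed call."""
    # A single space reproduced the crash in this thread, so skip anything
    # that is empty after stripping whitespace. This is an assumed guard,
    # not the upstream fix.
    if text is None or not text.strip():
        return ""
    try:
        ret = pipeline.annotate(text)["translation"]
    except Exception as exc:
        # Py4JJavaError derives from Exception, not IndexError, so a
        # narrower `except IndexError` would let it propagate.
        print("Translation failed: %r" % (exc,))
        return ""
    return "".join(ret)
```

It would be called as `safe_translate(translatepipeline[langid], data)`. It does nothing for the inputs that spin the CPU, because the call has to return or raise before the guard can react.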
Thanks @daveaitel
@wolliq will look into this and once we both have a candidate next week we can ask you to test it for us before we release it
If anyone ever figures out how to get the multi-translator working, that would be GREAT news for my memory usage, btw :)
I'll see if we can revive opus-mt-mul-en models in Spark NLP as well in the next release.
Fixed in Spark NLP 3.3.3 release
An exception results from trying to translate a French string through Spark NLP and the Marian models.
Description
From a pyspark session:
dave_aitel@minecraft-1:/minecraft/CODE/DARPA$ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/minecraft/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
:: loading settings :: url = jar:file:/minecraft/spark-3.1.2-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/dave_aitel/.ivy2/cache
The jars for the packages stored in: /home/dave_aitel/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f85b28b3-0bbb-4ddd-88f8-a7c84b47bed3;1.0
    confs: [default]
    found com.johnsnowlabs.nlp#spark-nlp_2.12;3.2.3 in central
    found com.typesafe#config;1.4.1 in central
    found org.rocksdb#rocksdbjni;6.5.3 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.603 in central
    found com.github.universal-automata#liblevenshtein;3.0.0 in central
    found com.google.code.findbugs#annotations;3.0.1 in central
    found net.jcip#jcip-annotations;1.0 in central
    found com.google.code.findbugs#jsr305;3.0.1 in central
    found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
    found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
    found com.google.code.gson#gson;2.3 in central
    found it.unimi.dsi#fastutil;7.0.12 in central
    found org.projectlombok#lombok;1.16.8 in central
    found org.slf4j#slf4j-api;1.7.21 in central
    found com.navigamez#greex;1.0 in central
    found dk.brics.automaton#automaton;1.11-8 in central
    found org.json4s#json4s-ext_2.12;3.5.3 in central
    found joda-time#joda-time;2.9.5 in central
    found org.joda#joda-convert;1.8.1 in central
    found com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 in central
    found net.sf.trove4j#trove4j;3.0.3 in central
:: resolution report :: resolve 644ms :: artifacts dl 16ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.603 from central in [default]
    com.github.universal-automata#liblevenshtein;3.0.0 from central in [default]
    com.google.code.findbugs#annotations;3.0.1 from central in [default]
    com.google.code.findbugs#jsr305;3.0.1 from central in [default]
    com.google.code.gson#gson;2.3 from central in [default]
    com.google.protobuf#protobuf-java;3.0.0-beta-3 from central in [default]
    com.google.protobuf#protobuf-java-util;3.0.0-beta-3 from central in [default]
    com.johnsnowlabs.nlp#spark-nlp_2.12;3.2.3 from central in [default]
    com.johnsnowlabs.nlp#tensorflow-cpu_2.12;0.3.3 from central in [default]
    com.navigamez#greex;1.0 from central in [default]
    com.typesafe#config;1.4.1 from central in [default]
    dk.brics.automaton#automaton;1.11-8 from central in [default]
    it.unimi.dsi#fastutil;7.0.12 from central in [default]
    joda-time#joda-time;2.9.5 from central in [default]
    net.jcip#jcip-annotations;1.0 from central in [default]
    net.sf.trove4j#trove4j;3.0.3 from central in [default]
    org.joda#joda-convert;1.8.1 from central in [default]
    org.json4s#json4s-ext_2.12;3.5.3 from central in [default]
    org.projectlombok#lombok;1.16.8 from central in [default]
    org.rocksdb#rocksdbjni;6.5.3 from central in [default]
    org.slf4j#slf4j-api;1.7.21 from central in [default]
Welcome to Spark version 3.1.2
Using Python version 3.7.3 (default, Jan 22 2021 20:04:44)
Expected Behavior
Fairly good and quick translation of the French text.
Current Behavior
Translations seem to take a time that is O(N^2) or something with the length of the string! (this is very weird!)
But also, sometimes they just HANG FOREVER CHEWING CPU and require a manual kill.
And then also I got the exception listed above on one of my strings.
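The O(N^2) impression could be checked by timing the translation call on inputs of increasing length and fitting a slope on a log-log scale; a slope near 2 would support quadratic growth. A rough sketch, where `time_by_length` and `growth_exponent` are hypothetical helpers and `translate` stands for any callable that takes a string, such as a wrapper around the pipeline's `annotate`:

```python
import math
import time


def time_by_length(translate, base_text, lengths, clock=time.perf_counter):
    """Return (length, seconds) pairs for inputs cut from repeated base_text."""
    if not base_text:
        raise ValueError("base_text must not be empty")
    samples = []
    for n in lengths:
        # Repeat the base text until it is long enough, then cut to n chars.
        text = (base_text * (n // len(base_text) + 1))[:n]
        start = clock()
        translate(text)
        samples.append((n, clock() - start))
    return samples


def growth_exponent(samples):
    """Least-squares slope of log(seconds) against log(length)."""
    # Zero or negative timings cannot be put on a log scale, so drop them.
    points = [(math.log(n), math.log(t)) for n, t in samples if n > 0 and t > 0]
    if len(points) < 2:
        raise ValueError("need at least two usable samples")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    if den == 0:
        raise ValueError("need at least two distinct lengths")
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return num / den
```

Single runs are noisy, so repeating each length a few times and keeping the minimum would likely give a steadier estimate. Inputs that hang would also need to be excluded, since the timing call never returns for them.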
Possible Solution
Steps to Reproduce
I have a simple client-server setup for testing. I run lots of text through it. I apologize for my horrible style, etc.
""" sentiment_service.py
TO RUN:
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3
Then cut and paste the following:
import sentiment_service
sentiment_service.run_api(spark)

Possible returns:
toxicornot: toxic, severe_toxic, identity_hate, insult, obscene, threat
sentiment: positive, negative, neutral
langid: string of langid
translate: translated string
Note: WHY DOES THIS FREEZE OUR TRANSLATOR? "CVE-2019-19781\n... ............."
"""
I'm unsure why the string: Main Account: @Zorto_ | Since 2009. | Producer | Programmer | zorto@mail.com Xbox: infirmary caused the error.
Context
Your Environment
Spark NLP version (sparknlp.version()):
Apache Spark version (spark.version):
Java version (java -version):
java --version
openjdk 11.0.12 2021-07-20
OpenJDK Runtime Environment (build 11.0.12+7-post-Debian-2deb10u1)
OpenJDK 64-Bit Server VM (build 11.0.12+7-post-Debian-2deb10u1, mixed mode, sharing)
Setup and installation (Pypi, Conda, Maven, etc.): Just using PIP3 for my packages.
Operating System and version:
$ cat /etc/issue Debian GNU/Linux 10 \n \l
Link to your project (if any):