JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0
org.tensorflow.exceptions.TFInvalidArgumentException: indices[0,11] = 28937 is not in [0, 21128) #14277

Open xueyuan1990 opened 1 month ago

xueyuan1990 commented 1 month ago

Current Behavior

BertEmbeddings.pretrained() loads successfully. But when I run BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext", "zh"), I get the following exception:

2024-05-24 17:23:24.614208: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Exception in thread "main" org.tensorflow.exceptions.TFInvalidArgumentException: indices[0,11] = 28937 is not in [0, 21128)
         [[{{node bert/embeddings/Gather}}]]
        at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:87)
        at org.tensorflow.Session.run(Session.java:850)
        at org.tensorflow.Session.access$300(Session.java:82)
        at org.tensorflow.Session$Runner.runHelper(Session.java:552)
        at org.tensorflow.Session$Runner.runNoInit(Session.java:499)
        at org.tensorflow.Session$Runner.run(Session.java:495)
        at com.johnsnowlabs.ml.ai.Bert.tag(Bert.scala:176)
        at com.johnsnowlabs.ml.ai.Bert.sessionWarmup(Bert.scala:77)
        at com.johnsnowlabs.ml.ai.Bert.<init>(Bert.scala:86)
        at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.setModelIfNotSet(BertEmbeddings.scala:267)
        at com.johnsnowlabs.nlp.embeddings.ReadBertDLModel.readModel(BertEmbeddings.scala:432)
        at com.johnsnowlabs.nlp.embeddings.ReadBertDLModel.readModel$(BertEmbeddings.scala:427)
        at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.readModel(BertEmbeddings.scala:492)
        at com.johnsnowlabs.nlp.embeddings.ReadBertDLModel.$anonfun$$init$$1(BertEmbeddings.scala:444)
        at com.johnsnowlabs.nlp.embeddings.ReadBertDLModel.$anonfun$$init$$1$adapted(BertEmbeddings.scala:444)
        at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50)
        at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49)
        at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61)
        at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61)
        at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38)
        at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
        at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:515)
        at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:507)
        at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:44)
        at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:41)
        at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.com$johnsnowlabs$nlp$embeddings$ReadablePretrainedBertModel$$super$pretrained(BertEmbeddings.scala:492)
        at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedBertModel.pretrained(BertEmbeddings.scala:418)
        at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedBertModel.pretrained$(BertEmbeddings.scala:417)
        at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.pretrained(BertEmbeddings.scala:492)
        at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.pretrained(BertEmbeddings.scala:492)
        at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:47)
        at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:47)
        at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.com$johnsnowlabs$nlp$embeddings$ReadablePretrainedBertModel$$super$pretrained(BertEmbeddings.scala:492)
        at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedBertModel.pretrained(BertEmbeddings.scala:415)
        at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedBertModel.pretrained$(BertEmbeddings.scala:414)
        at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.pretrained(BertEmbeddings.scala:492)
        at com.algo.recom.article_recommender.v20240511.test_spark_nlp$.main(test_spark_nlp.scala:7)
        at com.algo.recom.article_recommender.v20240511.test_spark_nlp.main(test_spark_nlp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
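
A note on reading the first line of this trace: the message says that the embedding lookup (the node bert/embeddings/Gather) received the id 28937 at position [0,11], while valid ids lie in [0, 21128). The frames Bert.sessionWarmup and Bert.tag suggest the failing input came from the warm-up run during model loading rather than from user text. The Scala sketch below only mimics that range check. It is not Spark NLP or TensorFlow code, the object and method names are hypothetical, and every input id except 28937 is made up for illustration.

```scala
// Sketch only: mimics the range check behind the Gather error.
// Not taken from Spark NLP or TensorFlow; all names here are hypothetical.
object GatherBoundsSketch {

  /** Returns (position, id) pairs whose id falls outside [0, vocabSize). */
  def outOfRange(tokenIds: Seq[Int], vocabSize: Int): Seq[(Int, Int)] =
    tokenIds.zipWithIndex.collect {
      case (id, pos) if id < 0 || id >= vocabSize => (pos, id)
    }

  def main(args: Array[String]): Unit = {
    // Upper bound reported in the exception message.
    val vocabSize = 21128
    // Assumed input: eleven placeholder ids, then the id from the exception at position 11.
    val ids = Seq.fill(11)(100) :+ 28937

    outOfRange(ids, vocabSize).foreach { case (pos, id) =>
      println(s"indices[0,$pos] = $id is not in [0, $vocabSize)")
    }
  }
}
```

Any id at or above the table size fails this way, so the check depends only on the ids fed to the model and the size of its embedding table, not on the Spark or JVM setup.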

Expected Behavior

The model downloads and loads successfully.

Steps To Reproduce

import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext","zh") 

spark-submit script:

#!/bin/bash
# Application JAR and the main class to run
jar_file="./article_recommender_spark3-2.0-SNAPSHOT.jar"
class_name="com.algo.recom.article_recommender.v20240511.test_spark_nlp"
# Quote variables and local[1] so the shell passes them through unchanged
/opt/spark3/bin/spark-submit \
--name action_sequence_123 \
--master "local[1]" \
--files /opt/spark3/conf/hive-site.xml \
--class "$class_name" \
--jars hdfs:///apps/recommend/models/jars/xueyuan/mzreader/spark-nlp-assembly-5.3.3.jar \
"$jar_file"

Spark NLP version and Apache Spark

OS: CentOS Linux release 8.4.2105
Spark version: 2.2.1
Scala version: 2.11.8
Java version: 1.8.0_144
Spark NLP: I use the Fat JAR downloaded from https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.3.3.jar

Confirming that the CPU supports the instructions (AVX2, AVX512F, FMA):

lscpu | grep -i -e AVX512F -i -e AVX2 -i -e FMA 
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 arat pku ospke

maziyarpanahi commented 1 month ago

This is an issue with the model, as I explained in the other thread. I suggest you either find another model with Chinese support, import the same model yourself with ONNX, or import another model yourself:

Import new model(s):

xueyuan1990 commented 1 month ago

OK, thanks for your help.