Closed clabornd closed 1 year ago
Hi @clabornd
Could you please update your Spark NLP to spark-nlp==4.4.3? We have introduced optimizations for both speed and memory with some code enhancements/bug-fixes:
https://colab.research.google.com/drive/1KucyhiPBc5Eivkiyx94_VFaba8K1bvoV?usp=sharing
Thanks for the fast response. I tried upgrading to spark-nlp==4.4.3, but the issue persists. I'm running this on Databricks, if that's relevant. I tested with runtimes 13.0 and 12.2, and also with a single-node machine, to no avail; the error looks to be the same.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 27) (10.139.64.6 executor driver): org.tensorflow.exceptions.TFInvalidArgumentException: indices[1024] = 1026 is not in [0, 1026)
[[{{function_node __inference_encoder_serving_912071}}{{node encoder/embed_positions/embedding_lookup}}]]
at org.tensorflow.internal.c_api.AbstractTF_Status.throwExceptionIfNotOK(AbstractTF_Status.java:87)
...
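For what it's worth, the numbers in the error line up with how BART's learned position embeddings work: the lookup table has 1026 rows (max_position_embeddings = 1024 plus an offset of 2, an assumption based on the fairseq/Hugging Face BART implementations), and each token index is shifted by that offset before the lookup. A token at index 1024 (the 1025th token) therefore asks for row 1026, which is out of range. A small sketch of the arithmetic:

```python
# Sketch of why the 1025th token overflows BART's position-embedding table.
# Assumption: BART shifts position ids by an offset of 2 (as in the
# fairseq/Hugging Face implementations), so the table has 1024 + 2 rows.
MAX_POSITIONS = 1024
OFFSET = 2
TABLE_ROWS = MAX_POSITIONS + OFFSET  # 1026

def position_row(token_index: int) -> int:
    """Row of the position-embedding table looked up for a given token index."""
    row = token_index + OFFSET
    if not 0 <= row < TABLE_ROWS:
        raise IndexError(f"indices[{token_index}] = {row} is not in [0, {TABLE_ROWS})")
    return row

print(position_row(1023))  # 1025 -- last valid token index
# position_row(1024) raises: indices[1024] = 1026 is not in [0, 1026)
```

This reproduces the exact message shape in the TensorFlow error above.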
You are welcome. I cannot reproduce this issue (on any platform). It seems you only updated the PyPI package, which contains just the Python APIs; the actual logic of the library is in the Maven package. Could you please follow these instructions and make sure the spark-nlp Maven dependency is also 4.4.3?
https://github.com/JohnSnowLabs/spark-nlp#databricks-cluster
You can also share a screenshot of the Library tab in your cluster configuration in case everything is 4.4.3 and it's still not working. (Mine is 4.4.3 and it works.)
Sorry, I didn't mention that I also updated the Maven package. My Libraries tab looks like this:
I also tried uninstalling everything except the spark-nlp PyPI/Maven packages.
Cluster config, in case it's useful:
{
  "autoscale": {
    "min_workers": 1,
    "max_workers": 8
  },
  "cluster_name": "memory-odbc",
  "spark_version": "13.0.x-scala2.12",
  "spark_conf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.sql.broadcastTimeout": "40000",
    "spark.databricks.delta.preview.enabled": "true"
  },
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "ON_DEMAND_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_DS13_v2",
  "driver_node_type_id": "Standard_DS13_v2",
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {},
  "autotermination_minutes": 120,
  "enable_elastic_disk": true,
  "cluster_source": "UI",
  "init_scripts": [],
  "enable_local_disk_encryption": false,
  "runtime_engine": "STANDARD",
  "cluster_id": "0526-205224-3pve3hjz"
}
OK, I am seeing that error on the Colab notebook as well now. It should pop up if you change display() to an action: pipeline_model.transform(data).show() instead of display(...). (display() in Databricks is different from IPython.display.display.)
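The reason display() can mask the failure is Spark's lazy evaluation: transform() only builds an execution plan, and the error surfaces when an action such as .show() forces that plan to run. A rough analogy in plain Python using a generator, which likewise defers all work until it is consumed (the function names here are illustrative, not Spark APIs):

```python
def transform(rows):
    # Lazily "processes" rows; nothing executes until the result is consumed.
    for r in rows:
        if r > 1024:
            raise ValueError(f"index {r} out of range")
        yield r

plan = transform([1, 2, 5000])  # no error yet: only the plan is built
try:
    list(plan)  # the "action": forces execution and surfaces the failure
except ValueError as e:
    print(e)
```

Different display mechanisms may consume only a prefix of the data, which is why one path can appear to succeed while another fails.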
The text length that triggers this is 1025 tokens, and many BART variants have a max context length of 1024. Is this not just an issue with the max context length?
> The text length that triggers this is 1025 tokens, and many BART variants have a max context length of 1024. Is this not just an issue with the max context length?

I see now. This is actually a bug: we should truncate anything longer than 1024 tokens internally, since there is no setMaxInputLength to throw an error to users like we do with BERT (the limit is 1024). I thought we were doing that internally. This is a bug and will be fixed in the next release.
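A minimal sketch of the internal truncation described above, in plain Python. This is not Spark NLP's actual implementation; the helper name, the 1024 limit, and the EOS-preserving behavior are assumptions for illustration:

```python
def truncate_to_max_length(token_ids, max_length=1024):
    """Drop tokens beyond the model's maximum context length.

    Hypothetical helper illustrating the fix described above. Keeping the
    final token assumes it is the EOS marker, which sequence-to-sequence
    models like BART typically expect at the end of the input.
    """
    if len(token_ids) <= max_length:
        return token_ids
    # Keep the first max_length - 1 tokens plus the final (EOS) token.
    return token_ids[: max_length - 1] + token_ids[-1:]

ids = list(range(1500))
print(len(truncate_to_max_length(ids)))  # 1024
```

With a guard like this, an over-long input is silently clipped to the model's limit instead of crashing the position-embedding lookup.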
Is there an existing issue for this?
What are you working on?
I am trying to summarize potentially long texts with distilbart_xsum_12_6.
Current Behavior
Currently I get an error on long texts:
(full error at the bottom)
Expected Behavior
Maybe not expected behavior, but I always assumed something was going on under the hood to handle texts longer than a particular model's max context length. Does no such mechanism exist for BART, or in general?
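Per the maintainer's replies above, no such mechanism currently exists for BART here (truncation is the planned fix). A common user-side workaround for summarizing long documents is to split the input into overlapping windows and summarize each one. A rough, hypothetical sketch in plain Python; the window/stride values are assumptions, and a real pipeline would count tokens with the model's tokenizer rather than a plain list:

```python
def chunk_tokens(tokens, window=1024, stride=896):
    """Split a long token sequence into overlapping windows of at most
    `window` tokens, advancing by `stride` so consecutive chunks share
    window - stride tokens of context."""
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the input
        start += stride
    return chunks

# Each chunk fits the model's 1024-token limit; summaries of the chunks
# can then be concatenated (or summarized again) for a final result.
chunks = chunk_tokens(list(range(2000)))
print([len(c) for c in chunks])  # [1024, 1024, 208]
```

The overlap (128 tokens with these defaults) helps the model keep context across chunk boundaries, at the cost of some duplicated computation.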
Steps To Reproduce
A rough example:
Spark NLP version and Apache Spark
spark-nlp 4.4.0, Apache Spark 3.3.2