allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Usage of entity_linker with pyspark on databricks #409

Closed: Ibrokhimsadikov closed this issue 2 years ago

Ibrokhimsadikov commented 2 years ago

I am trying to use the entity_linker with pyspark and wanted to register it as a UDF, with the following code:

import re

from bs4 import BeautifulSoup
from markdown import markdown
from pyspark.sql.types import ArrayType, StringType

# nlp_sci (the scispacy pipeline) and linker (the UMLS entity linker)
# are loaded elsewhere on the driver.

def getUmls(text):
    html = markdown(text)

    # remove code snippets and long runs of a repeated character
    html = re.sub(r'<pre>(.*?)</pre>', ' ', html)
    html = re.sub(r'<code>(.*?)</code>', ' ', html)
    html = re.sub(r'(.)\1{5,}', ' ', html)

    # extract plain text from the remaining HTML
    soup = BeautifulSoup(html, "html.parser")
    text1 = ' '.join(soup.findAll(text=True))
    text1 = " ".join(text1.split())

    # run the pipeline and collect the linked UMLS concept ids
    doc = nlp_sci(text1)
    umls = [ent_id for ent in linker(doc).ents for ent_id, score in ent._.umls_ents]
    return umls

spark.udf.register("getUmls", getUmls, ArrayType(StringType()))

I am getting the following error:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'nmslib.dist.FloatIndex' object
PicklingError: Could not serialize object: TypeError: cannot pickle 'nmslib.dist.FloatIndex' object

Is there a workaround for this issue?

dakinggg commented 2 years ago

Hi, sorry, this seems more like a pyspark/databricks question. The entity linker is indeed not serializable, and I'm not sure what a workaround might be because I am not an expert with pyspark/databricks. You may have better luck searching elsewhere for how to use a non-serializable object in a pyspark/databricks UDF.