allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Training scispacy pipelines requires recreating the vocab file #440

Open Hammad-NobleAI opened 2 years ago

Hammad-NobleAI commented 2 years ago

I'm attempting to use your "en_core_sci_lg" pipeline to extract chemical entities from documents, and then use those entities as a basis to train spaCy's entity linker (as shown in this document). Here are the relevant portions of my code:

import random

import spacy
import scispacy
from spacy.kb import KnowledgeBase
from spacy.util import minibatch, compounding

nlp = spacy.load("en_core_sci_lg")

# ... prepare training data as spaCy specifies, i.e. a list TRAIN_DOCS of
# tuples (text, {"links": {(span.start, span.end): {qid: probability}}}) ...

def create_kb(vocab):
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=200)
    for qid, desc in desc_dict.items():
        desc_doc = nlp(desc)
        desc_enc = desc_doc.vector
        kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)
    return kb

entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False}, last=True)
entity_linker.set_kb(create_kb)

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):   # train only the entity_linker
    optimizer = nlp.begin_training()   ## ERROR HERE
    for itn in range(500):   # 500 iterations takes about a minute on this small dataset
        random.shuffle(TRAIN_DOCS)
        batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))   # increasing batch size
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,
                annotations,
                drop=0.2,   # dropout to prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        if itn % 50 == 0:
            print(itn, "Losses", losses)   # print the training loss
print(itn, "Losses", losses)
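For context, each entry in TRAIN_DOCS follows the format from spaCy's NEL tutorial linked above. A hypothetical example (the text, character offsets, and QID are illustrative):

```python
# One entity-linking training example in the tutorial's format:
# (text, {"links": {(char_start, char_end): {qid: probability}}})
TRAIN_DOCS = [
    (
        "Aspirin inhibits prostaglandin synthesis.",
        {"links": {(0, 7): {"Q18216": 1.0}}},
    ),
]
```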

When execution reaches the line marked ## ERROR HERE near the end of the code block, I get the following error:

RegistryError: [E893] Could not find function 'replace_tokenizer' in function registry 'callbacks'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: spacy.copy_from_base_model.v1, spacy.models_and_pipes_with_nvtx_range.v1, spacy.models_with_nvtx_range.v1

I'm running macOS 12.4 on an M1 Pro with 16 GB unified memory, with scispacy==0.5.0 and spacy==3.2.4. Are scispacy models compatible with this workflow, or is that something that hasn't been / won't be implemented? Thanks in advance!

dakinggg commented 2 years ago

Can you try adding `from scispacy.base_project_code import *` at the top of your file?

Hammad-NobleAI commented 2 years ago

Thanks for getting back to me. I tried that, and it gets past that issue now, but leads to this:

File ~/.pyenv/versions/3.10.5/envs/el-demo/lib/python3.10/site-packages/spacy/language.py:1249, in Language.begin_training(self, get_examples, sgd)
   1242 def begin_training(
   1243     self,
   1244     get_examples: Optional[Callable[[], Iterable[Example]]] = None,
   1245     *,
   1246     sgd: Optional[Optimizer] = None,
   1247 ) -> Optimizer:
   1248     warnings.warn(Warnings.W089, DeprecationWarning)
-> 1249     return self.initialize(get_examples, sgd=sgd)

File ~/.pyenv/versions/3.10.5/envs/el-demo/lib/python3.10/site-packages/spacy/language.py:1286, in Language.initialize(self, get_examples, sgd)
   1284     before_init(self)
   1285 try:
-> 1286     init_vocab(
   1287         self, data=I["vocab_data"], lookups=I["lookups"], vectors=I["vectors"]
   1288     )
...
     23 if require_exists and not location.exists():
---> 24     raise ValueError(f"Can't read file: {location}")
     25 return location

ValueError: Can't read file: project_data/vocab_lg.jsonl

dakinggg commented 2 years ago

Ok, I think you are working from an outdated example, because the begin_training function is deprecated (https://spacy.io/api/language#initialize). If you want to write your own training loop, you will probably need to look deeper into how spacy does it in the train CLI. That said, you should use their config system and CLI for training as much as possible; check out the project.yml and the configs at https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson. I also think this is a question about spacy rather than scispacy, since you will likely get similar errors running your script with en_core_web_md, so further questions are probably better directed to the spacy folks. Feel free to reopen if it ends up being scispacy specific.
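For reference, a config-driven entity_linker setup looks roughly like this excerpt (illustrative only, not the exact nel_emerson config; the kb path is a placeholder):

```ini
[components.entity_linker]
factory = "entity_linker"
incl_prior = false

[initialize.components.entity_linker]

[initialize.components.entity_linker.kb_loader]
@misc = "spacy.KBFromFile.v1"
kb_path = "output/my_kb"
```

With a config like this, the KB is loaded at initialization and training runs via `python -m spacy train`, so no hand-written loop or begin_training call is needed.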

dakinggg commented 2 years ago

Edit: looks like the base spacy models don't have this issue, so it is something more specific. I think it might still be a question for the spacy folks, but first you should try using the config system and CLI.

dakinggg commented 2 years ago

If it turns out you do just need that vocab file to continue, you can probably recreate it from the en_core_sci_lg model somehow, but you can definitely also just create it the same way that we do. See the `convert-lg` command in our `project.yml`.

dakinggg commented 2 years ago

See #450 for a workaround.