allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Training scispacy pipelines requires recreating the vocab file #440

Open Hammad-NobleAI opened 2 years ago

Hammad-NobleAI commented 2 years ago

I'm attempting to use your "en_core_sci_lg" pipeline to extract chemical entities from documents, and then use those entities as a basis to train spaCy's entity linker (as shown in this document). Here are the relevant portions of my code:

import random

import spacy
import scispacy
from spacy.kb import KnowledgeBase
from spacy.util import minibatch, compounding

nlp = spacy.load("en_core_sci_lg")

# ... prepare training data as spaCy specifies, i.e. a list TRAIN_DOCS of
# tuples (text, {"links": {(span.start, span.end): {qid: probability}}}) ...

def create_kb(vocab):
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=200)
    for qid, desc in desc_dict.items():
        desc_doc = nlp(desc)
        desc_enc = desc_doc.vector
        kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)
    return kb

entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False}, last=True)
entity_linker.set_kb(create_kb)

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):   # train only the entity_linker
    optimizer = nlp.begin_training()   ## ERROR HERE
    for itn in range(500):   # 500 iterations takes about a minute on this small dataset
        random.shuffle(TRAIN_DOCS)
        batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))   # increasing batch size
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,
                annotations,
                drop=0.2,   # dropout to prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        if itn % 50 == 0:
            print(itn, "Losses", losses)   # print the training loss
print(itn, "Losses", losses)
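For context, each entry in TRAIN_DOCS follows the format from spaCy's NEL tutorial linked above. A hypothetical example (the text, character offsets, and QID are illustrative):

```python
# One entity-linking training example in the tutorial's format:
# (text, {"links": {(char_start, char_end): {qid: probability}}})
TRAIN_DOCS = [
    (
        "Aspirin inhibits prostaglandin synthesis.",
        {"links": {(0, 7): {"Q18216": 1.0}}},
    ),
]
```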

When execution reaches the line marked ## ERROR HERE near the end of the code block, I get the following error:

RegistryError: [E893] Could not find function 'replace_tokenizer' in function registry 'callbacks'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: spacy.copy_from_base_model.v1, spacy.models_and_pipes_with_nvtx_range.v1, spacy.models_with_nvtx_range.v1

I'm running macOS 12.4 on an M1 Pro with 16 GB unified memory, with scispacy==0.5.0 and spacy==3.2.4. Are scispacy models compatible with this workflow, or is that something that hasn't been / won't be implemented? Thanks in advance!

dakinggg commented 2 years ago

Can you try adding `from scispacy.base_project_code import *` at the top of your file?

Hammad-NobleAI commented 2 years ago

Thanks for getting back to me. I tried that, and it gets past that issue now, but leads to this:

File ~/.pyenv/versions/3.10.5/envs/el-demo/lib/python3.10/site-packages/spacy/language.py:1249, in Language.begin_training(self, get_examples, sgd)
   1242 def begin_training(
   1243     self,
   1244     get_examples: Optional[Callable[[], Iterable[Example]]] = None,
   1245     *,
   1246     sgd: Optional[Optimizer] = None,
   1247 ) -> Optimizer:
   1248     warnings.warn(Warnings.W089, DeprecationWarning)
-> 1249     return self.initialize(get_examples, sgd=sgd)

File ~/.pyenv/versions/3.10.5/envs/el-demo/lib/python3.10/site-packages/spacy/language.py:1286, in Language.initialize(self, get_examples, sgd)
   1284     before_init(self)
   1285 try:
-> 1286     init_vocab(
   1287         self, data=I["vocab_data"], lookups=I["lookups"], vectors=I["vectors"]
   1288     )
...
     23 if require_exists and not location.exists():
---> 24     raise ValueError(f"Can't read file: {location}")
     25 return location

ValueError: Can't read file: project_data/vocab_lg.jsonl

dakinggg commented 2 years ago

Ok, I think you are working from an outdated example, because the begin_training function is deprecated (https://spacy.io/api/language#initialize). If you want to write your own training loop, you will probably need to look deeper into how spacy does it in the train CLI. That said, you should use their config system and CLI for training as much as possible; check out the project.yml and the configs at https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson. I also think this is a question about spacy rather than scispacy, since you will likely get similar errors running your script with en_core_web_md, so further questions are probably better directed to the spacy folks. Feel free to reopen if it ends up being scispacy specific.
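For reference, a config-driven entity_linker setup looks roughly like this excerpt (illustrative only, not the exact nel_emerson config; the kb path is a placeholder):

```ini
[components.entity_linker]
factory = "entity_linker"
incl_prior = false

[initialize.components.entity_linker]

[initialize.components.entity_linker.kb_loader]
@misc = "spacy.KBFromFile.v1"
kb_path = "output/my_kb"
```

With a config like this, the KB is loaded at initialization and training runs via `python -m spacy train`, so no hand-written loop or begin_training call is needed.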

dakinggg commented 2 years ago

Edit: looks like the base spacy models don't have this issue, so it is something more specific. I think it might still be a question for the spacy folks, but first you should try using the config system and CLI.

dakinggg commented 2 years ago

If it turns out you do just need that vocab file to continue, you can probably recreate it from the en_core_sci_lg model somehow, but you can definitely also just create it the same way that we do. See the `convert-lg` command in our `project.yml`.

dakinggg commented 2 years ago

See #450 for a workaround.