allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Trouble training custom NER model for en_core_sci_lg - "ValueError: Can't read file: project_data/vocab_lg.jsonl" #450

Closed Jason-B-Jiang closed 2 years ago

Jason-B-Jiang commented 2 years ago

Hello,

I have been trying to train a new NER model for the en_core_sci_lg pipeline, freezing all the other pipeline components during training. I adapted a script from Explosion (https://github.com/explosion/projects/blob/v3/pipelines/ner_demo_replace/scripts/create_config.py) to generate a config file that only enables NER for training while freezing everything else.
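For context, the config the script generates sources every component from the pretrained pipeline and freezes everything except NER during training. The relevant fragment looks roughly like this (illustrative only; the exact keys my script emits may differ slightly):

```ini
[components.ner]
source = "en_core_sci_lg"

[training]
frozen_components = ["tok2vec","tagger","attribute_ruler","lemmatizer","parser"]
```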

I could train an NER model for spaCy's en_core_web_lg pipeline using the generated config file, but I ran into this error when I used the config file for en_core_sci_lg:

```
✔ Created output directory: en_core_sci_lg_model
ℹ Saving to output directory: en_core_sci_lg_model
ℹ Using CPU

=========================== Initializing pipeline ===========================
/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_sci_lg' (0.5.0) was trained with spaCy v3.2 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
[2022-09-08 09:23:31,601] [INFO] Set up nlp object from config
[2022-09-08 09:23:31,627] [INFO] Pipeline: ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']
[2022-09-08 09:23:31,627] [INFO] Resuming training for: ['ner']
[2022-09-08 09:23:31,654] [INFO] Copying tokenizer from: en_core_sci_lg
/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/util.py:865: UserWarning: [W095] Model 'en_core_sci_lg' (0.5.0) was trained with spaCy v3.2 and may not be 100% compatible with the current version (3.4.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
[2022-09-08 09:23:47,825] [INFO] Copying vocab from: en_core_sci_lg
Traceback (most recent call last):
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/cli/_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/cli/train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/cli/train.py", line 72, in train
    nlp = init_nlp(config, use_gpu=use_gpu)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/training/initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/language.py", line 1295, in initialize
    init_vocab(
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/spacy/training/initialize.py", line 118, in init_vocab
    for attrs in lex_attrs:
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/srsly/_json_api.py", line 109, in read_jsonl
    file_path = force_path(path)
  File "/home/boognish/mambaforge/envs/microsporidia_nlp/lib/python3.9/site-packages/srsly/util.py", line 24, in force_path
    raise ValueError(f"Can't read file: {location}")
ValueError: Can't read file: project_data/vocab_lg.jsonl
```

Here are the steps to recreate this error:

1) Download the code + data files from my GitHub repo with this link: https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/Jason-B-Jiang/microsporidia_text_mining/tree/main/src/3_train_pipelines/microsp_host_relation_extraction

2) Extract the compressed archive and change directory into the folder (it should be named 'microsp_host_relation_extraction')

3) Run the following commands in the command line:

```shell
# create a new conda environment and install spacy, scispacy and en_core_sci_lg
conda create --name scispacy_test
conda activate scispacy_test
pip install spacy
pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_lg-0.5.1.tar.gz

# create the training config file for en_core_sci_lg
python3 ./generate_training_config.py en_core_sci_lg ner config_en_core_sci_lg.cfg

# train the en_core_sci_lg NER model using the given training and validation data
python3 -m spacy train ./config_en_core_sci_lg.cfg --output en_core_sci_lg_model --paths.train ./train.spacy --paths.dev ./valid.spacy
```

Thank you so much for reading this! I don't feel like I know enough about spaCy to troubleshoot this myself; I've tried Googling and reading through spaCy's docs, but still can't figure out a solution.

Cheers, Jason

dakinggg commented 2 years ago

Hi, this is caused by something I did wrong when including the vocab file, which I haven't had time to figure out properly. That being said, we should be able to work around it.

You'll need to create that file and put it in the location the error names (the project_data folder). The command to create the file is here: https://github.com/allenai/scispacy/blob/e30b8f4ce44460ee65c97250f4c368a15f8c8542/project.yml#L240. To run that command you will also need to download the frequency file; the command for that is here: https://github.com/allenai/scispacy/blob/e30b8f4ce44460ee65c97250f4c368a15f8c8542/project.yml#L164. After doing those two things you should have the missing file and be able to continue. Hopefully I'll figure out the "correct" solution at some point, but please let me know whether this works for you.
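For background, spaCy's `[initialize]` block can point `vocab_data` at a JSONL file of lexeme attributes, and the en_core_sci_lg training setup references that file at a relative path, something along these lines (illustrative, not copied verbatim from the actual config):

```ini
[initialize]
vocab_data = "project_data/vocab_lg.jsonl"
```

That is why training looks for project_data/vocab_lg.jsonl relative to your working directory.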

Side note: I notice a warning about mismatched spaCy versions. You should make sure the version of spaCy you are using is the right one for your version of scispacy: either upgrade scispacy or downgrade spaCy.

Jason-B-Jiang commented 2 years ago

Hi Daniel, thanks for the response! I tried your suggestions and the training config runs fine now (i.e., I can train without error using the generated config).

In case anyone runs into a similar problem in the future, here are instructions on how to reproduce the solution with my data.

1) Run all the instructions from my original post, setting up the conda environment and downloading my data/code files

2) Create a new folder called "project_data" in the microsp_host_relation_extraction folder

3) Install the AWS command line interface with the instructions here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

4) Download the frequency file with the AWS CLI (you can find the variable values in project.yml):

```shell
aws s3 cp s3://ai2-s2-scispacy/data/gorc_subset.freqs assets/gorc_subset.freqs --no-sign-request
```

5) Download convert_freqs.py with https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/allenai/scispacy/blob/e30b8f4ce44460ee65c97250f4c368a15f8c8542/scripts/convert_freqs.py and copy it to the microsp_host_relation_extraction folder

6) Create the missing vocab_lg.jsonl file:

```shell
python convert_freqs.py --input_path assets/gorc_subset.freqs --output_path project_data/vocab_lg.jsonl --min_word_frequency 50
```
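For anyone curious what that conversion step actually does: conceptually it just filters rare words out of the frequency file and writes one JSON record per surviving word. Here is a minimal stdlib sketch of the idea (the real convert_freqs.py may expect a different input format and emit more lexeme attributes than just "orth", so treat this as an illustration only):

```python
import json
from pathlib import Path

def freqs_to_vocab_jsonl(freqs_path, out_path, min_word_frequency=50):
    """Sketch of a frequency-file-to-vocab conversion.

    Assumes one 'count<TAB>word' pair per line in the input, and writes
    one {"orth": word} JSON object per line for each word whose count is
    at least min_word_frequency.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # e.g. create project_data/
    with open(freqs_path, encoding="utf8") as src, open(out_path, "w", encoding="utf8") as dst:
        for line in src:
            count, word = line.rstrip("\n").split("\t")
            if int(count) >= min_word_frequency:
                dst.write(json.dumps({"orth": word}) + "\n")
```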

Hopefully you can generalize this fix to your problem too.

Cheers!

dakinggg commented 2 years ago

Thanks for the detailed report!