Use individual spaCy with transformer backbones for all NER models

FrancescoCasalegno commented 3 years ago

🚀 Feature

In light of what we discussed in PR #328, and in particular looking at the results shown in this comparison table, we should operate the following changes to the NER models in data_and_models/pipelines/ner.

[x] All NER models should use a transformer backbone. Now they are using tok2vec.
[x] All NER models should initialize the weights of this backbone using the pre-trained weights of CORD19 NLI+STS v1. Also, the weights should not be frozen during the fine-tuning on the NER task.
[x] There should be one distinct NER model (= spaCy pipeline) for each entity type we support. Note that currently this is not the case, as e.g. model2 is used to extract 3 different entity types (see table here).
[x] Unlike the experiments of PR #328, the spaCy pipeline should also include the rule-based entity extraction component (Note: for the moment let's keep using add_er.py, then in the future we'll improve that with #310).
[x] All the evaluation results (token and entity based, Prec, Rec, F1) obtained before and after this PR should be collected in a table.

pafonta commented 3 years ago

Regarding this point:

Unlike the experiments of PR #328, the spaCy pipeline should also include the rule-based entity extraction component.

An existing issue is about the same: #310.

FrancescoCasalegno commented 3 years ago

As this issue is addressed, there is also the following action to be done (see also PR #345).

[ ] remove from requirements
- en_ner_bc5cdr_md
- en_ner_bionlp13cg_md
- en_ner_craft_md
- en_ner_jnlpba_md

pafonta commented 3 years ago

Hello @FrancescoCasalegno,

Before moving further with this task, I think some points should be clarified:

Are all the checkboxes in the issue description expected to be done in this issue, which implies, in one pull request? Because that is a lot of work. The current issue could be split into several incremental pieces of work. As said in https://github.com/BlueBrain/Search/issues/335#issuecomment-819339941, there was for example already an issue for one point.
Should this issue also deal with migrating the whole NER elements in BBS from using some models for several entities to using one model per entity? That is not clear at the moment which amount of extra work this task unlisted in the issue description could take (e.g. moving away from the ee_models_library.csv logic).
Should this issue also deal with building the new train / dev sets with only one entity per set? Currently, 3 out of 5 annotation sets have multiple entities instead of one.
Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new ["transformer", "ner"] pipeline like in #328? The currently used elements from scispaCy are: a custom tokenizer, a custom vocabulary, and several trained components (tagger, attribute_ruler, lemmatizer, parser, ner).
Shouldn't the issue be started only after the conclusions of some experiments, like #343 and #337? Indeed, if the results are not reproducible on GPU (#343), this will impact the current issue (e.g. should accuracy be favoured over reproducibility?, should the tracking of the NER models with DVC be stopped?). Besides, if the runtime on CPU with a transformer component (#337) has increased too much compared to before, BBS would need to be adapted (e.g. switching to inference on GPU instead of CPU).
Should we decide of a convention for the new model names? Currently they are model1, ..., model5 but this issue wants to create one model per entity. Some entity (e.g. cell compartement) have spaces in their names.
Should this issue also deal with changing the pipeline/ner/Dockerfile based on miniconda for one with GPU support? Indeed, GPU support is required for fine-tuning the transformer component.

pafonta commented 3 years ago

All points above, except the following one, are independent of implementation constraints.

For the following point, some implementation constraints could apply (tested on spaCy 3.0.5).

Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new ["transformer", "ner"] pipeline like in #328? The currently used elements from scispaCy are: a custom tokenizer, a custom vocabulary, and several trained components (tagger, attribute_ruler, lemmatizer, parser, ner).

I have investigated. Technically, this is possible to reuse from scispaCy:

the custom tokenizer,
the custom vocabulary,
the trained tagger, attribute_ruler, lemmatizer, and parser components.

The trained ner component cannot be reused but that is expected. Indeed, we use a different embedding layer type. That's it: scispaCy used a spacy.Tok2Vec while the ticket is about using a spacy-transformers.TransformerModel).

➡️ I will post on Monday a performance (entity F1) comparison between reusing or not the scispaCy elements.

pafonta commented 3 years ago

➡️ I will post on Monday a performance (entity F1) comparison between reusing or not the scispaCy elements.

TL;DR

In the context of the experiment (see Methods below), re-using scispaCy components gives lower F1 and precision but higher recall.

Results

The pipeline B is the pipeline A but with the tokenizer, the vocabulary, the tagger, the attribute_ruler, the lemmatizer, and the parser from scispaCy.

reusing scispaCy parts

Method

Performance scores are from pipelines/ner/eval.py. Reported scores are an average on two trainings.

Both pipelines use the same train, dev, and test sets and Transformer checkpoint.

Pipeline B uses en_ner_bc5cdr_md as source for the scispaCy components.

FrancescoCasalegno commented 3 years ago

1. Are all the checkboxes in the issue description expected to be done in this issue?

Let's try to do so, unless some logical steps can be isolated (see also next point).

2. Should this issue also deal with migrating the whole NER elements in BBS from using some models for several entities to using one model per entity?

Let's try to keep using ee_models_library.csv even though it will be redundant and trivial (one line per entity type). Then, let's work on a separate issue to get rid of this file which is indeed becoming obsolete → see Issue #349.

3. Should this issue also deal with building the new train / dev sets with only one entity per set?

Let's first try to implement the following points:

one train JSONL file per entity type (also test JSONL file?)
preprocess step split in 2: train_valid_split + convert_jsonl_spacy then request review, and once it's OK don't merge into master but keep working to add the remaining features. Why?: because this is the only way to bring master from a valid, complete state to another that has the same properties ("consistency").

4. Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new ["transformer", "ner"] pipeline like in #328?

Based on the results shown here and the discussion with the team, let's stick with a brand new pipeline. This is easier and also seems to provide better F1 score.

5. Shouldn't the issue be started only after the conclusions of some experiments, like #343 and #337?

Yes, let's wait for #343 and #337 to be done!

6. Should we decide of a convention for the new model names?

Convention we agreed upon is: model-<entity_type_name>—so for instance we could have model-cell_compartment.

7. Should this issue also deal with changing the pipeline/ner/Dockerfile based on miniconda for one with GPU support?

How long do the train_xxx phases take if run on CPU? I would say we should try to avoid assuming that the user has access to a GPU, even if this comes at the cost of slightly longer runtimes. Anyway, let's also wait for the results of #343 and #337.

pafonta commented 3 years ago

How long do the train_xxx phases take if run on CPU?

With the exact same settings, training on CPU (80 cores) is 4.730 times slower than training on GPU (1).

It takes 2.840 secs / step on CPU while it takes 0.6005 sec / step on GPU.

pafonta commented 3 years ago

UPDATE Create one JSONL per entity from the JSONLs containing several entities.

While working on the task (see previous line), I have noticed several issues with the annotations:

Some texts are ignored. They should then be deleted?
Some texts are ignored but have spans. They should be checked and then deleted?
Some annotation files have duplicated _input_hash. They should be checked and then only one duplicate be kept?
Some annotation files have labels in different cases. The case should be normalized?

pafonta commented 3 years ago

UPDATE Automate analysis of annotations

I have made a code (analyze.py) to analyze annotations in order to clean them up for training NER models. The code could either be used as a script or as a function. As a script, it lets analyze an individual annotation file with. As a function, it lets analyze several annotation files while retrieving the results.

The goal is to reuse the code in the future to analyze new annotations. The code is added to the PR in https://github.com/BlueBrain/Search/pull/351/commits/020d6f41bbc7db96cfcee61fb720ea9da7c1bf26. There is an extensive documentation inside the file.

Note for both sections below: Valid texts are texts which are accepted texts and which contains spans (i.e. entities).

Here is the summary table created with the code as a function:

for all the annotation files used for training and evaluation

| | total texts | ignored texts w/o spans | ignored texts w/ spans | accepted texts w/o spans | accepted texts w/ spans | accepted texts w/ spans (%) | duplicated '_input_hash' | accepted texts w/ multiple labels | labels needing normalization | |:--------------|--------------:|--------------------------:|-------------------------:|---------------------------:|--------------------------:|------------------------------:|---------------------------:|------------------------------------:|-------------------------------:| | annotations5 | 207 | 19 | 2 | 114 | 72 | 34.78 | 1 | 0 | 0 | | annotations6 | 634 | 104 | 0 | 241 | 289 | 45.58 | 4 | 24 | 0 | | annotations9 | 131 | 7 | 0 | 45 | 79 | 60.31 | 0 | 31 | 0 | | annotations14 | 383 | 7 | 13 | 57 | 304 | 79.37 | 0 | 35 | 0 | | annotations15 | 151 | 3 | 0 | 33 | 115 | 76.16 | 0 | 0 | 0 | | annotations10 | 309 | 3 | 1 | 6 | 299 | 96.76 | 0 | 252 | 7 | | annotations12 | 23 | 1 | 0 | 2 | 20 | 86.96 | 0 | 20 | 0 |

Here is an example of individual report created by the code as a script:

python analyze.py --verbose annotations5_EmmanuelleLogette_2020-06-30_raw2_Disease.jsonl

``` Analyzing annotations5_EmmanuelleLogette_2020-06-30_raw2_Disease.jsonl... total texts: 207 ignored texts w/o spans: 19 ### INFO ### This might not be expected. (only the first 10 rows are displayed) text spans 42 4 None 49 Redelman-Sidi et al. [ None 75 2006) . None 78 † None 91 55 None 94 * None 95 2017) . None 133 S0. None 136 2012; Wang et al., None 137 2016) ]. None ignored texts w/ spans: 2 ### WARNING ### There might be an issue. text spans 55 Deshalb wurden Bioindikatoren entwickelt, dere... [DISEASE] 116 nodeficiency syndrome (AIDS). [DISEASE, DISEASE] accepted texts w/o spans: 114 ### INFO ### This might not be expected. (only the first 10 rows are displayed) text spans 2 This analysis and report was approved by the I... None 3 Before their postmortem examinations, all the ... None 5 In a contemporary study, it was revealed that ... None 7 Clinical information is indispensable and micr... None 9 Although the beneficial effects of EP4 agonist... None 10 Nineteen patients (95%) had fourfold or greate... None 11 Consequently, extensive shotgun sequencing may... None 12 Pairwise distance among coronaviruses (CoVs), ... None 13 In the cranial concept, during cranial externa... None 14 The PanAd3-based vaccine was tested for induct... None accepted texts w/ spans: 72 (only the first 10 rows are displayed) text spans 0 Diarrhea can develop as a result of C. parvum ... [DISEASE, DISEASE, DISEASE] 1 Neonates with febrile pneumonia generally shou... [DISEASE] 4 6, 7 Set against the interconnectedness of onl... [DISEASE, DISEASE, DISEASE, DISEASE] 6 HIV-1, the main cause of acquired immune defic... [DISEASE, DISEASE] 8 Furthermore, rigorous selection is also effect... [DISEASE, DISEASE] 16 However, detection and response capacity of CO... [DISEASE] 26 Methods: All patients with the diagnosis of pr... [DISEASE, DISEASE] 32 Those results suggest that the enzyme plays a ... [DISEASE, DISEASE] 36 GIVA cPLA 2 has been shown to increase and med... [DISEASE, DISEASE] 37 Taken together, the observations suggest that ... [DISEASE] accepted texts w/ spans (%): 34.78 duplicated '_input_hash': 1 ### CRITICAL ### There is an issue to investigated. text meta _input_hash _task_hash tokens spans _session_id _view_id answer 21 ( {'source': 'CORD-19 - sentence 6254790'} -996919429 -247505887 [{'text': '(', 'start': 0, 'end': 1, 'id': 0}] None NaN ner_manual accept 197 ( {'source': 'CORD-19 - sentence 3673'} -996919429 -247505887 [{'text': '(', 'start': 0, 'end': 1, 'id': 0}] None NaN ner_manual ignore accepted texts w/ multiple labels: 0 Empty DataFrame Columns: [text, spans, spans_count] Index: [] labels needing normalization: 0 labels before normalization: DISEASE 142 labels after normalization: DISEASE 142 ...analyzed annotations5_EmmanuelleLogette_2020-06-30_raw2_Disease.jsonl ```

pafonta commented 3 years ago

UPDATE Analyze the annotations to define cleaning rules (for the next line).

Here is:

a description of the cases which are identified as invalid texts,
the causes that would need to be fixed for new annotations (for example with additions to #342).

Case 1 - Ignored texts (w/o and w/ spans)

Causes:

sentence selection (e.g. annotations5)
- sentences are not sentences
- sentences are not in English
annotation guidelines (e.g. annotations14)
- texts have been annotated but are ignored

Case 2 - Duplicated `_input_hash`

Potential causes:

Prodigy import/export

Observations:

the 11 texts are single character texts
not all texts are ignored (e.g. annotations5)

Case 3 - Inconsistent case for labels

Causes:

annotation session setup

Example from annotations10 (numbers are label counts):

CELL_TYPE  66
CHEMICAL 100
CONDITION  59
DISEASE 241
ORGANISM 174
PATHWAY  97
PROTEIN 228

cell_type  30
chemical  26
condition  49
disease  76
organism  25
pathway  38
protein  30

pafonta commented 3 years ago

Hello @FrancescoCasalegno,

The evaluation is crashing. After investigation, the bug was introduced by #348.

Would you know why?

The bug:

Traceback (most recent call last):
  File "eval.py", line 111, in <module>
    main()
  File "eval.py", line 107, in main
    json.dump(all_metrics_dict, f)
  File "/opt/conda/lib/python3.8/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/opt/conda/lib/python3.8/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/opt/conda/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/opt/conda/lib/python3.8/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/opt/conda/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable

To reproduce:

# For the bug introduced by #348, use 0bb500551b1b7c6f5bb9228335aa4df30a654e9c.
# For the working code __before__ #348, use b9c886966ca4d893b41457a17262e198e3ba7f03.
export COMMIT=...

git clone https://github.com/BlueBrain/Search
cd Search/

# Change <image> and <container>.
docker build -f data_and_models/pipelines/ner/Dockerfile --build-arg BBS_REVISION=$COMMIT -t <image> .
docker run -it --rm -v /raid:/raid --name <container> <image>

git checkout $COMMIT
git checkout -- data_and_models/pipelines/ner/dvc.lock

cd data_and_models/pipelines/ner/
dvc pull --with-deps evaluation@organism
dvc repro -fs evaluation@organism

FrancescoCasalegno commented 3 years ago

Hi @pafonta, https://github.com/BlueBrain/Search/pull/362 should fix it, can you please have a look?

pafonta commented 3 years ago

Hello @FrancescoCasalegno,

I have 1) merged #362 on top of #351 and 2) locally removed the fix float(v) in ner/eval.py.

The evaluation stage worked! Thank you 🎉

NB: For practical reasons, I keep float(v) in ner/eval.py for #351 until #362 is merged.

FrancescoCasalegno commented 3 years ago

@pafonta FYI #362 just merged into master 🙂

pafonta commented 3 years ago

Hello @FrancescoCasalegno,

Could you confirm your position on the following points for #351?

Should the training happens on GPU (nVidia)?
Should the inference happens on CPU (Intel) ?
What revision should be put in ner/Dockerfile? I would recommend v0.2.0a for the reason below.

Regarding point 3: From my understanding, our current handling of DVC + Docker makes mandatory to use a version tag in the Dockerfile. Indeed, we cannot know before merging the commit hash of the merge commit, or do we? Besides, we cannot release v0.2.0 yet as other elements are part of it (cf. https://github.com/BlueBrain/Search/milestone/1).

FrancescoCasalegno commented 3 years ago

Hi!

Here are my 2 cents. The following points are what I think we should keep in mind.

We should not release a new version if results are not reproducible.
We currently know (see #343) that using spacy 3.x will give non-deterministic results with any of (a) using a transformer backbone or (b) using GPU.
To fix those 2 issues we are working on the following. a. Created PR on pytorch https://github.com/pytorch/pytorch/pull/57536 + created local PR to patch torch while waiting for the pytorch PR to get merged and #359. b . Created issue on spacy https://github.com/explosion/spaCy/issues/7981.
As long as (b) is not solved, IMO the next release will have to use CPU. I understand that this could take forever with the current settings, that's why we have #356.
I would also say, that as long as at lest one of the 2 PR proposed for (a) is not merged, this PR cannot be merged either (since results wouldn't be deterministic).

Given those points:

Should the training happens on GPU (nVidia)?

Not sure this would be good. Because we would need to modify the Dockerfiles of dvc, but most important because results wouldn't be determinstic.

Should the inference happens on CPU (Intel) ?

I don't think this really matters, or does it? Results should be the same on CPU and GPU, and moreover the runtime should be irrelevant in both cases.

What revision should be put in ner/Dockerfile? I would recommend v0.2.0a for the reason below.

I see our re-occurring nightmare here 😁 Sure, v0.2.0a is a viable solution, otherwise v0.0.11 is also ok with me.

pafonta commented 3 years ago

Thank you very much for the clarifications!

as long as at lest one of the 2 PR proposed for (a) is not merged, this PR cannot be merged either

So, should we mark #351 as blocked until #359 is merged?

Otherwise, I have updated the TODO list of #351 to:

include the switch to training on CPU,
use v0.2.0a as tag for the DVC + Docker magic.

Yes, using the revision in the Dockerfile for DVC is a nightmare. Maybe we could come up with something easier.

BlueBrain / Search