FrancescoCasalegno closed this issue 3 years ago.
Regarding this point:
Unlike the experiments of PR #328, the spaCy pipeline should also include the rule-based entity extraction component.
An existing issue, #310, is about the same topic.
As this issue is addressed, there is also the following action to be done (see also PR #345):
- `en_ner_bc5cdr_md`
- `en_ner_bionlp13cg_md`
- `en_ner_craft_md`
- `en_ner_jnlpba_md`
Hello @FrancescoCasalegno,
Before moving further with this task, I think some points should be clarified:
1. Are all the checkboxes in the issue description expected to be done in this issue, which implies in one pull request? Because that is a lot of work. The current issue could be split into several incremental pieces of work. As said in https://github.com/BlueBrain/Search/issues/335#issuecomment-819339941, there was for example already an issue for one point.
2. Should this issue also deal with migrating the whole NER elements in BBS from using some models for several entities to using one model per entity? It is not clear at the moment how much extra work this task, unlisted in the issue description, could take (e.g. moving away from the `ee_models_library.csv` logic).
3. Should this issue also deal with building the new train / dev sets with only one entity per set? Currently, 3 out of 5 annotation sets have multiple entities instead of one.
4. Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new `["transformer", "ner"]` pipeline like in #328? The currently used elements from scispaCy are: a custom tokenizer, a custom vocabulary, and several trained components (`tagger`, `attribute_ruler`, `lemmatizer`, `parser`, `ner`).
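For reference, spaCy 3 lets a training config pull trained components from another installed pipeline via `source` (equivalently, `nlp.add_pipe(name, source=...)` at runtime). A minimal illustrative fragment, where the sourced model name is only an example:

```ini
# Illustrative excerpt of a spaCy 3 training config (not the project's actual config).
[components.tagger]
source = "en_ner_bc5cdr_md"

[components.lemmatizer]
source = "en_ner_bc5cdr_md"
```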
5. Shouldn't the issue be started only after the conclusions of some experiments, like #343 and #337? Indeed, if the results are not reproducible on GPU (#343), this will impact the current issue (e.g. should accuracy be favoured over reproducibility? should the tracking of the NER models with DVC be stopped?). Besides, if the runtime on CPU with a `transformer` component (#337) has increased too much compared to before, BBS would need to be adapted (e.g. switching to inference on GPU instead of CPU).
6. Should we decide on a convention for the new model names? Currently they are `model1`, ..., `model5` but this issue wants to create one model per entity. Some entities (e.g. cell compartment) have spaces in their names.
7. Should this issue also deal with changing the `pipeline/ner/Dockerfile` based on `miniconda` for one with GPU support? Indeed, GPU support is required for fine-tuning the `transformer` component.
All points above, except the following one, are independent of implementation constraints. For the following point, some implementation constraints could apply (tested on spaCy `3.0.5`).

> Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new `["transformer", "ner"]` pipeline like in #328? The currently used elements from scispaCy are: a custom tokenizer, a custom vocabulary, and several trained components (`tagger`, `attribute_ruler`, `lemmatizer`, `parser`, `ner`).
I have investigated. Technically, it is possible to reuse from scispaCy the `tagger`, `attribute_ruler`, `lemmatizer`, and `parser` components. The trained `ner` component cannot be reused, but that is expected: we use a different embedding layer type. scispaCy uses `spacy.Tok2Vec` while this ticket is about using `spacy-transformers.TransformerModel`.
➡️ I will post on Monday a performance (entity F1) comparison between reusing or not the scispaCy elements.
In the context of the experiment (see Methods below), re-using scispaCy components gives **lower** F1 and precision but **higher** recall.
Pipeline **B** is pipeline **A** but with the tokenizer, the vocabulary, the `tagger`, the `attribute_ruler`, the `lemmatizer`, and the `parser` from scispaCy.
Performance scores are from `pipelines/ner/eval.py`. Reported scores are an average over two trainings.
Both pipelines use the same train, dev, and test sets and the same Transformer checkpoint. Pipeline B uses `en_ner_bc5cdr_md` as the source for the scispaCy components.
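As a reminder of what the reported numbers mean: entity-level precision, recall, and F1 (the kind of scores surfaced by `pipelines/ner/eval.py`) reduce to true-positive, false-positive, and false-negative span counts. A minimal stdlib sketch (not the actual evaluation code):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Entity-level precision, recall, and F1 from span match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 80 correctly predicted spans, 20 spurious, 10 missed.
p, r, f = prf(80, 20, 10)  # p = 0.8, r ≈ 0.889, f ≈ 0.842
```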
1. Are all the checkboxes in the issue description expected to be done in this issue?
Let's try to do so, unless some logical steps can be isolated (see also next point).
2. Should this issue also deal with migrating the whole NER elements in BBS from using some models for several entities to using one model per entity?
Let's try to keep using `ee_models_library.csv` even though it will be redundant and trivial (one line per entity type).
Then, let's work on a separate issue to get rid of this file, which is indeed becoming obsolete → see issue #349.
3. Should this issue also deal with building the new train / dev sets with only one entity per set?
Let's first try to implement the following points:
- `preprocess` step split in 2: `train_valid_split` + `convert_jsonl_spacy`
- then request review, and once it's OK don't merge into `master` but keep working to add the remaining features.

Why? Because this is the only way to bring `master` from a valid, complete state to another that has the same properties ("consistency").

4. Are the new models expected to use part of the scispaCy elements used currently? Or should it be a brand new `["transformer", "ner"]` pipeline like in #328?
Based on the results shown here and the discussion with the team, let's stick with a brand new pipeline. This is easier and also seems to provide better F1 score.
5. Shouldn't the issue be started only after the conclusions of some experiments, like #343 and #337?
Yes, let's wait for #343 and #337 to be done!
6. Should we decide on a convention for the new model names?
The convention we agreed upon is: `model-<entity_type_name>`, so for instance we could have `model-cell_compartment`.
7. Should this issue also deal with changing the `pipeline/ner/Dockerfile` based on `miniconda` for one with GPU support?
How long do the `train_xxx` phases take if run on CPU?
I would say we should try to avoid assuming that the user has access to a GPU, even if this comes at the cost of slightly longer runtimes. Anyway, let's also wait for the results of #343 and #337.
> How long do the `train_xxx` phases take if run on CPU?
With the exact same settings, training on CPU (80 cores) is **4.730 times slower** than training on GPU (1 GPU).
It takes 2.840 sec / step on CPU while it takes 0.6005 sec / step on GPU.
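The reported slowdown can be sanity-checked directly from the per-step timings (consistent with the quoted 4.730× up to rounding):

```python
# Reported per-step timings from the comment above.
cpu_sec_per_step = 2.840
gpu_sec_per_step = 0.6005
slowdown = cpu_sec_per_step / gpu_sec_per_step  # ≈ 4.73
```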
**UPDATE** Create one JSONL per entity from the JSONLs containing several entities.
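A minimal sketch of what this splitting step amounts to, assuming Prodigy-style records (a `spans` list with a `label` per span); the function and field handling here are illustrative, not the actual PR code:

```python
import json
from collections import defaultdict

def split_by_entity(lines):
    """Group Prodigy-style JSONL records into one list per entity label,
    keeping only the spans of that label in each copy."""
    per_label = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        labels = {span["label"] for span in record.get("spans", [])}
        for label in labels:
            copy = dict(record)
            copy["spans"] = [s for s in record["spans"] if s["label"] == label]
            per_label[label].append(copy)
    return per_label

# Example: one annotated text with two entity types yields two single-entity records.
line = json.dumps({"text": "t", "spans": [{"label": "DISEASE", "start": 0, "end": 1},
                                          {"label": "CHEMICAL", "start": 2, "end": 3}]})
out = split_by_entity([line])
```

Each per-label list could then be written to its own JSONL file.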
While working on the task (see previous line), I have noticed several issues with the annotations:
- Some texts are `ignored`. They should then be deleted?
- … have spans. They should be checked and then deleted?
- Some texts have the same `_input_hash`. They should be checked and then only one duplicate be kept?
- Some labels have different cases. The case should be normalized?

**UPDATE** Automate analysis of annotations
I have written code (`analyze.py`) to analyze annotations in order to clean them up for training NER models. The code can be used either as a **script** or as a **function**. As a script, it analyzes an **individual** annotation file. As a function, it analyzes **several** annotation files while retrieving the results.
The goal is to reuse the code in the **future** to analyze **new** annotations. The code is added to the PR in https://github.com/BlueBrain/Search/pull/351/commits/020d6f41bbc7db96cfcee61fb720ea9da7c1bf26. There is extensive documentation inside the file.
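A rough sketch of the kind of checks such an analysis performs, using the annotation issues listed above. This is not the actual `analyze.py`; the record fields follow Prodigy conventions (`answer`, `spans`, `_input_hash`):

```python
from collections import Counter

def analyze(records):
    """Summarize common annotation problems in Prodigy-style records."""
    return {
        # Texts marked as ignored during annotation.
        "ignored": sum(1 for r in records if r.get("answer") == "ignore"),
        # Accepted texts carrying no entity spans.
        "accepted_without_spans": sum(
            1 for r in records if r.get("answer") == "accept" and not r.get("spans")
        ),
        # Extra copies sharing the same _input_hash (i.e. duplicates).
        "duplicated_input_hash": sum(
            n - 1 for n in Counter(r.get("_input_hash") for r in records).values() if n > 1
        ),
        # Label spellings, to spot case variants like DISEASE vs disease.
        "label_case_variants": sorted(
            {s["label"] for r in records for s in r.get("spans", [])}
        ),
    }

records = [
    {"answer": "accept", "_input_hash": 1, "spans": [{"label": "DISEASE"}]},
    {"answer": "accept", "_input_hash": 1, "spans": [{"label": "disease"}]},
    {"answer": "ignore", "_input_hash": 2, "spans": []},
    {"answer": "accept", "_input_hash": 3, "spans": []},
]
report = analyze(records)
```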
Note for both sections below: **valid** texts are texts which are accepted and which contain spans (i.e. entities).
Here is the summary table
created with the code as a function:
Here is an example of an individual report created by the code as a script:
**UPDATE** Analyze the annotations to define cleaning rules (for the next line).

Here are:
- the **invalid texts**,
- the **causes** that would need to be fixed for new annotations (for example with additions to #342).

Causes:
- **sentence selection** (e.g. annotations5)
- **annotation guidelines** (e.g. annotations14)

Duplicated `_input_hash` — potential causes:
- Prodigy import/export

Observations — causes:
- **annotation session setup**
Example from annotations10 (numbers are label counts):
```
CELL_TYPE 66
CHEMICAL 100
CONDITION 59
DISEASE 241
ORGANISM 174
PATHWAY 97
PROTEIN 228
cell_type 30
chemical 26
condition 49
disease 76
organism 25
pathway 38
protein 30
```
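If the case difference is accidental (the same entity type annotated in different sessions), the normalization suggested above could be as simple as merging counts on the upper-cased label. An illustrative sketch with a few of the numbers from annotations10:

```python
from collections import Counter

# Subset of the annotations10 label counts shown above.
raw = {"CELL_TYPE": 66, "cell_type": 30,
       "CHEMICAL": 100, "chemical": 26,
       "DISEASE": 241, "disease": 76}

# Merge case variants onto one canonical (upper-case) label.
normalized = Counter()
for label, count in raw.items():
    normalized[label.upper()] += count
# e.g. normalized["CELL_TYPE"] == 96, normalized["DISEASE"] == 317
```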
Hello @FrancescoCasalegno,
The evaluation is crashing. After investigation, the bug was introduced by #348.
Would you know why?
The bug:
```
Traceback (most recent call last):
  File "eval.py", line 111, in <module>
    main()
  File "eval.py", line 107, in main
    json.dump(all_metrics_dict, f)
  File "/opt/conda/lib/python3.8/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/opt/conda/lib/python3.8/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/opt/conda/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/opt/conda/lib/python3.8/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/opt/conda/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable
```
To reproduce:
```shell
# For the bug introduced by #348, use 0bb500551b1b7c6f5bb9228335aa4df30a654e9c.
# For the working code __before__ #348, use b9c886966ca4d893b41457a17262e198e3ba7f03.
export COMMIT=...
git clone https://github.com/BlueBrain/Search
cd Search/
# Change <image> and <container>.
docker build -f data_and_models/pipelines/ner/Dockerfile --build-arg BBS_REVISION=$COMMIT -t <image> .
docker run -it --rm -v /raid:/raid --name <container> <image>
git checkout $COMMIT
git checkout -- data_and_models/pipelines/ner/dvc.lock
cd data_and_models/pipelines/ner/
dvc pull --with-deps evaluation@organism
dvc repro -fs evaluation@organism
```
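For context on the traceback above: NumPy scalar types such as `int64` are not JSON serializable by default. Besides casting values explicitly (as the `float(v)` fix discussed in this thread does), a common pattern is a `default` hook that converts NumPy scalars via `.item()`. A stdlib-only sketch, using a stand-in class since NumPy may not be at hand:

```python
import json

class FakeInt64:
    """Stand-in for numpy.int64: exposes .item() returning a builtin int."""
    def __init__(self, value):
        self.value = value
    def item(self):
        return int(self.value)

def to_builtin(obj):
    # NumPy scalars expose .item(); otherwise fail like json normally does.
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

metrics = {"f1_micro": FakeInt64(87)}
dumped = json.dumps(metrics, default=to_builtin)  # '{"f1_micro": 87}'
```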
Hi @pafonta, https://github.com/BlueBrain/Search/pull/362 should fix it, can you please have a look?
Hello @FrancescoCasalegno,
I have 1) merged #362 on top of #351 and 2) locally removed the fix `float(v)` in `ner/eval.py`.
The evaluation stage worked! Thank you 🎉
NB: For practical reasons, I keep `float(v)` in `ner/eval.py` for #351 until #362 is merged.
@pafonta FYI #362 just merged into `master` 🙂
Hello @FrancescoCasalegno,
Could you confirm your position on the following points for #351?
1. Should the **training** happen on **GPU** (nVidia)?
2. Should the **inference** happen on **CPU** (Intel)?
3. What revision should be put in `ner/Dockerfile`? I would recommend `v0.2.0a` for the reason below.

Regarding point 3:
From my understanding, our current handling of DVC + Docker makes it mandatory to use a **version tag** in the `Dockerfile`. Indeed, we cannot know the commit hash of the merge commit before merging, can we?
Besides, we cannot release `v0.2.0` yet as other elements are part of it (cf. https://github.com/BlueBrain/Search/milestone/1).
Hi!
Here are my 2 cents. The following points are what I think we should keep in mind.
- `spacy 3.x` will give non-deterministic results with any of (a) using a `transformer` backbone or (b) using GPU.
  - a. `pytorch`: https://github.com/pytorch/pytorch/pull/57536 + created a local PR to patch `torch` while waiting for the `pytorch` PR to get merged, and #359.
  - b. Created an issue on `spacy`: https://github.com/explosion/spaCy/issues/7981.

Given those points:
> Should the training happen on GPU (nVidia)?

Not sure this would be good, because we would need to modify the `Dockerfiles` of `dvc`, but most importantly because results wouldn't be deterministic.
> Should the inference happen on CPU (Intel)?
I don't think this really matters, or does it? Results should be the same on CPU and GPU, and moreover the runtime should be irrelevant in both cases.
> What revision should be put in `ner/Dockerfile`? I would recommend `v0.2.0a` for the reason below.

I see our re-occurring nightmare here 😁 Sure, `v0.2.0a` is a viable solution, otherwise `v0.0.11` is also OK with me.
Thank you very much for the clarifications!
> as long as at least one of the 2 PRs proposed for (a) is not merged, this PR cannot be merged either

So, should we mark #351 as **blocked** until #359 is merged?
Otherwise, I have updated the TODO list of #351 to:
- … CPU,
- `v0.2.0a` as tag for the DVC + Docker magic.

Yes, using the revision in the `Dockerfile` for DVC is a nightmare. Maybe we could come up with something easier.
🚀 Feature

In light of what we discussed in PR #328, and in particular looking at the results shown in this comparison table, we should make the following changes to the NER models in `data_and_models/pipelines/ner`:
- Use a `transformer` backbone. Now they are using `tok2vec`.
- … `CORD19 NLI+STS v1`. Also, the weights should not be frozen during the fine-tuning on the NER task.
- … Currently, `model2` is used to extract 3 different entity types (see table here).
- … `add_er.py`, then in the future we'll improve that with #310).
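For reference, a brand-new `["transformer", "ner"]` pipeline corresponds to a spaCy 3 training config along these lines (a minimal illustrative excerpt, not the project's actual config):

```ini
[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
```

The `ner` component listens to the shared `transformer` embeddings instead of a `tok2vec` layer, which is why the trained scispaCy `ner` component cannot be sourced into such a pipeline.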