dmis-lab / biobert

Bioinformatics'2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining
http://doi.org/10.1093/bioinformatics/btz682

NER detokenize index error #109

Open atulkakrana opened 4 years ago

atulkakrana commented 4 years ago

Hi Authors, I am trying to fine-tune BioBERT for an NER task, using datasets from the BioNLP challenges. I am running into two issues:

ISSUE-1

I see thousands of warnings like this from ner_detokenize.py:

## The predicted sentence of BioBERT model looks like trimmed. (The Length of the tokenized input sequence is longer than max_seq_length); Filling O label instead.
   -> Showing 10 words near skipped part : x C57BL / 6 ) F1 mice . [SEP] [CLS] We

I checked the sentences, and their lengths do not exceed the max_seq_length value (max_seq_length=256) as the warning claims. For example, here is the sentence referred to in the warning above:

[CLS]
The
trans
##gene
was
pu
##rified
and
injected
into
C
##5
##7
##BL
/
6
##J
x
CB
##A
F1
z
##y
##got
##es
.
[SEP]

Could you please tell me what might be causing these warnings?

ISSUE-2

ner_detokenize.py is throwing this error:

idx:  179999 offset:  116302
idx:  180000 offset:  116302
idx:  180001 offset:  116302
Traceback (most recent call last):
  File "biocodes/ner_detokenize.py", line 159, in <module>
    transform2CoNLLForm(golden_path=args.answer_path, output_dir=args.output_dir, bert_pred=bert_pred, debug=args.debug)
  File "biocodes/ner_detokenize.py", line 137, in transform2CoNLLForm
    if ans['labels'][idx+offset] != '[SEP]':
IndexError: list index out of range

Can you please tell me what's causing this error?

I would really appreciate your help.

AK

atulkakrana commented 4 years ago

Here are my commands:

# predict labels for test set
mkdir $OUTPUT_DIR
python3 run_ner.py \
    --do_train=false \
    --do_predict=true \
    --vocab_file=$BIOBERT_DIR/vocab.txt \
    --bert_config_file=$BIOBERT_DIR/bert_config.json \
    --init_checkpoint=$TRAINED_CLASSIFIER \
    --data_dir=$NER_DIR/ \
    --max_seq_length=256 \
    --output_dir=$OUTPUT_DIR

## compute entity level performance
python3 biocodes/ner_detokenize.py \
   --token_test_path=$OUTPUT_DIR/token_test.txt \
   --label_test_path=$OUTPUT_DIR/label_test.txt \
   --answer_path=$NER_DIR/test.tsv \
   --output_dir=$OUTPUT_DIR
wonjininfo commented 4 years ago

Hi AK, the warning (ISSUE-1) is raised when ner_detokenize.py cannot reconstruct the original sentence (i.e., when the reconstructed sentence != the original pre-processed sentence from answer_path).

It seems that the tokenization of the original dataset is not compatible with the BERT BPE (WordPiece) tokenizer. I am not sure, since I wasn't able to check the original dataset and your pre-processed dataset, but the / in C57BL / 6 looks like a weak point.
You need to add extra spacing before and after / so that it gets its own line in the original sentence (check $NER_DIR/test.tsv).
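
For example, here is a rough sketch of what I mean (this is not our exact pre-processing script; it assumes the tokenization.py that ships with the BERT/BioBERT code): pre-tokenize the raw sentence with BasicTokenizer so that punctuation such as / ends up on its own line in the CoNLL-style test.tsv.

import tokenization  # tokenization.py from the BERT/BioBERT repository

basic_tokenizer = tokenization.BasicTokenizer(do_lower_case=False)

def write_conll(sentences, path):
    # sentences: raw sentence strings; "O" is only a placeholder label for
    # unlabeled test data.
    with open(path, "w") as f:
        for sent in sentences:
            for token in basic_tokenizer.tokenize(sent):
                f.write("{}\tO\n".format(token))
            f.write("\n")  # blank line separates sentences

write_conll(["The transgene was purified and injected into C57BL/6J x CBA F1 zygotes."],
            "test.tsv")

BasicTokenizer splits off punctuation, so C57BL/6J comes out as C57BL / 6J on separate lines, which should match what the BERT tokenizer produces and let ner_detokenize.py align the predictions back to test.tsv.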

Also, #107 may help you.

ISSUE-2 will be resolved when ISSUE-1 gets solved.

PS) I am thinking of releasing tokenization code in the near future. Unfortunately, the other authors and I are extremely busy due to this difficult situation (COVID-19) and do not have enough time to respond.

We will get back to this topic soon! Thanks and take care, Wonjin

atulkakrana commented 4 years ago

Hi Wonjin, thanks so much for the prompt help. I used spaCy to tokenize (i.e., preprocess) my dataset. I think there might be two separate issues causing the problem. Please see below.

Issue with my preprocessing

## The predicted sentence of BioBERT model looks like trimmed. (The Length of the tokenized input sequence is longer than max_seq_length); Filling O label instead.
   -> Showing 10 words near skipped part : diseases comprise Parkinson 's disease , Huntington 's disease and Alzheimer

I tokenized this sentence using BioBERT's BasicTokenizer and FullTokenizer, then compared the output with my own preprocessing as well as with the token_test.txt generated during prediction (inference) by BioBERT.

My tokenizer | BioBERT BasicTokenizer | BioBERT FullTokenizer | token_test.txt
In ['In', ['In', In
addition 'addition', 'addition', addition
, ',', ',', ,
the 'the', 'the', the
compound 'compound', 'compound', compound
of 'of', 'of', of
the 'the', 'the', the
invention 'invention', 'invention', invention
can 'can', 'can', can
be 'be', 'be', be
used 'used', 'used', used
for 'for', 'for', for
preparing 'preparing', 'preparing', preparing
medicines 'medicines', 'medicines', medicines
for 'for', 'for', for
preventing 'preventing', 'preventing', preventing
or 'or', 'or', or
treating 'treating', 'treating', treating
neurodegenerative 'neurodegenerative', 'ne', ne
diseases 'diseases', '##uro', ##uro
caused 'caused', '##de', ##de
by 'by', '##gene', ##gene
free 'free', '##rative', ##rative
radicals 'radicals', 'diseases', diseases
oxidative 'oxidative', 'caused', caused
damage 'damage', 'by', by
, ',', 'free', free
wherein 'wherein', 'radical', radical
the 'the', '##s', ##s
neurodegenerative 'neurodegenerative', 'o', o
diseases 'diseases', '##xi', ##xi
comprise 'comprise', '##da', ##da
Parkinson 'Parkinson', '##tive', ##tive
's "'", 'damage', damage
disease 's', ',', ,
, 'disease', 'wherein', wherein
Huntington ',', 'the', the
's 'Huntington', 'ne', ne
disease "'", '##uro', ##uro
and 's', '##de', ##de
Alzheimer 'disease', '##gene', ##gene
's 'and', '##rative', ##rative
disease 'Alzheimer', 'diseases', diseases
. "'", 'comprise', comprise
  's', 'Parkinson', Parkinson
  'disease', "'", '
  '.'] 's', s
    'disease', disease
    ',', ,
    'Huntington', Huntington
    "'", '
    's', s
    'disease', disease
    'and', and
    'Alzheimer', Alzheimer
    "'", '
    's', s
    'disease', disease
    '.'] .

Tokens from my preprocessing do not match those from BioBERT's BasicTokenizer. This is the problem you mentioned in your last comment.

Question-1: Do you think replacing spaCy with BasicTokenizer in my workflow will solve this issue?
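
(For reference, this is roughly how I ran the comparison above; a sketch assuming the tokenization.py shipped with the repo and the cased BioBERT vocab.txt in the working directory.)

import tokenization  # tokenization.py from the BERT/BioBERT repository

sentence = ("In addition, the compound of the invention can be used for preparing "
            "medicines for preventing or treating neurodegenerative diseases caused "
            "by free radicals oxidative damage, wherein the neurodegenerative diseases "
            "comprise Parkinson's disease, Huntington's disease and Alzheimer's disease.")

basic = tokenization.BasicTokenizer(do_lower_case=False)
full = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=False)

print(basic.tokenize(sentence))  # word-level tokens, punctuation split off
print(full.tokenize(sentence))   # WordPiece tokens with '##' continuation pieces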

Issue with ner_detokenize.py

## The predicted sentence of BioBERT model looks like trimmed. (The Length of the tokenized input sequence is longer than max_seq_length); Filling O label instead.
   -> Showing 10 words near skipped part : Compared with similar compounds , the phenylamine acid compound has good

Here are the tokens from the different tokenizers and token_test.txt for this sentence:

My tokenizer | BioBERT BasicTokenizer | BioBERT FullTokenizer | token_test.txt
Compared ['Compared', ['Compared', Compared
with 'with', 'with', with
similar 'similar', 'similar', similar
compounds 'compounds', 'compounds', compounds
, ',', ',', ,
the 'the', 'the', the
phenylamine 'phenylamine', 'p', p
acid 'acid', '##hen', ##hen
compound 'compound', '##yla', ##yla
has 'has', '##mine', ##mine
good 'good', 'acid', acid
effect 'effect', 'compound', compound
of 'of', 'has', has
inducing 'inducing', 'good', good
the 'the', 'effect', effect
activation 'activation', 'of', of
of 'of', 'in', in
HIV 'HIV', '##ducing', ##ducing
latent 'latent', 'the', the
cells 'cells', 'activation', activation
, ',', 'of', of
and 'and', 'HIV', HIV
mainly 'mainly', 'late', late
has 'has', '##nt', ##nt
low 'low', 'cells', cells
toxicity 'toxicity', ',', ,
to 'to', 'and', and
cells 'cells', 'mainly', mainly
. '.'] 'has', has
    'low', low
    'toxicity', toxicity
    'to', to
    'cells', cells
    '.'] .

In this case, the tokens from my preprocessing workflow match those from BioBERT's BasicTokenizer, and they would match the token_test.txt from the predictions if detokenized properly.

Question-2: Do you think there is a bug in ner_detokenize.py, i.e., that it is making a mistake in reconstructing the original tokens and there are no issues with my preprocessing?

Question-3: Is only the ner_detokenize step affected by (i.e., dependent on) the specific pre-processing workflow that your group uses? In other words, is run_ner.py for fine-tuning and inference independent of your pre-processing, so that I could write my own detokenizer script to parse the predictions (token_test.txt and label_test.txt) and compute entity-level performance?
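
To illustrate Question-3, here is a rough sketch of the kind of detokenizer script I have in mind (not the repo's ner_detokenize.py; it assumes token_test.txt and label_test.txt contain one token/label per line and skips the truncation handling and test.tsv alignment that ner_detokenize.py performs):

def detokenize(token_path, label_path):
    # Merge WordPiece pieces back into whole words, keep the label of the
    # first piece, and drop the special [CLS]/[SEP]/[PAD] tokens.
    words, labels = [], []
    with open(token_path) as tf, open(label_path) as lf:
        for token, label in zip(tf, lf):
            token, label = token.strip(), label.strip()
            if token in ("[CLS]", "[SEP]", "[PAD]", ""):
                continue
            if token.startswith("##") and words:
                words[-1] += token[2:]   # continuation piece: glue onto the previous word
            else:
                words.append(token)      # new word: keep this piece's label
                labels.append(label)
    return words, labels

words, labels = detokenize("token_test.txt", "label_test.txt")
for word, label in zip(words, labels):
    print("{}\t{}".format(word, label))

The resulting word/label pairs could then be fed to something like the conlleval script to compute entity-level precision/recall/F1, assuming the word sequence lines up with test.tsv.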

atulkakrana commented 4 years ago

Hi @wonjininfo, I updated my preprocessing workflow; it now uses BioBERT's tokenizers. I guess that fixes the "Issue with my preprocessing" mentioned in my last comment.

I still get the warnings and the error posted in my first comment, so I guess the issue with ner_detokenize described in my last comment still exists. Any thoughts?

AK

AndreasSaka commented 4 years ago

@atulkakrana I am having similar problems when I fine-tune on multiple entities. Did you find a workaround? Can you share, please?

wonjininfo commented 4 years ago

Hi all, would you check this comment: https://github.com/dmis-lab/biobert/issues/107#issuecomment-615558492

Hi, the pre-processing of the datasets was mostly done by the other co-authors. I tried NLTK for another project of mine, but it seems NLTK is not compatible with the BERT tokenizer (especially near special characters). So I took the tokenizer code from this repository by co-authors and modified it for my own use (see the end of this comment for the modified code).