DFKI-NLP / sherlock

State-of-the-art Information Extraction

Prevent special tokens (e.g. `[HEAD_START]`) from being separated for *arbitrary tokenizers* #42

Closed · GabrielKP closed this issue 2 years ago

GabrielKP commented 2 years ago

The tricky part is that with AllenNLP we will not know which tokenizer will be used: it may not have a function like `add_tokens` that would prevent it from splitting the special tokens.

Solution proposal: in the FeatureConverter, encode the special tokens separately from the input text, and put both back together after tokenization.
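A minimal sketch of that idea, assuming hypothetical marker names and an arbitrary `tokenize` callable standing in for whatever tokenizer AllenNLP hands us:

```python
import re
from typing import Callable, List

# Hypothetical marker set; the real FeatureConverter may use other names.
SPECIAL_TOKENS = ["[HEAD_START]", "[HEAD_END]", "[TAIL_START]", "[TAIL_END]"]


def tokenize_keeping_markers(text: str, tokenize: Callable[[str], List[str]]) -> List[str]:
    """Split the text at the special tokens, tokenize only the plain-text
    segments with the (arbitrary) tokenizer, and re-insert the markers verbatim."""
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    tokens: List[str] = []
    for segment in re.split(pattern, text):
        if segment in SPECIAL_TOKENS:
            tokens.append(segment)            # marker passes through untouched
        elif segment.strip():
            tokens.extend(tokenize(segment))  # normal text goes through the tokenizer
    return tokens
```

This way the tokenizer never sees the markers, so it does not matter whether it supports something like `add_tokens`.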

A different solution would be to leave it to the end script/user to initialize the tokenizer with the additional tokens appropriately (but then you would need to read the dataset before initializing the FeatureConverter just to get the labels for the extra tokens, which could be computationally inefficient).
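For reference, with a Hugging Face tokenizer the second option would roughly look like the sketch below (`add_tokens` and `resize_token_embeddings` are the actual transformers API; the model name and marker names are only examples). The catch is that an arbitrary AllenNLP tokenizer may offer no equivalent of `add_tokens`:

```python
from transformers import AutoModel, AutoTokenizer

# The markers would have to be collected from the dataset first, which is
# exactly the inefficiency mentioned above; these names are just examples.
markers = ["[HEAD_START]", "[HEAD_END]", "[TAIL_START]", "[TAIL_END]"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(markers, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))  # extend the embedding matrix

# The markers now stay intact instead of being split into "[", "head", ...
print(tokenizer.tokenize("[HEAD_START] Obama [HEAD_END] was born in Hawaii"))
```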

The first solution is the cleanest, but it is roughly a one-week effort. The second solution only pushes the effort into the future, which is why I dislike it.

It only makes sense to fix this problem if there actually is a need to use a tokenizer other than the pretrained tokenizer, i.e. a tokenizer that automatically splits the special tokens.

GabrielKP commented 2 years ago

Test if the default allennlp tokenizer splits these special tokens @GabrielKP

GabrielKP commented 2 years ago

Maybe you have an opinion on this, @leonhardhennig?

leonhardhennig commented 2 years ago

Christoph has implemented your suggestion 1 in his new IE repository, maybe we can "copy" from there? In general, I'd prefer solution 1. But as you wrote, the real question is: do we use tokenizers other than the pretrained one?

Can we just document this and add a warning if the tokenizer is being redefined?

GabrielKP commented 2 years ago

Because I had no other choice, I have implemented option 2 for allennlp for now anyway. I will look into implementing the first solution after getting the model to run as it is now.

leonhardhennig commented 2 years ago

Re choice 1 - the relevant code is in https://github.com/ChristophAlt/pytorch-ie/blob/main/pytorch_ie/taskmodules/transformer_re_text_classification.py, function `_single_pair_insert_marker`. First, create an encoding of the raw text using the tokenizer, then insert the head/tail markers as special tokens (lines 272-282) and extend the embedding matrix. The head/tail tokens are found via `encoding.char_to_token()` (lines 201/202).
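Re-sketched with a Hugging Face fast tokenizer (the text, entity spans, and marker names below are placeholders, not the actual pytorch-ie code), the approach is roughly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Illustrative input: character spans of the head/tail entities in the raw text.
text = "Obama was born in Hawaii"
head_span, tail_span = (0, 5), (18, 24)

# 1) Encode the raw text first, without any markers.
encoding = tokenizer(text)

# 2) Map the character offsets to token positions (requires a fast tokenizer).
head_start = encoding.char_to_token(head_span[0])
head_end = encoding.char_to_token(head_span[1] - 1) + 1
tail_start = encoding.char_to_token(tail_span[0])
tail_end = encoding.char_to_token(tail_span[1] - 1) + 1

# 3) Register the markers as special tokens (the model's embedding matrix is
#    then extended via model.resize_token_embeddings(len(tokenizer))) and
#    insert their ids around the entity spans.
markers = ["[HEAD_START]", "[HEAD_END]", "[TAIL_START]", "[TAIL_END]"]
tokenizer.add_tokens(markers, special_tokens=True)
hs_id, he_id, ts_id, te_id = tokenizer.convert_tokens_to_ids(markers)

input_ids = list(encoding["input_ids"])
# Insert from right to left so earlier insertion points stay valid.
for pos, marker_id in sorted(
    [(head_start, hs_id), (head_end, he_id), (tail_start, ts_id), (tail_end, te_id)],
    reverse=True,
):
    input_ids.insert(pos, marker_id)
```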

GabrielKP commented 2 years ago

The suggested code is elegant but not straightforward to apply to this code base: it handles token IDs directly, whereas this code base is structured to not convert to IDs directly (even though part of the code actually does this, see point 3 below).

Further issues in this code base arise due to the following 4 factors:

  1. handling different marker modes: "mark_entity", "mark_entity_append_ner", "mask_entity", "mask_entity_append_text"

  2. handling transformers vs allennlp

  3. handling "inconsistencies" between Tokenizers within allennlp

  4. I did not want to rewrite the entire entity handling procedure.

Points 1 and 2 are not that hard to solve, but the 3rd is a bit tricky: e.g. the spacy allennlp tokenizer will always split special tokens and thus must not be allowed to tokenize them (i.e. only the non-special tokens may be tokenized with it), whereas the transformers allennlp tokenizer is necessarily required to tokenize all tokens, because it secretly also indexes each token (the TokenIndexer only moves the index into the right field).
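To illustrate the spacy case: to spaCy a marker is just punctuation wrapped around a word, so (assuming allennlp's `SpacyTokenizer`) it gets split apart:

```python
from allennlp.data.tokenizers import SpacyTokenizer

# The spacy-based tokenizer has no notion of our markers and splits them,
# yielding something like ['[', 'HEAD_START', ']', 'Obama', '[', 'HEAD_END', ']'].
print([t.text for t in SpacyTokenizer().tokenize("[HEAD_START] Obama [HEAD_END]")])
```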

I have now solved the problem by implementing a "procedural" version of the code snippet above: the tokens are tokenized jointly until a marker is encountered; the marker is then handled according to the correct case for factors 1, 2 and 3, and the same is done with the following tokens. This also makes it easy to check whether an entity was cut off or only "unimportant text". I implemented that check: instances in which entities are truncated are dropped entirely, whereas instances in which only "unimportant text" is truncated get the metadata field "truncated" set to True. The tests have been adapted accordingly.
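A condensed sketch of that procedural idea (the function name, the marker handling, and the truncation bookkeeping are simplified stand-ins for the actual implementation):

```python
from typing import Callable, Dict, List, Optional, Tuple

MARKERS = {"[HEAD_START]", "[HEAD_END]", "[TAIL_START]", "[TAIL_END]"}


def convert(
    segments: List[str],                   # raw text already split at marker boundaries
    tokenize: Callable[[str], List[str]],  # whichever tokenizer allennlp/transformers provides
    max_length: int,
) -> Optional[Tuple[List[str], Dict[str, bool]]]:
    """Tokenize segment by segment; markers are passed through according to
    their case, and truncation is tracked along the way."""
    tokens: List[str] = []
    inside_entity = False
    for segment in segments:
        if segment in MARKERS:
            new_tokens = [segment]           # marker handling is mode/framework specific in reality
            inside_entity = segment.endswith("_START]")
        else:
            new_tokens = tokenize(segment)   # plain text: delegate to the tokenizer
        if len(tokens) + len(new_tokens) > max_length:
            if inside_entity or segment in MARKERS:
                return None                  # an entity would be cut off: drop the instance
            tokens.extend(new_tokens[: max_length - len(tokens)])
            return tokens, {"truncated": True}   # only "unimportant text" was cut
        tokens.extend(new_tokens)
    return tokens, {"truncated": False}
```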