mayhewsw opened 5 years ago
We'd certainly be open to taking a PR for this assuming the issue you highlight can be resolved in a clean fashion. The other issue that we're unsure about is whether this would entail adding Flair itself as a dependency which we would like to avoid. Thank you.
I've made a quick and dirty implementation, but it does indeed add Flair as a dependency, which I totally agree is not so clean. That said, the code is relatively simple, perhaps it could be implemented directly in allennlp.
So, yeah, given that, I'd say there are two options: (1) keep this as a separate add-on to allennlp that adds a few `Registrable` components if you want them, so we don't add the dependency directly to allennlp (bonus if it's also pip-installable), or (2) do whatever needs to be done so we can load and use flair embeddings without having to import flair. I have nothing against flair, we just already get a bunch of complaints about too many dependencies in the core library, and requests to split things out.
Not sure I understand option 1: "separate add on" means, for example, my code stays in my repo, but can be easily added to allennlp (maybe with pip)? I like this idea.
Yeah, it's basically like a separate `allennlp-contrib` repo. We've talked about maintaining one of these ourselves, but I don't think we're ready to do that at this point - maybe someday we'll split things out a bit more, and then something like this would make sense for us to do. But if you want to maintain a repo with additional pip-installable components, I'd say go for it. I think all you would have to do would be to use `--include-package` with whatever package you pip installed.
Aside from using FLAIR's specific implementation, there could be a lot of use in creating a generic sentence-level character encoder. I've seen a slightly different formulation here: https://arxiv.org/abs/1805.08237. The authors concatenate all four edge states for each word, while FLAIR only concatenates two of the four states.
It seems like character-level word embeddings computed over the entire sentence can offer a boost in evaluation performance compared to computing them on each word alone, even without pretraining with a LM.
I looked at this a bit more and noticed a potential issue with implementing an indexer. The `tokens_to_indices` method in an indexer accepts a list of `Token` objects, but this is insufficient to represent the information we need. That is, the embedder needs to know (1) the word tokens (or alternatively the character offsets) that segment the raw text and (2) the raw text itself. If we just have the word tokens, then we are missing information about separator tokens like whitespace (or no separator). If we just have the raw text, we can't compute offsets for each word.
Unless I'm missing something obvious, there would be required changes along the lines of:
1. Replace `List[Token]` with a `Sentence` object which can optionally store the raw text. Then the indexer could use a simple algorithm to scan for substrings of tokens in the raw text to find the offsets. Alternatively, the `Sentence` could compute this internally.
2. Require `List[Token]` to be formatted a certain way. For instance, each `Token` represents a character in the raw text, with special `[WORD_START]`/`[WORD_END]` tokens that denote word boundaries. This would need a custom tokenizer, which may not work with current `DatasetReader`s without code changes.
3. Precompute the offsets when tokenizing. This is likely not easily interoperable with existing tokenizers.
4. Ignore any intermediate tokens entirely and just use one space between each word. This would be the simplest and require no interface changes. But it might also cause issues with pretrained FLAIR integration, as it is trained with the raw text in mind.
I'm inclined to choose 1., as it would work well with existing tokenizers and dataset readers and would be easier to change in the future. Some datasets already supply tokenized words and so do not have the raw text available. In that case, approximating the raw text by adding a default space separator between word tokens as in 4. could be a compromise.
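To make option 1 concrete, here's a minimal sketch of the scanning algorithm (a hypothetical helper, not part of allennlp): find each word token in the raw text left to right, falling back to single-space joining when only pre-tokenized words are available, as in option 4.

```python
def reconstruct_offsets(raw_text, word_tokens, sep=" "):
    """Find the inclusive (start, end) character offsets of each word
    token in raw_text by scanning left to right.

    If raw_text is None (pre-tokenized data), approximate it by joining
    the tokens with a single separator, as in option 4.
    """
    if raw_text is None:
        raw_text = sep.join(word_tokens)
    offsets = []
    cursor = 0
    for token in word_tokens:
        start = raw_text.index(token, cursor)  # ValueError if tokens don't match the text
        offsets.append((start, start + len(token) - 1))
        cursor = start + len(token)
    return raw_text, offsets

# The same word tokens yield different offsets depending on the raw text:
print(reconstruct_offsets("go.", ["go", "."]))   # ('go.', [(0, 1), (2, 2)])
print(reconstruct_offsets("go .", ["go", "."]))  # ('go .', [(0, 1), (3, 3)])
```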
What do you all think? @joelgrus @matt-gardner
I don't know much about Flair embeddings, but I took a quick look at their paper and it looks like they're just doing character-level embeddings and then taking the last character after each word? This doesn't seem conceptually different from what we're doing for e.g. BERT, where we get one embedding per wordpiece and then (potentially) take the first or last embedding for each word?
Yes, but the BERT wordpieces ignore tokenized whitespace, while FLAIR uses it. Currently, indexers all assume the input is pre-tokenized, but we need the raw text with the whitespace. But we also need to know where the word boundaries are.
wouldn't you just use the character tokenizer (which would keep spaces) and then compute the offsets in the token indexer?
To compute the offsets, we also need to know the word boundaries from the tokenized text. We need two pieces of information, but `List[Token]` only allows for one.
are the rules for word boundaries that complicated that you couldn't just include them in the token indexer?
No, but you would either (1) add boundary separator tokens beforehand, or (2) make assumptions about how the text was originally tokenized. For instance, if you have the tokens `["go", "."]`, was the raw text `"go."` or `"go ."`? Might not make a huge difference, but it's something to consider.
what does "originally tokenized" mean here?
say I have a sentence `"go."`
I feed that to the character tokenizer and get `["g", "o", "."]`
if the sentence were `"go ."`, I would get `["g", "o", " ", "."]`
Yes, that's exactly right, but to compute word-level embeddings, you need to also return indices representing the span of each word.
In the case of `["g", "o", "."]`, it would be something like `[(0, 1), (2, 2)]`. In the case of `["g", "o", " ", "."]`, it would be `[(0, 1), (3, 3)]`.
Where can we compute these boundaries? The list of tokenized words. But with the tokenized words alone, e.g., `["go", "."]`, we won't know whether we have the first case or the second. So what I'm saying is that to faithfully represent the original sentence, we need both the token-level information that captures word boundaries and the raw text that represents the characters of the words and also between the words (which tokenization erases). Right now, the `tokens_to_indices` method prevents an easy way to pass both pieces of information. Unless, of course, we just make a simple assumption, like that every word token should have a single space between them (e.g., we always compute character-level info on `["g", "o", " ", "."]`).
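For that single-space assumption, a small illustrative sketch of building the character sequence and the per-word spans directly from word tokens:

```python
def chars_and_spans(word_tokens, sep=" "):
    """Build the character sequence for a sentence from word tokens,
    assuming a single space between words, plus each word's inclusive
    (start, end) span within that sequence."""
    chars, spans = [], []
    for i, word in enumerate(word_tokens):
        if i > 0:
            chars.append(sep)  # assumed separator; the real one is unknown
        start = len(chars)
        chars.extend(word)
        spans.append((start, len(chars) - 1))
    return chars, spans

chars, spans = chars_and_spans(["go", "."])
print(chars)  # ['g', 'o', ' ', '.']
print(spans)  # [(0, 1), (3, 3)]
```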
ok, I think I get it now. but the spacy tokenizer is already returning the offsets as `token.idx`:

```
In [11]: t = WordTokenizer()

In [12]: tokens = t.tokenize("This isn't it, chief.")

In [13]: for token in tokens:
    ...:     print(token.idx, token)
    ...:
0 This
5 is
7 n't
11 it
13 ,
15 chief
20 .
```
is that not sufficient for the token indexer?
That's assuming you tokenized with spacy. But what if I tokenized with my own tokenizer, or my text is pre-tokenized? Hence the options I listed above.
if your text is pre-tokenized you're out of luck in any case.
I am extremely comfortable enforcing "if you want to use flair embeddings, you must use a tokenizer that generates offsets (e.g. the default WordTokenizer)", that's much simpler than just about any other solution.
I guess we can leave it at that, then.
But I was hoping to create a generic sentence-level character encoder that I could use with any dataset. E.g., I primarily use Universal Dependencies, whose data already comes tokenized out of the box. Should I be forced to modify my dataset reader and tokenizers to be able to work with FLAIR? Or can we add a simple function that reconstructs the offsets from the given tokenization, if possible? If no raw text is available, then assuming one space between each word token could be sufficient.
And again, if we go with the spacy tokenizer, we may still need to modify the `tokens_to_indices` method to either pass in an extra `offsets` parameter or a `Sentence` object containing those offsets.
in this case your `DatasetReader` must be (I assume) somehow creating `Token` objects to populate a `TextField`? in which case I'd say that yes, it's the dataset reader's job to populate the `idx` fields of those tokens. if you're primarily using the same dataset, then that's just a small one-time hit to write that code?
It's entirely possible to do this automatically without needing to modify the current dataset readers. Maybe it would be more useful as a utility function. In any case, it's no big deal.
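A sketch of what such a utility function might look like, using a stand-in `Token` class with only the fields needed here (the real allennlp `Token` carries more):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Token:
    """Stand-in for allennlp's Token with only the fields used here."""
    text: str
    idx: Optional[int] = None  # character offset into the raw sentence

def fill_token_offsets(tokens: List[Token], raw_text: Optional[str] = None) -> List[Token]:
    """Populate each token's idx field after the fact, so existing
    dataset readers need no changes. When the raw text is unavailable
    (pre-tokenized data), approximate it with single-space separators."""
    if raw_text is None:
        raw_text = " ".join(t.text for t in tokens)
    cursor = 0
    for token in tokens:
        token.idx = raw_text.index(token.text, cursor)
        cursor = token.idx + len(token.text)
    return tokens

tokens = fill_token_offsets([Token("go"), Token(".")])
print([t.idx for t in tokens])  # [0, 3]
```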
Then my only remaining concern is how to pass both the character tokens and the offsets to the indexer. It will require a change to the indexer interface.
look at how `TokenCharactersIndexer.tokens_to_indices` works: you'd basically just do that, except that you'd have to grab each `token.idx` and generate a second vector of offsets to return.

in fact, you could probably just add a new parameter to that token indexer, `compute_offsets: bool = False`, so that if it's true it does that, and then you don't even need to write a new token indexer
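Roughly, such an extension might look like the sketch below. The function shape, return keys, and the `compute_offsets` flag are illustrative only, not the actual `TokenCharactersIndexer` API:

```python
from collections import namedtuple

# Stand-in token with just the fields the sketch needs.
Token = namedtuple("Token", ["text", "idx"])

def tokens_to_indices(tokens, char_vocab, compute_offsets=False):
    """Map a sentence to character ids; optionally also return each
    word's inclusive character span, gathered from token.idx."""
    # Rebuild the character stream from offsets, filling the gaps
    # between words (whitespace the tokenizer dropped) with spaces.
    length = max(t.idx + len(t.text) for t in tokens)
    chars = [" "] * length
    for t in tokens:
        for i, ch in enumerate(t.text):
            chars[t.idx + i] = ch
    char_ids = [char_vocab.get(c, 0) for c in chars]  # 0 = unknown, for the sketch
    if not compute_offsets:
        return {"char_ids": char_ids}
    offsets = [(t.idx, t.idx + len(t.text) - 1) for t in tokens]
    return {"char_ids": char_ids, "offsets": offsets}

vocab = {c: i + 1 for i, c in enumerate("go .")}
out = tokens_to_indices([Token("go", 0), Token(".", 3)], vocab, compute_offsets=True)
print(out["offsets"])  # [(0, 1), (3, 3)]
```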
Ah, makes sense now. Thanks!
@mayhewsw Would you be willing to share your implementation of including flair embeddings or some pointers on how you did it?
@zeeshansayyed At risk of embarrassing myself, here's a gist with my quick and dirty implementation: https://gist.github.com/mayhewsw/26939faf0a7190a6d174893a31ba0ac8
@dirkgr @matt-gardner if I open a PR based on @mayhewsw work is something that would be accepted?
(Fine with me, fwiw)
Browsing over the code in the gist, I assume the scope of this is just to create the embeddings, but not to make it trainable, right?
@dirkgr yes, this would be an embedding generator only.
It is possible, but very tricky, to implement it using only AllenNLP. At least for me. As it is a character LM that uses the embedding of the first whitespace character after each word.
But the performance for NER is way better than anything else I tested.
I'm not wild about making Flair a dependency of the core allennlp library, but we could put this into `allennlp-models` as an NER model, and go from there? If we then decide to add the other Flair models, maybe we'll expand.
@mayhewsw's gist only provides the embedder. What's involved with making an NER model?
It would be great to see a token embedder for Flair embeddings. They have released an extensive toolkit, including pretrained models, so in theory it could be straightforward to incorporate them.
A complication is that they operate on the character level over the entire sentence, so in order to get word embeddings, one needs to include spans indicating character offsets for each word. The actual values are much different, but the idea is similar in principle to the BERT offsets. Presumably there would need to be a Flair token indexer as well.