flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

How to implement custom Tokenizer #2715

Closed riccardobucco closed 2 years ago

riccardobucco commented 2 years ago

I'm trying to implement a custom tokenizer, but I can't get the start position of each token. Let me use an exact copy of your SegtokTokenizer to explain my issue:

from flair.data import Sentence
from flair.tokenization import SegtokTokenizer

s = Sentence("I'm Riccardo", use_tokenizer=SegtokTokenizer())
print(s.tokens[0].start_pos)

The output of this piece of code is 0. Now let me copy and paste the code of SegtokTokenizer:

from typing import List
from flair.data import Sentence, Tokenizer
from segtok.segmenter import split_single
from segtok.tokenizer import split_contractions, word_tokenizer

class SegtokTokenizer(Tokenizer):

    def __init__(self):
        super(SegtokTokenizer, self).__init__()

    def tokenize(self, text: str) -> List[str]:
        return SegtokTokenizer.run_tokenize(text)

    @staticmethod
    def run_tokenize(text: str) -> List[str]:
        words: List[str] = []

        sentences = split_single(text)
        for sentence in sentences:
            contractions = split_contractions(word_tokenizer(sentence))
            words.extend(contractions)

        words = list(filter(None, words))

        return words

s = Sentence("I'm Riccardo", use_tokenizer=SegtokTokenizer())
print(s.tokens[0].start_pos)

This is now printing None!! I don't understand what I'm missing here.

Of course my final goal is to implement my own tokenizer (I'm not going to use a copy of your tokenizer as I did here). But no matter what I do I always get None, even if I use a copy of your code. Please help me here.
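
For anyone who lands here with a tokenizer that only returns a list of strings: below is a minimal sketch, independent of flair, of how start offsets can be recovered by scanning the original text from left to right. The helper name `token_offsets` and the hand-written token list are assumptions made for illustration; this is not taken from flair's source and is not claimed to be how `Sentence` fills in `start_pos`.

```python
from typing import List, Optional, Tuple


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Pair each token with its start offset in `text`.

    Hypothetical helper, not part of flair. Tokens are searched for in order,
    starting from the end of the previous match, so repeated tokens map to
    successive occurrences. A token that cannot be found gets None.
    """
    offsets: List[Tuple[str, Optional[int]]] = []
    cursor = 0  # position in `text` just past the last matched token
    for tok in tokens:
        pos = text.find(tok, cursor)
        if pos == -1:
            # The tokenizer altered or invented this token; leave it unaligned.
            offsets.append((tok, None))
            continue
        offsets.append((tok, pos))
        cursor = pos + len(tok)
    return offsets


# Token list written by hand for illustration, not produced by segtok here.
print(token_offsets("I'm Riccardo", ["I", "'m", "Riccardo"]))
# → [('I', 0), ("'m", 1), ('Riccardo', 4)]
```

This only works when the tokenizer returns substrings of the input unchanged. If it normalizes text (lowercasing, Unicode folding), the lookup fails for those tokens and they come back with `None`.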

riccardobucco commented 2 years ago

I solved it by installing flair from the repo instead of relying on published versions

alanakbik commented 2 years ago

We just released flair 0.11 so it should hopefully work now!
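
Since the behaviour seems to depend on which flair build is installed, it may help to check the version first. The following is a rough sketch that uses only the standard library (`importlib.metadata`, Python 3.8+). The helper names `installed_version` and `release_tuple` are made up for this example, and the `(0, 11)` threshold is taken from the release mentioned above; whether a given build actually contains the fix is an assumption to verify.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Parse the leading numeric part of a version, e.g. '0.11.3' -> (0, 11, 3).

    Deliberately naive: suffixes such as 'rc1' or '.dev0' are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


found = installed_version("flair")
if found is None:
    print("flair is not installed in this environment")
elif release_tuple(found) >= (0, 11):
    print(f"flair {found} is at least 0.11")
else:
    print(f"flair {found} is older than 0.11; consider upgrading")
```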

igormis commented 2 years ago

@riccardobucco can you tell me which version you installed, or give me a link? I have the same issue.