Open vboton opened 6 years ago
One way of doing this is to use each bigram as a key, with the boolean True as its value:
features['fo'] = True
features['oo'] = True
features['od'] = True
If you also want to consider the position of the bigram, it would be something like:
features['fo_word_prefix'] = True
features['oo_word_middle'] = True
features['od_word_suffix'] = True
For reference, have a look at how features['BOS'] = True is set in the function word2features in https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb
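The idea above can be sketched as a small helper that could be called from a word2features-style function. This is only an illustration: the name char_ngram_features and its parameters are made up here, and the _word_prefix / _word_middle / _word_suffix key suffixes simply follow the naming used in this comment.

```python
def char_ngram_features(token, n=2, with_position=False):
    """Return a dict with one boolean feature per character n-gram of token.

    Hypothetical helper, not part of crfsuite or sklearn-crfsuite.
    """
    features = {}
    last_start = len(token) - n  # start index of the final n-gram
    for i in range(last_start + 1):
        gram = token[i:i + n]
        if not with_position:
            key = gram
        elif i == 0:
            key = gram + "_word_prefix"
        elif i == last_start:
            key = gram + "_word_suffix"
        else:
            key = gram + "_word_middle"
        features[key] = True
    return features


print(char_ngram_features("food"))
print(char_ngram_features("food", with_position=True))
```

The returned dict can be merged into the per-token feature dict with features.update(...). Tokens shorter than n yield an empty dict, and a token of exactly n characters gets only the prefix key.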
On Fri, Jul 6, 2018 at 5:09 PM, yamivicen notifications@github.com wrote:
I have training data where each token is a word, and I've already extracted a few features such as NER, POS and CHUNK for each token. But I have a problem when I try to extract character n-gram features. Since these features are computed at the character level, I don't know how to represent their values in the attribute-value format. For example, if the current token is "food", then its character bigram feature will be something like "fo, oo, od". So how do I have to format that feature? By writing something like bigram[0]=fo, oo, od?
Take a look at the Stanford NLP NER features. These features are quite useful for morphologically rich languages such as Finnish, Turkish, Russian and others.
You can write the prefixes of the word "food" like:
And the suffixes:
I don't remember the exact start and end flags but you get the idea.
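Since the exact flags are not given here, the following is only a rough sketch of prefix and suffix features for a token. The "<" and ">" markers are placeholders chosen for illustration and are not claimed to match Stanford NER's notation; the helper name affix_features and the max_len parameter are likewise hypothetical.

```python
def affix_features(token, max_len=3, start_flag="<", end_flag=">"):
    """Return boolean features for the token's prefixes and suffixes.

    The start/end flags are assumed placeholders marking the word boundary.
    """
    features = {}
    longest = min(max_len, len(token))
    for k in range(1, longest + 1):
        features[start_flag + token[:k]] = True   # prefix of length k
        features[token[-k:] + end_flag] = True    # suffix of length k
    return features


print(affix_features("food"))
```

With the default max_len of 3, "food" gives the prefixes f, fo, foo and the suffixes d, od, ood, each tagged with its boundary flag so that a prefix and a suffix with the same characters stay distinct.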