Character n-grams - Githubissues

vboton commented 6 years ago

I have training data in which each token is a word, and I've already extracted a few features such as NER, POS and CHUNK for each token. But I run into a problem when I try to extract character n-gram features. Since these features are computed at the character level, I don't know how to represent their values in the attribute-value format. For example, if the current token is "food", its character bigrams are "fo", "oo" and "od". How should I format that feature? By writing something like bigram[0]=fo, oo, od?

kaushikacharya commented 6 years ago

One way of doing this is to use each bigram as a key, with the boolean True as its value:

features['fo'] = True
features['oo'] = True
features['od'] = True
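
A minimal sketch of how those keys could be generated for any token. The helper name `char_ngram_features` is hypothetical and is not part of CRFsuite; it slides a window of width `n` over the token and sets each n-gram it finds to True.

```python
def char_ngram_features(token, n=2):
    """Return a dict mapping each character n-gram of `token` to True.

    Hypothetical helper for illustration; not a CRFsuite function.
    """
    features = {}
    # A token of length L has L - n + 1 n-grams (none if L < n).
    for i in range(len(token) - n + 1):
        features[token[i:i + n]] = True
    return features


features = char_ngram_features("food")
print(features)  # {'fo': True, 'oo': True, 'od': True}
```

Because the n-gram is the dictionary key, a bigram that occurs twice in one token collapses into a single feature. If counts matter, the value could be a number instead of True, but that is a design choice the thread does not discuss.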

If you also want to take the position of the bigram into account, it would be something like:

features['fo_word_prefix'] = True
features['oo_word_middle'] = True
features['od_word_suffix'] = True
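
The position-aware variant can be sketched the same way. The function name `positional_bigram_features` is made up for this example, and it assumes that the first bigram counts as the prefix, the last one as the suffix, and everything in between as the middle. For a two-letter token, where the only bigram is both first and last, this sketch arbitrarily labels it a prefix.

```python
def positional_bigram_features(token):
    """Tag each character bigram with its position in the token.

    Hypothetical helper; the suffix labels follow the naming used in the thread.
    """
    features = {}
    last = len(token) - 2  # start index of the final bigram
    for i in range(last + 1):
        bigram = token[i:i + 2]
        if i == 0:
            position = "word_prefix"
        elif i == last:
            position = "word_suffix"
        else:
            position = "word_middle"
        features[f"{bigram}_{position}"] = True
    return features


print(positional_bigram_features("food"))
```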

For reference, have a look at features['BOS'] = True in the function word2features in https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb
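
A simplified sketch in the spirit of that notebook's word2features, showing bigram keys sitting next to the other per-token features. This is not the notebook's code: the function name `token2features`, the `(word, postag)` tuple layout, and the "bigram=" key prefix are assumptions made here. The prefix is only there so a bigram can never clash with another key such as 'BOS'.

```python
def token2features(sent, i):
    """Build a feature dict for the i-th token of `sent`.

    `sent` is assumed to be a list of (word, postag) tuples.
    """
    word, postag = sent[i]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "postag": postag,
    }
    # One boolean feature per character bigram of the lowercased word.
    lowered = word.lower()
    for j in range(len(lowered) - 1):
        features["bigram=" + lowered[j:j + 2]] = True
    # Sentence-boundary flags, set the same way as features['BOS'] = True.
    if i == 0:
        features["BOS"] = True
    if i == len(sent) - 1:
        features["EOS"] = True
    return features


sent = [("Food", "NN"), ("arrived", "VBD")]
X = [token2features(sent, i) for i in range(len(sent))]
print(X[0])
```

Each sentence then becomes a list of such dicts, one per token.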


usptact commented 6 years ago

Take a look at the Stanford NLP NER features. These features are quite useful in morphologically rich languages like Finnish, Turkish, Russian and others.

You can write the prefixes of the word "food" like this:

^f
^fo
^foo

And the suffixes:

d$#
od$#
ood$#

I don't remember the exact start and end flags but you get the idea.
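
A rough sketch of generating those prefix and suffix features. The name `affix_features` and the `max_len` cutoff are assumptions for illustration, and the `^` and `$#` markers are simply copied from the lists above, so they are placeholders rather than the flags any particular toolkit uses.

```python
def affix_features(token, max_len=3, start="^", end="$#"):
    """Return boolean features for prefixes and suffixes up to `max_len` chars.

    Hypothetical helper; `start` and `end` are placeholder boundary markers.
    """
    features = {}
    for k in range(1, min(max_len, len(token)) + 1):
        features[start + token[:k]] = True   # e.g. ^f, ^fo, ^foo
        features[token[-k:] + end] = True    # e.g. d$#, od$#, ood$#
    return features


print(sorted(affix_features("food")))
```

The markers keep a prefix and a suffix with the same letters apart, so "^fo" and "fo$#" end up as two distinct features.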

chokkan / crfsuite

Character n-grams #103