Doesn't seem to split multiple emojis occurring in sequence without spaces between.

explosion / spacymoji

💙 Emoji handling and meta data for spaCy with custom extension attributes

https://spacy.io

MIT License


Doesn't seem to split multiple emojis occurring in sequence without spaces between. #9

Closed JelledFro closed 3 years ago

JelledFro commented 4 years ago

This is a pretty common way for people to use emojis, so it's unfortunate that, for example, 😄😄 gets treated as one token instead of two.

abushoeb commented 4 years ago

import spacy
from spacymoji import Emoji

nlp = spacy.load("en_core_web_sm")
# spaCy v2-style API: build the component, then add it at the start of the pipeline
emoji = Emoji(nlp, merge_spans=True)
nlp.add_pipe(emoji, first=True)

# case 1
doc = nlp('Tokenize tweets with two emojis 😄😄 without space')
print([token.text for token in doc])

# case 2
doc = nlp('We are all same 👶 👶🏻 👶🏼 👶🏽 👶🏾 👶🏿 but different in skin colors!')
print([token.text for token in doc])

Expected Output

['Tokenize', 'tweets', 'with', 'two', 'emojis', '😄', '😄', 'without', 'space']
['We', 'are', 'all', 'same', '👶', '👶🏻', '👶🏼', '👶🏽', '👶🏾', '👶🏿', 'but', 'different', 'in', 'skin', 'colors', '!']
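To make the two cases easier to compare, here is a small standard-library sketch of what the strings look like at the code point level. It is not how spacymoji works internally; `describe` and `group_skin_tones` are hypothetical helper names used only for illustration. The sketch assumes that a skin-tone variant is a base emoji followed by a Fitzpatrick modifier (U+1F3FB to U+1F3FF), which is why 😄😄 should split into two tokens while 👶🏻 should stay together as one.

```python
import unicodedata

# Fitzpatrick skin-tone modifiers occupy U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = range(0x1F3FB, 0x1F3FF + 1)

TWO_SMILEYS = "\U0001F604\U0001F604"      # 😄😄
BABY_LIGHT = "\U0001F476\U0001F3FB"       # 👶🏻
BABY_DARK = "\U0001F476\U0001F3FF"        # 👶🏿


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def group_skin_tones(text):
    """Attach each skin-tone modifier to the character that precedes it."""
    groups = []
    for ch in text:
        if groups and ord(ch) in SKIN_TONE_MODIFIERS:
            groups[-1] += ch  # modifier joins the previous base emoji
        else:
            groups.append(ch)  # any other character starts a new group
    return groups


print(describe(TWO_SMILEYS))   # two separate emoji code points
print(describe(BABY_LIGHT))    # base emoji followed by a modifier
print(group_skin_tones(TWO_SMILEYS + BABY_LIGHT + BABY_DARK))
```

The grouping here only handles skin-tone modifiers; other multi-code-point emoji (for example sequences joined with a zero-width joiner) would need extra rules.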

polm commented 3 years ago

Sorry for taking a long time to get to this.

The syntax has changed a little with the recent spaCy v3 support, but my output looks just like @abushoeb's.

import spacy
from spacymoji import Emoji

nlp = spacy.blank("en")
nlp.add_pipe("emoji", config={"merge_spans": True}, first=True)
# Older spaCy v2 syntax, kept for comparison:
# emoji = Emoji(nlp, merge_spans=True)
# nlp.add_pipe(emoji, first=True)

# case 1
doc = nlp('Tokenize tweets with two emojis 😄😄 without space')
print([token.text for token in doc])

# case 2
doc = nlp('We are all same 👶 👶🏻 👶🏼 👶🏽 👶🏾 👶🏿 but different in skin colors!')
print([token.text for token in doc])

I'm going to close this because I can't reproduce the problem, but if you are still having trouble with it, do let us know.