explosion / spacymoji

💙 Emoji handling and meta data for spaCy with custom extension attributes
MIT License

Doesn't seem to split multiple emojis occurring in sequence without spaces between. #9

Closed JelledFro closed 3 years ago

JelledFro commented 4 years ago

This is a pretty common way for people to use emojis, so it's unfortunate that, for example, 😄😄 gets treated as one token instead of two.

abushoeb commented 4 years ago
import spacy
from spacymoji import Emoji

nlp = spacy.load("en_core_web_sm")
emoji = Emoji(nlp, merge_spans=True)
nlp.add_pipe(emoji, first=True)

# case 1
doc = nlp('Tokenize tweets with two emojis 😄😄 without space')
print([token.text for token in doc])

# case 2
doc = nlp('We are all same 👶 👶🏻 👶🏼 👶🏽 👶🏾 👶🏿 but different in skin colors!')
print([token.text for token in doc])

Expected Output

['Tokenize', 'tweets', 'with', 'two', 'emojis', '😄', '😄', 'without', 'space']
['We', 'are', 'all', 'same', '👶', '👶🏻', '👶🏼', '👶🏽', '👶🏾', '👶🏿', 'but', 'different', 'in', 'skin', 'colors', '!']
polm commented 3 years ago

Sorry for taking a long time to get to this.

The syntax has changed a little with the recent spaCy v3 support but my output looks just like @abushoeb's.

import spacy
from spacymoji import Emoji

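nlp = spacy.blank("en")
# spaCy v3: add the component by its registered factory name
nlp.add_pipe("emoji", config={"merge_spans": True}, first=True)
# spaCy v2 equivalent, kept for reference:
#emoji = Emoji(nlp, merge_spans=True)
#nlp.add_pipe(emoji, first=True)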

# case 1
doc = nlp('Tokenize tweets with two emojis 😄😄 without space')
print([token.text for token in doc])

# case 2
doc = nlp('We are all same 👶 👶🏻 👶🏼 👶🏽 👶🏾 👶🏿 but different in skin colors!')
print([token.text for token in doc])

I'm going to close this because I can't reproduce the problem, but if you are still having trouble with it do let us know.
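
If the tokens still come out merged, it may help to first check what code points the input string actually contains, since copy-pasted emoji sometimes carry modifiers or joiners that are not visible. The snippet below is only a rough diagnostic sketch using the standard library; `describe` is a hypothetical helper name and is not part of spaCy or spacymoji.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text.

    Hypothetical helper for inspecting input; not part of spaCy or spacymoji.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Two smiling faces with no space between them: two separate code points.
for item in describe("\U0001F604\U0001F604"):
    print(item)

# A baby followed by a skin tone modifier: a base emoji plus one modifier.
for item in describe("\U0001F476\U0001F3FB"):
    print(item)
```

If the first string shows two identical code points and the second shows a base emoji followed by an `EMOJI MODIFIER` character, the input matches the cases above, and any remaining difference is more likely to come from the installed spaCy or spacymoji version.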