explosion / spacymoji

💙 Emoji handling and meta data for spaCy with custom extension attributes
https://spacy.io
MIT License

Doesn't seem to split multiple emojis occurring in sequence without spaces between. #9

Closed JelledFro closed 3 years ago

JelledFro commented 4 years ago

This is a pretty common way for people to use emojis, so it's unfortunate that, for example, 😄😄 gets treated as one token instead of two.

abushoeb commented 4 years ago
import spacy
from spacymoji import Emoji

nlp = spacy.load("en_core_web_sm")
emoji = Emoji(nlp, merge_spans=True)
nlp.add_pipe(emoji, first=True)

# case 1
doc = nlp('Tokenize tweets with two emojis 😄😄 without space')
print([token.text for token in doc])

# case 2
doc = nlp('We are all same 👶 👶🏻 👶🏼 👶🏽 👶🏾 👶🏿 but different in skin colors!')
print([token.text for token in doc])

Expected Output

['Tokenize', 'tweets', 'with', 'two', 'emojis', '😄', '😄', 'without', 'space']
['We', 'are', 'all', 'same', '👶', '👶🏻', '👶🏼', '👶🏽', '👶🏾', '👶🏿', 'but', 'different', 'in', 'skin', 'colors', '!']

polm commented 3 years ago

Sorry for taking a long time to get to this.

The syntax has changed a little with the recent spaCy v3 support but my output looks just like @abushoeb's.

import spacy
from spacymoji import Emoji

nlp = spacy.blank("en")
nlp.add_pipe("emoji", config={"merge_spans": True}, first=True)
# spaCy v2 syntax, kept for comparison:
# emoji = Emoji(nlp, merge_spans=True)
# nlp.add_pipe(emoji, first=True)

# case 1
doc = nlp('Tokenize tweets with two emojis 😄😄 without space')
print([token.text for token in doc])

# case 2
doc = nlp('We are all same 👶 👶🏻 👶🏼 👶🏽 👶🏾 👶🏿 but different in skin colors!')
print([token.text for token in doc])

I'm going to close this because I can't reproduce the problem, but if you are still having trouble with it do let us know.
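
For anyone wondering why case 2 expects '👶🏻' to stay as a single token: a skin-toned emoji is two code points, the base emoji followed by a modifier in the range U+1F3FB to U+1F3FF. Below is a minimal standard-library sketch of splitting a run of adjacent emoji while keeping each modifier attached to its base. It is only an illustration and not how spaCy or spacymoji tokenizes; `split_emoji_run` and `SKIN_TONE_MODIFIERS` are hypothetical names, and the sketch assumes the input contains nothing but simple emoji (no ZWJ sequences, flags, or variation selectors).

```python
# Illustrative sketch only, NOT spaCy's or spacymoji's implementation.
# split_emoji_run and SKIN_TONE_MODIFIERS are hypothetical names.

# Emoji skin-tone modifiers occupy U+1F3FB through U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}


def split_emoji_run(run):
    """Split a run of adjacent emoji into one string per emoji.

    A skin-tone modifier is appended to the emoji before it, so a
    base + modifier pair comes back as a single two-code-point string.
    Assumes simple emoji only (no ZWJ sequences or flags).
    """
    pieces = []
    for ch in run:
        if ch in SKIN_TONE_MODIFIERS and pieces:
            pieces[-1] += ch  # attach the modifier to its base emoji
        else:
            pieces.append(ch)  # start a new emoji
    return pieces


print(split_emoji_run("😄😄"))
print(split_emoji_run("👶👶🏻👶🏿"))
```

The first call should give two separate strings and the second three, with the modifiers still attached to their base emoji.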