import spacy
from spacymoji import Emoji
nlp = spacy.load("en_core_web_sm")
emoji = Emoji(nlp, merge_spans=True)
nlp.add_pipe(emoji, first=True)
# case 1
doc = nlp('Tokenize tweets with two emojis 😂😂 without space')
print([token.text for token in doc])
# case 2
doc = nlp('We are all same 👶 👶🏻 👶🏼 👶🏽 👶🏾 👶🏿 but different in skin colors!')
print([token.text for token in doc])
Expected Output
['Tokenize', 'tweets', 'with', 'two', 'emojis', '😂', '😂', 'without', 'space']
['We', 'are', 'all', 'same', '👶', '👶🏻', '👶🏼', '👶🏽', '👶🏾', '👶🏿', 'but', 'different', 'in', 'skin', 'colors', '!']
Sorry for taking a long time to get to this.
The syntax has changed a little with the recent spaCy v3 support but my output looks just like @abushoeb's.
import spacy
from spacymoji import Emoji
nlp = spacy.blank("en")
nlp.add_pipe("emoji", config={"merge_spans": True}, first=True)
#emoji = Emoji(nlp, merge_spans=True)
#nlp.add_pipe(emoji, first=True)
# case 1
doc = nlp('Tokenize tweets with two emojis 😂😂 without space')
print([token.text for token in doc])
# case 2
doc = nlp('We are all same 👶 👶🏻 👶🏼 👶🏽 👶🏾 👶🏿 but different in skin colors!')
print([token.text for token in doc])
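As a quick sanity check, the extension attributes spacymoji registers (doc._.has_emoji, token._.is_emoji and token._.emoji_desc, per its README) can be printed as well; a minimal sketch, reusing the doc from case 2:
# confirm the component ran and list the tokens it flagged as emoji
print(doc._.has_emoji)
print([(token.text, token._.emoji_desc) for token in doc if token._.is_emoji])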
I'm going to close this because I can't reproduce the problem, but if you are still having trouble with it do let us know.
This is a pretty common way for people to use emojis, so it's unfortunate that, for example, 😂😂 gets treated as one token instead of two.
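If splitting adjacent emojis is the goal, one possible workaround (a sketch, not something confirmed in this thread) is to add an emoji character class to the tokenizer's infix patterns so the tokenizer splits them before spacymoji runs; the character range below is an assumption and only covers part of the emoji blocks:
import spacy
from spacy.util import compile_infix_regex
from spacymoji import Emoji

nlp = spacy.blank("en")
nlp.add_pipe("emoji", config={"merge_spans": True}, first=True)

# Treat emoji code points as infixes so adjacent emojis like "😂😂"
# are split into separate tokens by the tokenizer itself.
# This range is an assumption; widen it if your data needs more coverage.
emoji_infix = r"[\u2600-\u27BF\U0001F300-\U0001FAFF]"
infixes = list(nlp.Defaults.infixes) + [emoji_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp('Tokenize tweets with two emojis 😂😂 without space')
print([token.text for token in doc])
# expected: ['Tokenize', 'tweets', 'with', 'two', 'emojis', '😂', '😂', 'without', 'space']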