mikkokotila opened 12 months ago
Could you please provide a runnable example to show this? It's possible the tokenizer is merging those two words into a single token, or Scattertext ended up aligning them in the labeling phase.
But please submit a reproducible example of where this occurs. Otherwise, there's nothing I can do to look into this.
I just wanted to first know whether you're aware of any way something like that could happen.
Some brief background: I'm very familiar with the tokenizer I'm using (botok), and am 100% sure it is not the cause of this, as I've stared at material tokenized by it for thousands of hours.
Here is the data I'm using:
Here is the one-liner to read it to ensure consistency with the way I have it:
open('tibetan_strings.txt', 'r').readlines()[0].split(' ')
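One note on that one-liner: readlines()[0] keeps the trailing newline, so the final token comes back with a '\n' attached, which could itself make a token look different from what the tokenizer emits. An equivalent read with the newline stripped (a two-token sample is written inline here so the snippet is self-contained; in the issue, tibetan_strings.txt is the attached data file):

```python
# Write a small sample so this runs standalone; in the issue,
# tibetan_strings.txt is the attached data file.
with open('tibetan_strings.txt', 'w', encoding='utf-8') as f:
    f.write('བྱང་ཆུབ་ སེམས་དཔ\n')

# Same read as the one-liner, but with the trailing newline stripped
# so the last token does not carry a '\n'.
with open('tibetan_strings.txt', 'r', encoding='utf-8') as f:
    tokens = f.readline().strip().split(' ')
```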
Here is the wrapper for the tokenizer I'm using, inspired by the chinese_nlp example:
import re
from botok import WordTokenizer
tokenizer = WordTokenizer()
class Tok(object):
    def __init__(self, pos, lem, orth, low, ent_type, tag):
        self.pos_ = pos
        self.lemma_ = lem
        self.lower_ = low
        self.orth_ = orth
        self.ent_type_ = ent_type
        self.tag_ = tag

    def __repr__(self): return self.orth_
    def __str__(self): return self.orth_
class Doc(object):
    def __init__(self, sents, raw):
        self.sents = sents
        self.string = raw
        self.text = raw

    def __str__(self):
        return ' '.join(str(sent) for sent in self.sents)

    def __repr__(self):
        return self.__str__()

    def __iter__(self):
        for sent in self.sents:
            for tok in sent:
                yield tok
class Sentence(object):
    def __init__(self, toks, raw):
        self.toks = toks
        self.raw = raw

    def __iter__(self):
        for tok in self.toks:
            yield tok

    def __str__(self):
        return ' '.join([str(tok) for tok in self.toks])

    def __repr__(self):
        return self.raw
import botok
punct_list = botok.utils.create_punctuation_list()
punct_str = "|".join(map(re.escape, punct_list))  # escape regex metacharacters
punct_re = re.compile(r'^({})+$'.format(punct_str))  # one or more punctuation marks, nothing else
def tibetan_nlp(doc, entity_type=None, tag_type=None):
    toks = []
    for tok_obj in tokenizer.tokenize(doc):
        tok = tok_obj['text_unaffixed']
        pos = tok_obj['pos']
        if tok.strip() == '':
            pos = 'SPACE'
        elif punct_re.match(tok):
            pos = 'PUNCT'
        # Tok's signature is (pos, lem, orth, low, ...), so the surface
        # form goes before the lowercased form.
        token = Tok(pos,
                    tok_obj['lemma'],
                    tok,
                    tok.lower(),
                    ent_type='' if entity_type is None else entity_type.get(tok, ''),
                    tag='' if tag_type is None else tag_type.get(tok, ''))
        toks.append(token)
    return Doc([Sentence(toks, doc)], doc)
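To sanity-check the punctuation regex built above, here is a self-contained sketch with a hand-written stand-in punctuation list (in the wrapper the list comes from botok's helper, which I haven't verified here):

```python
import re

# Stand-in punctuation list (shad and double shad); in the wrapper
# above this list comes from botok.
punct_list = ['།', '༎']
punct_str = '|'.join(map(re.escape, punct_list))
punct_re = re.compile(r'^({})+$'.format(punct_str))

print(bool(punct_re.match('།')))    # True: a single punctuation mark
print(bool(punct_re.match('།།')))   # True: a run of punctuation
print(bool(punct_re.match('བྱང་')))  # False: ordinary text
```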
I realize you have a lot of experience with this tokenizer, but have you programmatically checked its output on this file to verify that the token in question isn't there?
Yes
I have many cases where two tokens such as བྱང་ཆུབ་ and སེམས་དཔ become a single token in the scatterplot. Is this something that Scattertext is doing? The tokenizer I'm using does not do that.
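One way to rule the tokenizer in or out programmatically is to check whether the merged string ever appears as a single token in its output. A minimal sketch (the token list here is a stand-in; with botok it would be [t['text_unaffixed'] for t in tokenizer.tokenize(raw_text)]):

```python
def merged_token_present(tokens, left, right):
    """True if any single token equals the concatenation of two
    tokens that the tokenizer should keep separate."""
    merged = left + right
    return any(tok == merged for tok in tokens)

# Stand-in for the real tokenizer output.
tokens = ['བྱང་ཆུབ་', 'སེམས་དཔ']
print(merged_token_present(tokens, 'བྱང་ཆུབ་', 'སེམས་དཔ'))  # False
```

If this returns False over the whole file while the merged form still shows up in the plot, the merge is happening downstream of the tokenizer.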