JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.
Apache License 2.0

Does ScatterText somehow combine tokens? #132

Open · mikkokotila opened this issue 12 months ago

mikkokotila commented 12 months ago

I have many cases where two tokens, such as བྱང་ཆུབ་ and སེམས་དཔ, become a single term in the scatterplot. Is this something ScatterText is doing? The tokenizer I'm using does not do that.

JasonKessler commented 12 months ago

Could you please provide a runnable example that shows this? It's possible the tokenizer is merging those two words into a single token, or that Scattertext combined them during the labeling phase.

JasonKessler commented 12 months ago

But please submit a reproducible example of where this occurs. Otherwise, there's nothing I can do to look into this.

mikkokotila commented 12 months ago

I just wanted to first know whether there is any way you are aware of that something like this could happen.

For background: I'm very familiar with the tokenizer I'm using (botok), and am 100% sure it is not the cause of this, as I've stared at material tokenized by it for thousands of hours.

Here is the data I'm using:

tibetan_strings.txt

Here is the one-liner to read it, to ensure consistency with the way I have it:

open('tibetan_strings.txt', 'r').readlines()[0].split(' ')

Here is the wrapper for the tokenizer I'm using, inspired by the chinese_nlp example:

import re

import bokit
from botok import WordTokenizer

tokenizer = WordTokenizer()

# Minimal stand-ins for the spaCy Token/Doc/Sentence objects whose
# attributes Scattertext reads.
class Tok(object):

    def __init__(self, pos, lem, orth, low, ent_type, tag):
        self.pos_ = pos
        self.lemma_ = lem
        self.lower_ = low
        self.orth_ = orth
        self.ent_type_ = ent_type
        self.tag_ = tag

    def __repr__(self): return self.orth_

    def __str__(self): return self.orth_

class Doc(object):

    def __init__(self, sents, raw):
        self.sents = sents
        self.string = raw
        self.text = raw

    def __str__(self):
        return ' '.join(str(sent) for sent in self.sents)

    def __repr__(self):
        return self.__str__()

    def __iter__(self):
        for sent in self.sents:
            for tok in sent:
                yield tok

class Sentence(object):

    def __init__(self, toks, raw):
        self.toks = toks
        self.raw = raw

    def __iter__(self):
        for tok in self.toks:
            yield tok

    def __str__(self):
        return ' '.join([str(tok) for tok in self.toks])

    def __repr__(self):
        return self.raw

punct_list = bokit.utils.create_punctuation_list()

punct_str = "|".join(map(re.escape, punct_list))  # Escape regex special characters
punct_re = re.compile(r'^({})+$'.format(punct_str))  # Matches tokens made entirely of punctuation

def tibetan_nlp(doc, entity_type=None, tag_type=None):

    toks = []

    for tok_obj in tokenizer.tokenize(doc):
        tok = tok_obj['text_unaffixed']
        pos = tok_obj['pos']

        if tok.strip() == '':
            pos = 'SPACE'
        elif punct_re.match(tok):
            pos = 'PUNCT'

        # Tok takes orth before low: pass the surface form first, then the
        # lowercased form (a no-op for Tibetan script).
        token = Tok(pos,
                    tok_obj['lemma'],
                    tok,
                    tok.lower(),
                    ent_type='' if entity_type is None else entity_type.get(tok, ''),
                    tag='' if tag_type is None else tag_type.get(tok, ''))

        toks.append(token)

    return Doc([Sentence(toks, doc)], doc)
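
For completeness, here is a sketch of how a wrapper like this gets wired into Scattertext, following the chinese_nlp example. The DataFrame contents and category labels below are placeholders, not my real data:

import pandas as pd
import scattertext as st

# Placeholder two-category corpus; the real documents come from tibetan_strings.txt.
df = pd.DataFrame({
    'text': ['བྱང་ཆུབ་ སེམས་དཔ', 'སེམས་དཔ བྱང་ཆུབ་'],
    'category': ['a', 'b'],
})

corpus = st.CorpusFromPandas(df,
                             category_col='category',
                             text_col='text',
                             nlp=tibetan_nlp).build()

html = st.produce_scattertext_explorer(corpus,
                                       category='a',
                                       category_name='A',
                                       not_category_name='B',
                                       asian_mode=True)
open('tibetan_scattertext.html', 'w').write(html)
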
JasonKessler commented 12 months ago

I realize you have a lot of experience with this tokenizer, but have you programmatically checked its output on this file to verify that the merged token in question isn't there?
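
Something along these lines would settle it (a sketch assuming the tibetan_nlp wrapper above; the merged string here is illustrative):

# Does the tokenizer itself ever emit the combined form?
text = open('tibetan_strings.txt', 'r').readlines()[0]
tokens = [tok.orth_ for tok in tibetan_nlp(text)]
merged = 'བྱང་ཆུབ་སེམས་དཔ'  # the combined form seen in the scatterplot
print(merged in tokens)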

mikkokotila commented 11 months ago

> I realize you have a lot of experience with this tokenizer, but have you programmatically checked its output on this file to verify that the merged token in question isn't there?

Yes