DARIAH-DE / TopicsExplorer

Explore your own text collection with a topic model – without prior knowledge.
https://dariah-de.github.io/TopicsExplorer
Apache License 2.0

n_words: 0 with CJK texts #133

Open · yokyoku-taikan opened this issue 1 year ago

yokyoku-taikan commented 1 year ago

Hello,

I am currently trying to use TopicsExplorer with a corpus of CJK texts (all of the .txt files are UTF-8 encoded). After I click on "Train topic model" the program outputs the following error message:

Closing connection to database...
Fetching stopwords...
Cleaning corpus...
Connecting to database...
Insert stopwords into database...
Closing connection to database...
Successfully preprocessed data.
Connecting to database...
Insert token frequencies into database...
Closing connection to database...
Creating topic model...
n_documents: 16
vocab_size: 0
n_words: 0
n_topics: 10
n_iter: 100
all zero row in document-term matrix found
ERROR: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
Redirect to error page...

It seems to me as though TopicsExplorer is unable to recognise CJK tokens/words (cf. vocab_size: 0; n_words: 0). Is there a workaround for this problem?

Thank you in advance!

severinsimmler commented 12 months ago

Hi @yokyoku-taikan,

Thank you for reporting this. I'm afraid the regular expression we use to tokenize the text is not ideal for CJK texts. You could try adding a token_pattern keyword argument here with your own regex that overrides the default pattern \p{L}+\p{Connector_Punctuation}?\p{L}+:

https://github.com/DARIAH-DE/TopicsExplorer/blob/8d04d3c0e10e25c9777cfe5addaf2cc79d92236a/topicsexplorer/utils.py#L117

I'm not sure this even makes sense for CJK texts, though, since they are far more complex to tokenize than e.g. English (at least as far as I know).
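For a quick illustration of what a pattern change does (just a sketch using the third-party regex package, not TopicsExplorer code; the CJK pattern below is a crude character-level fallback, not a substitute for real word segmentation):

import regex

text = "私は猫が好きです"

# The default pattern treats any unbroken run of letters as one token,
# so unsegmented CJK text collapses into very few, very long "words".
print(regex.findall(r"\p{L}+\p{Connector_Punctuation}?\p{L}+", text))
# ['私は猫が好きです']

# A crude alternative: emit each Han/Hiragana/Katakana/Hangul character
# as its own token and keep \p{L}+ runs for other scripts.
print(regex.findall(r"[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]|\p{L}+", text))
# ['私', 'は', '猫', 'が', '好', 'き', 'で', 'す']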

Alternatively, you could try overriding the tokenization of the Document class that is used for further processing, maybe something like:

import jieba  # one option for CJK word segmentation: https://github.com/fxsjy/jieba

from cophi.text import utils
from cophi.text.model import Document as _Document


def custom_cjk_tokenization(text):
    # Segment the text into word-like units with jieba (works for Chinese;
    # Japanese or Korean would need a different segmenter).
    return jieba.lcut(text)


class Document(_Document):
    def __init__(
        self,
        text,
        title=None,
        lowercase=True,
        n=None,
        maximum=None,
    ):
        self.text = text
        self.title = title
        self.lowercase = lowercase
        if n is not None and n < 1:
            raise ValueError(f"Arg 'n' must be greater than 0, not {n}.")
        self.n = n
        self.maximum = maximum
        # Use the custom CJK tokenizer instead of the default regex tokenization.
        self.tokens = custom_cjk_tokenization(text)
        if self.lowercase:
            self.tokens = utils.lowercase_tokens(self.tokens)
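
Hypothetical usage, just to show the subclass in isolation (assuming jieba is installed and the input is Chinese):

doc = Document("我喜欢自然语言处理")
print(doc.tokens)  # word-like segments produced by jieba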

Note that these changes require running the whole application locally from source rather than using the provided executables.