yokyoku-taikan opened this issue 1 year ago
Hi @yokyoku-taikan,
thank you for reporting this. I'm afraid the regular expression we use to tokenize the text is not ideal for CJK texts. You could try to pass a `token_pattern` keyword argument here with your own regex that overrides the default pattern `\p{L}+\p{Connector_Punctuation}?\p{L}+`. That is, if a regex-based approach even makes sense for CJK texts, since they are much harder to tokenize than e.g. English (at least as far as I know).
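For illustration, a pattern along these lines could treat each CJK ideograph as its own token while leaving Latin words intact. This is only a sketch using explicit Unicode ranges with the stdlib `re` module (cophi's default pattern uses `\p{...}` classes from the third-party `regex` package), so treat the chosen ranges as an assumption, not a vetted tokenizer:

```python
import re

# Sketch only: CJK Unified Ideographs (U+4E00..U+9FFF) become
# single-character tokens; runs of Latin letters stay whole words.
# Real CJK segmentation is much more involved than this.
cjk_token_pattern = r"[\u4e00-\u9fff]|[A-Za-z]+"

tokens = re.findall(cjk_token_pattern, "自然言語処理 and English words")
print(tokens)  # each ideograph separately, plus the Latin words
```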
Alternatively, you could try overriding the tokenization of the `Document` class that is used for further processing, maybe something like:
```python
from cophi.text import utils
from cophi.text.model import Document as _Document


def custom_cjk_tokenization(text):
    # maybe use https://github.com/fxsjy/jieba
    ...


class Document(_Document):
    def __init__(
        self,
        text,
        title=None,
        lowercase=True,
        n=None,
        maximum=None,
    ):
        self.text = text
        self.title = title
        self.lowercase = lowercase
        if n is not None and n < 1:
            raise ValueError(f"Arg 'n' must be greater than 0, not {n}.")
        self.n = n
        self.maximum = maximum
        self.tokens = custom_cjk_tokenization(text)
        if self.lowercase:
            self.tokens = utils.lowercase_tokens(self.tokens)
```
Note that these changes require running the whole application locally rather than using the provided executables.
Hello,
I am currently trying to use TopicsExplorer with a corpus of CJK texts (all of the .txt files are UTF-8 encoded). After I click on "Train topic model" the program outputs the following error message:
```
Closing connection to database...
Fetching stopwords...
Cleaning corpus...
Connecting to database...
Insert stopwords into database...
Closing connection to database...
Successfully preprocessed data.
Connecting to database...
Insert token frequencies into database...
Closing connection to database...
Creating topic model...
n_documents: 16
vocab_size: 0
n_words: 0
n_topics: 10
n_iter: 100
all zero row in document-term matrix found
ERROR: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
Redirect to error page...
```
It seems to me as though TopicsExplorer is unable to recognise CJK tokens/words (cf. `vocab_size: 0`, `n_words: 0`). Is there a workaround for this problem?
Thank you in advance!