bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License
548 stars 62 forks

Inference against a corpus is segfaulting #181

Open erip opened 1 year ago

erip commented 1 year ago

I am migrating away from model.make_doc to tp.utils.Corpus and am finding that using Corpus segfaults. My tiny repro is here:

#!/usr/bin/env python3

import time

import numpy as np
import tomotopy as tp

# Workaround for `str.split` received unknown kwarg user_data
class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()

def get_highest_lda_list(model, N, docs):
    corpus = [model.make_doc(doc.split()) for doc in docs]
    topic_dist, ll = model.infer(corpus)
    k = np.argmax(topic_dist, axis=1)
    return [" ".join(e[0] for e in model.get_topic_words(k_, top_n=N)) for k_ in k]

def get_highest_lda_corpus(model, N, docs):
    corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
    corpus.process(doc for doc in docs)
    topic_dist, ll = model.infer(corpus)
    k = np.argmax([doc.get_topic_dist() for doc in topic_dist], axis=1)
    return [" ".join(e[0] for e in model.get_topic_words(k_, top_n=N)) for k_ in k]

if __name__ == "__main__":
    docs = [line.strip() for line in open('10_line_pretokenized_corpus.tsv')]
    lda = tp.LDAModel.load('tm_model.bin')
    N = 10
    t0 = time.time()
    list_res = get_highest_lda_list(lda, N, docs)
    print(f"Took {time.time() - t0} seconds (list)")
    t0 = time.time()
    corpus_res = get_highest_lda_corpus(lda, N, docs)
    print(f"Took {time.time() - t0} seconds (corpus)")
    assert all(e == f for e, f in zip(corpus_res, list_res))

When I run this, I see:

Took 19.61503529548645 seconds (list)
Segmentation fault (core dumped)

Running this with catchsegv shows these relevant lines:

/usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f171bd43210]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_ZNSt6vectorIjSaIjEE12emplace_backIJRjEEEvDpOT_+0x7c)[0x7f16d662705c]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_Z10makeCorpusP16TopicModelObjectP7_objectS2_+0x681)[0x7f16d6db8f51]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_Z9LDA_inferP16TopicModelObjectP7_objectS2_+0x25a)[0x7f16d6d71b8a]

which seems to point here... maybe d.get() is null?

erip commented 1 year ago

It seems like my WSTok is the issue: it doesn't meet the expected interface (__call__ should return tuples of (tok, start, stop), not bare strings). If I use tp.utils.SimpleTokenizer(pattern=r"\w+"), it seems to be OK... this is somewhat unexpected, though, so maybe the documentation could be slightly improved.
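For reference, a position-aware whitespace tokenizer matching the (tok, start, stop) shape described above might look like the sketch below. This is my own illustration, not code from tomotopy, and the exact tuple semantics tomotopy expects (e.g. whether the third element is an end offset or a length) should be checked against the tomotopy documentation:

```python
import re

class PositionedWSTok:
    """Whitespace tokenizer yielding (token, start, stop) tuples.

    A sketch of the tuple-yielding interface described above, where
    start/stop are character offsets into the raw string.
    """
    def __call__(self, raw, **kwargs):
        # \S+ matches maximal runs of non-whitespace characters
        for m in re.finditer(r"\S+", raw):
            yield m.group(0), m.start(), m.end()

tok = PositionedWSTok()
print(list(tok("this is test")))
# [('this', 0, 4), ('is', 5, 7), ('test', 8, 12)]
```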

bab2min commented 1 year ago

Hi @erip, could you share part of the file 10_line_pretokenized_corpus.tsv so I can reproduce this? I cannot reproduce a similar error with the sample text I have, so it is hard to determine the cause. Sharing the file where the problem occurs would be a great help in finding it.

erip commented 1 year ago

@bab2min are you using the WSTok here? It should cause the error

bab2min commented 1 year ago

Oops, sorry @erip, I forgot this thread entirely. Yes, I used WSTok and it worked well. Since I don't have tm_model.bin and 10_line_pretokenized_corpus.tsv, I ran the code, modified like:

class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()

docs = ["this is test text", "this is another text", "somewhat long text...."]

corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
corpus.process(doc for doc in docs)
for doc in corpus:
    print(doc)
# it will print
# <tomotopy.Document with words="this is test text">
# <tomotopy.Document with words="this is another text">
# <tomotopy.Document with words="somewhat long text....">

I suspect that some lines in 10_line_pretokenized_corpus.tsv corrupt state in the inner C++ code.
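If that's the case, one way to narrow it down might be to filter the input before handing it to Corpus, e.g. dropping empty or whitespace-only lines that would produce zero-token documents. This is a debugging sketch of my own, not a confirmed fix; `clean_lines` is a hypothetical helper:

```python
def clean_lines(lines):
    """Split input into lines kept for Corpus.process and the indices of
    dropped lines (empty or whitespace-only), which are a plausible but
    unconfirmed trigger for crashes deeper in the C++ layer."""
    kept, dropped = [], []
    for i, line in enumerate(lines):
        if line.strip():
            kept.append(line)
        else:
            dropped.append(i)
    return kept, dropped

lines = ["this is test text", "", "   ", "somewhat long text...."]
kept, dropped = clean_lines(lines)
print(kept)     # ['this is test text', 'somewhat long text....']
print(dropped)  # [1, 2]
```

If the segfault disappears after filtering, the dropped line indices point at the offending rows of the TSV.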