bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Fatal Python error: Segmentation fault when calling `HLDAModel.make_doc().get_topics()` #140

Closed: dennishylau closed this issue 2 years ago

dennishylau commented 2 years ago

Hi, thank you for your great work!

I have been experimenting with the HLDA model, and whenever I try to get the topics of a document, my Notebook kernel crashes.

System: macOS 10.15.7, Python 3.9.4

Code:

import tomotopy as tp

hlda = tp.HLDAModel.load(some_model)
article: list[str] = Article.get(xxxx).doc
doc = hlda.make_doc(article)
doc.get_topics(top_n=10)  # the kernel crashes on this call

Stack trace:

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

VM Regions Near 0:
--> 
    __TEXT                 0000000105a9a000-0000000105d12000 [ 2528K] r-x/r-x SM=COW  /Users/USER/*/*.9

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib          0x00007fff7358233a __pthread_kill + 10
1   libsystem_pthread.dylib         0x00007fff7363ee60 pthread_kill + 430
2   libsystem_c.dylib               0x00007fff7349993e raise + 26
3   libsystem_platform.dylib        0x00007fff736335fd _sigtramp + 29
4   ???                             000000000000000000 0 + 0
5   _tomotopy_avx2.cpython-39-darwin.so 0x00000001064c5a4e tomoto::TopicModel<Eigen::Rand::ParallelRandomEngineAdaptor<unsigned int, Eigen::Rand::MersenneTwister<long long vector[4], 312, 156, 31, 13043109905998158313ull, 29, 6148914691236517205ull, 17, 8202884508482404352ull, 37, 18444473444759240704ull, 43, 6364136223846793005ull>, 8>, 4ul, tomoto::IHLDAModel, tomoto::HLDAModel<(tomoto::TermWeight)0, Eigen::Rand::ParallelRandomEngineAdaptor<unsigned int, Eigen::Rand::MersenneTwister<long long vector[4], 312, 156, 31, 13043109905998158313ull, 29, 6148914691236517205ull, 17, 8202884508482404352ull, 37, 18444473444759240704ull, 43, 6364136223846793005ull>, 8>, tomoto::IHLDAModel, void, tomoto::DocumentHLDA<(tomoto::TermWeight)0>, tomoto::ModelStateHLDA<(tomoto::TermWeight)0> >, tomoto::DocumentHLDA<(tomoto::TermWeight)0>, tomoto::ModelStateHLDA<(tomoto::TermWeight)0> >::getTopicsByDoc(tomoto::DocumentBase const*, bool) const + 14
6   _tomotopy_avx2.cpython-39-darwin.so 0x00000001064c5a83 tomoto::TopicModel<Eigen::Rand::ParallelRandomEngineAdaptor<unsigned int, Eigen::Rand::MersenneTwister<long long vector[4], 312, 156, 31, 13043109905998158313ull, 29, 6148914691236517205ull, 17, 8202884508482404352ull, 37, 18444473444759240704ull, 43, 6364136223846793005ull>, 8>, 4ul, tomoto::IHLDAModel, tomoto::HLDAModel<(tomoto::TermWeight)0, Eigen::Rand::ParallelRandomEngineAdaptor<unsigned int, Eigen::Rand::MersenneTwister<long long vector[4], 312, 156, 31, 13043109905998158313ull, 29, 6148914691236517205ull, 17, 8202884508482404352ull, 37, 18444473444759240704ull, 43, 6364136223846793005ull>, 8>, tomoto::IHLDAModel, void, tomoto::DocumentHLDA<(tomoto::TermWeight)0>, tomoto::ModelStateHLDA<(tomoto::TermWeight)0> >, tomoto::DocumentHLDA<(tomoto::TermWeight)0>, tomoto::ModelStateHLDA<(tomoto::TermWeight)0> >::getTopicsByDocSorted(tomoto::DocumentBase const*, unsigned long) const + 35
7   _tomotopy_avx2.cpython-39-darwin.so 0x0000000106894f89 Document_getTopics(DocumentObject*, _object*, _object*) + 185
8   python                          0x0000000105b21a45 cfunction_call + 69
9   python                          0x0000000105ae0757 _PyObject_MakeTpCall + 375
10  python                          0x0000000105bc8340 call_function + 624
11  python                          0x0000000105bc5452 _PyEval_EvalFrameDefault + 28002
12  python                          0x0000000105bc9134 _PyEval_EvalCode + 2852
13  python                          0x0000000105bbe620 PyEval_EvalCode + 64
14  python                          0x0000000105c0e90d pyrun_file + 333
15  python                          0x0000000105c0c9c9 PyRun_SimpleFileExFlags + 729
16  python                          0x0000000105c2b973 Py_RunMain + 2067
17  python                          0x0000000105c2bea3 pymain_main + 403
18  python                          0x0000000105c2befb Py_BytesMain + 43
19  libdyld.dylib                   0x00007fff7343acc9 start + 1

Please let me know if there is anything else I can do to help with debugging, thank you.

Update 1: get_topic_dist() also crashes.

bab2min commented 2 years ago

Hi @dennishylau, thank you for reporting this bug. make_doc() creates a document without any topic assignments, so calling get_topics() at that point will not give you a proper result. You need to call infer() to estimate the document's topic distribution before calling get_topics().

doc = hlda.make_doc(article)
hlda.infer(doc) # doc should be inferred first
doc.get_topics(top_n=10)

I'll fix the crashes in get_topics() and get_topic_dist() and add a warning message that tells users to call infer() first.
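
To make the required call order harder to get wrong, the three steps can be wrapped in one function. The following is only a minimal sketch: `infer_topics` is a hypothetical helper name and not part of tomotopy, and `model` is assumed to be an already trained or loaded tomotopy model such as the `hlda` object above. It relies only on the calls shown in this thread: make_doc(), infer() and get_topics().

```python
from typing import Any, List, Sequence, Tuple


def infer_topics(model: Any, words: Sequence[str], top_n: int = 10) -> List[Tuple[int, float]]:
    """Hypothetical helper (not a tomotopy API): make_doc -> infer -> get_topics.

    `model` is assumed to be a trained or loaded tomotopy model, and `words`
    the tokens of one unseen document.
    """
    # make_doc() only builds the document; it has no topic assignments yet.
    doc = model.make_doc(list(words))
    # infer() estimates the topic distribution and stores it on `doc`.
    model.infer(doc)
    # Only now is it safe to read the topics from the document.
    return doc.get_topics(top_n=top_n)
```

With the objects from the original report, this would be called as `infer_topics(hlda, article, top_n=10)`.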

dennishylau commented 2 years ago

Hi @bab2min, wow, thank you for the prompt reply! That makes perfect sense; what a silly mistake on my end. Closing this issue now. Have a nice day!