bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Question: Calculating Coherence. What words are expected as Targets? #121

Open hhagedorn opened 3 years ago

hhagedorn commented 3 years ago

Hello @bab2min,

I am trying to use your implementation of the C_v coherence measure to evaluate both topic models that are included in tomotopy and some that are not. To do so, I generated a tomotopy.utils.Corpus to initialise the Coherence class.

But I am a little confused by the targets parameter. Does it expect the whole vocabulary of the Corpus (or at least the vocabulary that is relevant for the coherence, e.g. all words from LDAModel.used_vocabs), or only the set of words that I later want to check for coherence (e.g. all words in my to-be-evaluated topics)?

I am not exactly sure how to understand the sentence "Only words that are provided as targets are included in probability estimation."

Thank you already in advance!

bab2min commented 3 years ago

Hi @hhagedorn, sorry for the confusion caused by the unclear documentation. For targets, the latter is correct. In other words, you just pass the set of words in the to-be-evaluated topics as targets.

The reason targets exists is computational efficiency. Calculating co-occurrences for all words in LDAModel.used_vocabs consumes a lot of time and memory. If you know the words to be evaluated for coherence, Coherence can calculate co-occurrences for only those words instead of all of them. For this purpose, Coherence provides the targets argument.
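Editor's note: the saving described above can be illustrated with a small pure-Python sketch. It is not tomotopy's implementation; the helper names (collect_targets, count_cooccurrences, max_pairs) and the toy documents are hypothetical, and it only counts document-level co-occurrence, which is an assumption made for brevity.

```python
from collections import Counter
from itertools import combinations


def collect_targets(topics):
    """Union of all words that appear in the topics to be evaluated."""
    return {word for topic in topics for word in topic}


def count_cooccurrences(docs, targets):
    """Count, per document, which target words and target pairs occur.

    Words outside `targets` are ignored, so the pair table never grows
    beyond the pairs that can actually be queried later.
    """
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        present = sorted(set(doc) & targets)  # sorted -> canonical pair order
        word_df.update(present)
        pair_df.update(combinations(present, 2))
    return word_df, pair_df


def max_pairs(n_words):
    """Upper bound on the number of unordered word pairs."""
    return n_words * (n_words - 1) // 2


# Toy data (hypothetical): three documents, two topics to evaluate.
docs = [["cat", "dog", "fish"], ["cat", "dog"], ["stock", "market", "cat"]]
topics = [["cat", "dog"], ["stock", "market"]]

targets = collect_targets(topics)
word_df, pair_df = count_cooccurrences(docs, targets)
```

Under these assumptions, "fish" never enters either table because it is not a target. The pair bound also shows the scale involved: a 20,000-word vocabulary allows up to max_pairs(20000) pairs, whereas 2 topics of 10 words each allow at most max_pairs(20).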

I'll supplement this explanation to the documentation in the next update. Thank you for your good question!

benreaves commented 2 years ago

Hello @bab2min - thank you for the time you put into maintaining tomotopy!

I'm having some trouble that might be similar to @hhagedorn's: I'm calculating the c_v coherence on a model that had earlier been trained and saved to disk, like this:

import tomotopy.coherence  # importing the submodule also binds the top-level tomotopy name
mdl = tomotopy.LDAModel.load("saved_model.bin")
coh = tomotopy.coherence.Coherence(mdl, coherence='c_v')

On the second line, I'm not specifying a targets value, only the model. I understand it might be slow because of the large number of targets (about 20000 unique tokens), but my concern is that it sometimes crashes or hangs, even with the same model on the same machine. If I specify u_mass, then it calculates the coherence within a few minutes, but c_v stops for hours. Sometimes it crashes with just "Killed" and sometimes I see bad_alloc. So I suppose it's deep inside the coherence. I ran it under mprof (memory profiler) and it uses only about 1.1GB, nowhere near the memory limit. I get different behavior at different times on the same model, same machine.

tomotopy.isa returns 'avx2' and I am using an Intel i7-11800H, Python 3.8.10, Ubuntu 20.04 on WSL2 under Windows 11. I get similar behavior when running on GCP or AWS. What would you recommend here?

Thank you!

bab2min commented 2 years ago

Hi @benreaves, there appear to be some bugs in the current implementation of tomotopy.coherence. However, I could not reproduce a similar situation with my test set, so it is difficult to analyze the details. If possible, can you please share the saved_model.bin file that causes the crashes? It would be of great help in figuring out the cause of the bug.

benreaves commented 2 years ago

Yes I will send it later today. Thank you for investigating!

benreaves commented 2 years ago

Yes, here it is! [1] The zip file contains

[1] https://drive.google.com/file/d/1s9WBQ_dxHV55qpy-mzSyB1tGpPuX7mhG/view?usp=sharing

benreaves commented 2 years ago

BTW, it doesn't always give the same error: sometimes it's "bad_alloc", sometimes it just says "Killed" and exits with no traceback, and sometimes it just hangs for at least 8 hours. I really appreciate your looking into this!

bab2min commented 2 years ago

@benreaves Thank you for sharing the files and details. I'll look into them!

benreaves commented 2 years ago

This issue is no longer important. Reasons:

  1. c_npmi seems to work fine, so I can use that instead of c_v
  2. c_v should be avoided, according to this serious issue from 2018: https://github.com/dice-group/Palmetto/issues/13
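Editor's note: for readers unfamiliar with the measure behind c_npmi, here is a minimal sketch of the standard normalized pointwise mutual information (NPMI) for a single word pair. It is not taken from tomotopy's source; the function name npmi, the probability arguments, and the handling of the edge cases are assumptions made for illustration.

```python
import math


def npmi(p_i, p_j, p_ij):
    """Standard NPMI of two words from their estimated probabilities.

    p_i, p_j : probability of observing each word on its own
    p_ij     : probability of observing both words together
    Returns a value in [-1, 1]: 1 for perfect co-occurrence,
    0 for independence, -1 when the words never co-occur.
    """
    if p_ij <= 0.0:
        return -1.0  # never co-occur: the limiting value of NPMI
    if p_ij >= 1.0:
        return 0.0  # 0/0 in the formula; returning 0 is this sketch's choice
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / -math.log(p_ij)


always_together = npmi(0.5, 0.5, 0.5)   # words always appear together
independent = npmi(0.5, 0.5, 0.25)      # p_ij equals p_i * p_j
never_together = npmi(0.5, 0.5, 0.0)    # words never appear together
```

A topic-level score is usually some aggregate of these pairwise values over the topic's top words; exactly how tomotopy aggregates them is not stated in this thread.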

However, I am still having a numerical problem in add_doc(), but that belongs in a new thread: #159