adjidieng / ETM

Topic Modeling in Embedding Spaces
MIT License
540 stars 127 forks

Is it true that a lot of repeated topics appear? #23

Open sharon-gao opened 3 years ago

sharon-gao commented 3 years ago

Hi,

Thanks for your interesting paper and this repository!

I tried training ETM on both 20ng and my own dataset with num_topics = 50.

Among the 50 topics I found some repeated ones, like ['writes', 'article', 'good', 'people', 'make', 'read', 'thing', 'time', 'lot'] (repeated 4 times) and ['time', 'good', 'problem', 'work', 'back', 'problems', 'ago', 'thing', 'couple'] (repeated 2 times).
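One way to put a number on this is to count exact duplicates and to compute the fraction of unique words among the top words of all topics (the ETM paper reports a similar "topic diversity" figure over the top 25 words). The sketch below is only illustrative: the function names are hypothetical and not taken from the ETM repository, and the `topics` list contains just the two repeated topics quoted above, not the full set of 50.

```python
from collections import Counter


def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; values near 0 mean
    the topics are largely copies of each other.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


def exact_duplicates(topics):
    """Return {topic_as_tuple: count} for topics that occur more than once."""
    counts = Counter(tuple(topic) for topic in topics)
    return {topic: n for topic, n in counts.items() if n > 1}


# Illustrative input: only the two repeated topics reported above.
topic_a = ['writes', 'article', 'good', 'people', 'make',
           'read', 'thing', 'time', 'lot']
topic_b = ['time', 'good', 'problem', 'work', 'back',
           'problems', 'ago', 'thing', 'couple']
topics = [topic_a] * 4 + [topic_b] * 2

diversity = topic_diversity(topics)
duplicates = exact_duplicates(topics)

print(f"topic diversity: {diversity:.3f}")
for topic, n in duplicates.items():
    print(f"repeated {n} times: {list(topic)}")
```

A low diversity value on the full list of learned topics would confirm that the repetition is systematic rather than a coincidence in a few topics.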

Does anyone observe the same phenomenon?

RoelTim commented 3 years ago

Hi @ShuangNYU,

Nice that you managed to extract the main topics of your own dataset.

Could you please share your code with us?

A lot of others and I haven't managed to get the output topic vector. #19 #4 #5

sharon-gao commented 3 years ago

Hi @RoelTim,

Glad to hear from you. I created my own formatted data using the code in 'scripts/data_nyt.py'. You can change data_file to the path of your own dataset:

```python
# Read data
print('reading text file...')
data_file = 'raw/new_york_times_text/nyt_docs.txt'
with open(data_file, 'r') as f:
    docs = f.readlines()
```

Then just run this file. If you hit any errors, please tell me and perhaps I can help.

Besides, after finishing this and running the topic model, could you share whether your results also contain a lot of repeated topics?

EJ0917 commented 3 years ago

Hi, @ShuangNYU

Recently I have been trying ETM on my own dataset (a Twitter dataset, with each row as a document) and ran into the same problem as you.

Sample topics I get:
Topic 7: ['government', 'stop', 'back', 'cari', 'great', 'coronavirusoutbreak', 'shit', 'hope', 'read']
Topic 8: ['back', 'stop', 'government', 'coronavirusoutbreak', 'cari', 'shit', 'ya', 'good', 'hai']

Is there any suggested solution? I tried fixing the number of topics, but I still get the same result.
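Topics like these are near-duplicates rather than exact copies, so a pairwise overlap score can make the comparison concrete. Below is a minimal sketch using Jaccard similarity on the top words of Topic 7 and Topic 8; the helper names and the 0.5 threshold are assumptions chosen for illustration, not anything provided by ETM.

```python
from itertools import combinations


def jaccard(words_a, words_b):
    """Jaccard similarity between two lists of top words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def near_duplicate_pairs(topics, threshold=0.5):
    """Return (id_a, id_b, score) for topic pairs whose overlap >= threshold.

    `topics` maps a topic id to its list of top words. The threshold is an
    arbitrary illustrative choice.
    """
    pairs = []
    for (id_a, words_a), (id_b, words_b) in combinations(topics.items(), 2):
        score = jaccard(words_a, words_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


topics = {
    7: ['government', 'stop', 'back', 'cari', 'great',
        'coronavirusoutbreak', 'shit', 'hope', 'read'],
    8: ['back', 'stop', 'government', 'coronavirusoutbreak', 'cari',
        'shit', 'ya', 'good', 'hai'],
}

pairs = near_duplicate_pairs(topics)
for id_a, id_b, score in pairs:
    print(f"Topic {id_a} vs Topic {id_b}: Jaccard = {score:.2f}")
```

Here the two topics share 6 of 12 distinct words, so the score is 0.5.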

lw081701019 commented 3 years ago

Hi, @ShuangNYU

I managed to use data_nyt to create my own formatted data but failed to run it. I guess I have some bugs. I'd appreciate it if you could share your code. It seems they changed main.py recently.

lw081701019 commented 3 years ago

Hi, @EJ0917

I managed to use data_nyt to create my own formatted data but failed to run it. I guess I have some bugs. I'd appreciate it if you could share your code. It seems they changed main.py recently.

liuh236 commented 2 years ago

Same question! All of my topics come out as the same one. Is there any suggested solution? @ShuangNYU

asma-ui commented 1 year ago

@ShuangNYU if you have access to the NYT Annotated Corpus, could you give me access to this dataset? I also need it, but it is not freely available and I don't have much budget to get access to it. Thanks.