sharon-gao opened this issue 4 years ago
Hi @ShuangNYU,
Nice that you managed to extract the main topics of your own dataset.
Could you please share your code with us?
Many others and I haven't managed to get the output topic vectors. #19 #4 #5
Hi @RoelTim ,
Glad to hear from you. I created my own formatted data using the code in `scripts/data_nyt.py`.
You can change the data_file to a path to your own dataset.
```python
# Read data
print('reading text file...')
data_file = 'raw/new_york_times_text/nyt_docs.txt'
with open(data_file, 'r') as f:
    docs = f.readlines()
```
Then just run that file. If you hit any error, let me know and perhaps I can help.
Also, once you've finished this and run the topic model, could you share whether your results contain a lot of repeated topics?
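Before running the script, it may help to see the input format it expects: a plain-text file with exactly one document per line, which `readlines()` then splits back into a list of documents. Below is a minimal sketch; the file name `my_docs.txt` and the sample documents are placeholders, not part of the repository.

```python
# Minimal sketch: write a corpus as one document per line, the format the
# snippet above (and scripts/data_nyt.py) expects when it calls readlines().
docs = [
    "first document about politics and government",
    "second document about sports and games",
]

with open("my_docs.txt", "w") as f:
    for doc in docs:
        # strip embedded newlines so each document stays on a single line
        f.write(doc.replace("\n", " ") + "\n")

# Read it back exactly the way the preprocessing script does
with open("my_docs.txt", "r") as f:
    loaded = f.readlines()

print(len(loaded))  # one entry per document
```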
Hi, @ShuangNYU
Recently I have been trying ETM on my own dataset (a Twitter corpus, one document per row) and I run into the same issue as you.
Sample topics I get:
Topic 7: ['government', 'stop', 'back', 'cari', 'great', 'coronavirusoutbreak', 'shit', 'hope', 'read']
Topic 8: ['back', 'stop', 'government', 'coronavirusoutbreak', 'cari', 'shit', 'ya', 'good', 'hai']
Is there any suggested solution? I tried changing the number of topics, but the result is the same.
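One quick way to quantify how similar two topics are is the Jaccard overlap of their top-word lists. This is just a diagnostic sketch, not part of the ETM codebase; the two word lists are copied from the comment above, and any threshold for calling topics "duplicates" would be an arbitrary choice.

```python
# Sketch: measure near-duplicate topics via Jaccard overlap of top words.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

topic7 = ['government', 'stop', 'back', 'cari', 'great',
          'coronavirusoutbreak', 'shit', 'hope', 'read']
topic8 = ['back', 'stop', 'government', 'coronavirusoutbreak',
          'cari', 'shit', 'ya', 'good', 'hai']

# 6 shared words out of 12 distinct words across both lists
print(jaccard(topic7, topic8))  # -> 0.5
```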
Hi, @ShuangNYU
I managed to use data_nyt.py to create my own formatted data but failed to run the model. I guess I have some bugs; I'd appreciate it if you could share your code. It seems they changed main.py recently.
Hi, @EJ0917
I managed to use data_nyt.py to create my own formatted data but failed to run the model. I guess I have some bugs; I'd appreciate it if you could share your code. It seems they changed main.py recently.
Same question! All of my topics come out as the same one. Is there any suggested solution? @ShuangNYU
@ShuangNYU if you have access to the NYT annotated corpus, could you give me access to this dataset? I also need it, but it is not freely available and I don't have much of a budget to purchase access. Thanks!
Hi,
Thanks for your interesting paper and this repository!
I tried training ETM on both 20ng and my own dataset with num_topics = 50.
Among the 50 topics I found some repeated ones, such as ['writes', 'article', 'good', 'people', 'make', 'read', 'thing', 'time', 'lot'] (repeated 4 times) and ['time', 'good', 'problem', 'work', 'back', 'problems', 'ago', 'thing', 'couple'] (repeated 2 times).
Does anyone observe the same phenomenon?
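A simple way to summarize this phenomenon across all 50 topics is the topic-diversity measure from the ETM paper: the fraction of unique words among the top-k words of all topics (1.0 means no word is shared between topics; low values mean many repeated topics). The sketch below uses toy topics for illustration, not real model output.

```python
# Sketch: topic diversity = fraction of unique words among the top-k
# words of all topics (the ETM paper reports this with k = 25).
def topic_diversity(topics, k=25):
    top_words = [w for topic in topics for w in topic[:k]]
    return len(set(top_words)) / len(top_words)

topics = [
    ['writes', 'article', 'good', 'people', 'make'],
    ['writes', 'article', 'good', 'people', 'make'],  # exact repeat
    ['time', 'problem', 'work', 'back', 'ago'],
]
print(topic_diversity(topics, k=5))  # 10 unique words / 15 total
```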