RowitZou / topic-dialog-summ

AAAI-2021 paper: Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling.
MIT License

a question about topic model #32

Closed muguruzawang closed 2 years ago

muguruzawang commented 2 years ago

I trained the base GSM topic model on my own training data, but found that after some iterations the topic model performs poorly, as shown below: [image]

Do you have any idea why the topic model performs so poorly?

RowitZou commented 2 years ago

Do you mean the topic model in my code, or something else? What exactly does GSM mean? Could you provide your code?

muguruzawang commented 2 years ago

I use the topic model implementation in https://github.com/RowitZou/topic-dialog-summ/blob/0de31d97b07be4004e08f9755ee66bea47aa7b10/src/models/topic.py#L7

I train the topic model on a scientific paper dataset, and the BOW construction is the same as in https://github.com/RowitZou/topic-dialog-summ/blob/0de31d97b07be4004e08f9755ee66bea47aa7b10/src/models/data_loader.py#L117

but the trained topics are not what I expected.

What are your preprocessing steps for bag-of-words construction, and are there any special considerations for training the topic model?

Thanks.

RowitZou commented 2 years ago
  1. In my preprocessing procedure, I filtered out stop words, which have a high frequency in all documents. In your case, does 'clustering' or 'flooding' appear in most of the documents?

  2. Before training your topic model, the data should be randomly shuffled, and the batch size could be set larger. The purpose is to avoid shifting the topic model toward a set of documents with a single topic, e.g., 'clustering'.

  3. The learning rate and the number of topics are two crucial hyper-parameters, which should be carefully chosen. In my scenario, the number of topics in customer service dialogues is relatively small (e.g., price, ordering, account), but I think scientific papers would have more topics.

  4. The length of the input document is an important factor. In dialogues, the BOW representations are sparse: about 70%-80% of positions are 0.

  5. If the top-10 words do not show clear differences between topics, how about analyzing the top-20 to top-50 words?
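The filtering steps in points 1 and 2 can be sketched as a minimal pure-Python bag-of-words builder (the function name and thresholds here are illustrative assumptions, not the repo's actual preprocessing code):

```python
from collections import Counter

# Illustrative sketch only -- not the repo's preprocessing code.
def build_bow(docs, stop_words, min_df=2, max_df_ratio=0.5):
    """Build a vocabulary and bag-of-words vectors from tokenized docs,
    dropping stop words, rare words (df < min_df), and near-universal
    words (df / n_docs > max_df_ratio)."""
    n_docs = len(docs)
    df = Counter()  # document frequency per word
    for doc in docs:
        df.update(set(doc) - stop_words)
    vocab = sorted(w for w, c in df.items()
                   if c >= min_df and c / n_docs <= max_df_ratio)
    index = {w: i for i, w in enumerate(vocab)}
    bows = []
    for doc in docs:
        vec = [0] * len(vocab)
        for w in doc:
            if w in index:
                vec[index[w]] += 1
        bows.append(vec)
    return vocab, bows
```

With this kind of cutoff, a word like 'clustering' that appears in most documents would be dropped by `max_df_ratio`, which is one way to implement the advice in point 1.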

muguruzawang commented 2 years ago

Thanks for your reply.

Actually, I have filtered out the stop words. The document frequencies of "clustering" and "flooding" are 3638/143573 and 10/143573, respectively. I also set the batch size to 1024 and the initial learning rate to 0.0001, and trained the model for more than 1000 epochs, but the topic model still tends to allocate the same words to most topics.

I have another question: I find that your topic-model loss coefficient is set to 1e-3, which makes the topic loss much smaller than the summarization loss. How do you ensure that the topic model is fully trained?

RowitZou commented 2 years ago
  1. Have you used the SATM or just the base NTM?
  2. The total training steps are 80000 in my setting, trained jointly with the summarization model. The coefficient is chosen as 0.001, so the actual value of the topic loss is about 1/10 of the MLE loss.
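The relative magnitude described in point 2 can be illustrated with a trivial sketch (the function name and example loss values are assumptions; only the 1e-3 coefficient comes from the discussion):

```python
TOPIC_LOSS_COEF = 1e-3  # coefficient mentioned in the thread

def combined_loss(mle_loss, topic_loss, coef=TOPIC_LOSS_COEF):
    """Joint objective sketch: summarization MLE loss plus a
    down-weighted topic-model loss. If the raw topic loss is roughly
    100x the MLE loss, the scaled topic term ends up around 1/10
    of the MLE term, matching the description above."""
    return mle_loss + coef * topic_loss
```

For example, with an MLE loss of 2.0 and a raw topic loss of 200.0, the scaled topic term is 0.2, one tenth of the MLE loss.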
RowitZou commented 2 years ago

You could output the topic word list every 10 steps to analyze how the topics change during training. Also, try more combinations of learning rates and topic numbers.
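A minimal sketch of that monitoring idea, extracting the top-k words per topic from a topic-word weight matrix (names like `beta` and `top_words` are hypothetical, not taken from the repo):

```python
def top_words(beta, vocab, k=10):
    """Return the k highest-weight words for each topic, given a
    topic-word weight matrix beta (shape: topics x vocab size).
    Hypothetical helper for inspecting topics during training."""
    topics = []
    for row in beta:
        # Rank vocabulary indices by this topic's weights, descending.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:k]])
    return topics
```

Printing this list periodically makes it easy to spot when all topics collapse onto the same few words.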

muguruzawang commented 2 years ago
> 1. Have you used the SATM or just the base NTM?
> 2. The total training steps are 80000 in my setting, accompanied by the summarization model. The coefficient is chosen as 0.001 so the actual value of the topic loss is about 1/10 of the MLE loss.

I haven't tried SATM. I will check the model and my training process further, and maybe try some other topic models. Thanks again.