ariansajina / master-thesis

MIT License
Analysis: document clustering with TFIDF and NNMF #15

Closed ariansajina closed 3 years ago

ariansajina commented 3 years ago

Do this twice: once with each speaker as a document (document=speaker) and once with each date as a document (document=date).
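A minimal sketch of the two document definitions, assuming the transcripts are available as (speaker, date, text) records and that scikit-learn is used for TFIDF. The records, the speaker names, and the helpers `build_documents` and `tfidf_matrix` are hypothetical and only illustrate the grouping step.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy records: (speaker, date, text). Not real transcript data.
utterances = [
    ("Speaker A", "2015-02-24", "the programme needs debt relief"),
    ("Speaker B", "2015-02-24", "the institutions expect fiscal targets"),
    ("Speaker A", "2015-03-17", "liquidity is running out for the banks"),
    ("Speaker B", "2015-03-17", "fiscal targets remain the priority"),
]


def build_documents(records, key):
    """Concatenate all utterances that share the same speaker or date."""
    index = {"speaker": 0, "date": 1}[key]
    grouped = defaultdict(list)
    for record in records:
        grouped[record[index]].append(record[2])
    labels = sorted(grouped)
    return labels, [" ".join(grouped[label]) for label in labels]


def tfidf_matrix(records, key):
    """Return document labels and the TFIDF matrix (one row per document)."""
    labels, docs = build_documents(records, key)
    vectorizer = TfidfVectorizer(stop_words="english")
    return labels, vectorizer.fit_transform(docs)


speaker_labels, speaker_tfidf = tfidf_matrix(utterances, "speaker")
date_labels, date_tfidf = tfidf_matrix(utterances, "date")
```

Both matrices have one row per document, so the same decomposition step can be applied to either of them afterwards.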

ariansajina commented 3 years ago

Note that in https://euroleaks.diem25.org/leaks/mar17ewg/ "erm" is used to transcribe speech disfluency (like "hmmm"), so it needs to be added to the stopwords. This was initially confusing, because the word is very frequent and could be read as the acronym for Exchange Rate Mechanism, which would make sense if we were talking about Europe in the 90s, but not really Greece in 2015.
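One way to do this, sketched under the assumption that scikit-learn's built-in English stopword list is the starting point. The `DISFLUENCIES` set and the two example sentences are made up for illustration; only "erm" comes from the transcripts.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Assumed set of disfluency tokens; only "erm" is confirmed by the transcripts.
DISFLUENCIES = {"erm", "hmm", "hmmm", "uh", "um"}

# Extend the built-in English stopwords with the disfluency tokens.
stop_words = sorted(ENGLISH_STOP_WORDS.union(DISFLUENCIES))

# Hypothetical sentences, not quotes from the leaks.
docs = [
    "erm the proposal is erm not yet agreed",
    "the erm liquidity situation is difficult",
]

vectorizer = TfidfVectorizer(stop_words=stop_words)
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each retained term to its column index.
vocabulary = set(vectorizer.vocabulary_)
```

After fitting, "erm" no longer appears in `vocabulary`, so it cannot dominate the keyword heatmap.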

ariansajina commented 3 years ago

Use a 3D plot for document=date to show the progression of the talks over time.
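A rough sketch of such a plot, assuming a 3-component NNMF on the date documents supplies the three axes and that matplotlib is used for plotting. The dates and texts below are hypothetical placeholders, not the actual transcripts.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical date documents; ISO date strings sort chronologically.
date_docs = {
    "2015-02-24": "extension of the programme and list of reforms",
    "2015-03-17": "liquidity of the banks and technical teams in athens",
    "2015-04-24": "comprehensive list of reforms and fiscal targets",
    "2015-06-25": "referendum proposal and fiscal targets of the institutions",
}
dates = sorted(date_docs)

tfidf = TfidfVectorizer(stop_words="english").fit_transform(
    [date_docs[d] for d in dates]
)

# Three components give one coordinate per plot axis.
model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=1000)
coords = model.fit_transform(tfidf.toarray())

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# Connect the dates in chronological order to show the trajectory.
ax.plot(coords[:, 0], coords[:, 1], coords[:, 2], marker="o")
for date, (x, y, z) in zip(dates, coords):
    ax.text(x, y, z, date)

ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
ax.set_zlabel("component 3")
```

For interactive use, drop the `matplotlib.use("Agg")` line and call `plt.show()`, or write the figure to disk with `fig.savefig(...)`.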

ariansajina commented 3 years ago

TFIDF was successful in producing a heatmap of keywords and in showing differences between Euroleaks and Communiques, but clustering speakers or dates with NNMF afterwards did not work: the reconstruction error decreases linearly as the number of dimensions increases, meaning there is no low-dimensional structure. Therefore need to try clustering with word embeddings.
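The check behind this conclusion can be sketched as follows, assuming scikit-learn's `NMF` and its `reconstruction_err_` attribute. The documents are hypothetical and `reconstruction_errors` is an illustrative helper, so the numbers it yields say nothing about the real corpus.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents standing in for speaker or date documents.
docs = [
    "debt relief and programme extension",
    "fiscal targets and primary surplus",
    "liquidity of the banks and emergency assistance",
    "pension reform and labour market",
    "privatisation and tax administration",
    "referendum proposal and institutions",
]
tfidf = TfidfVectorizer().fit_transform(docs).toarray()


def reconstruction_errors(matrix, max_components):
    """Fit NMF for k = 1..max_components and collect the reconstruction error."""
    errors = []
    for k in range(1, max_components + 1):
        model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=1000)
        model.fit(matrix)
        # Frobenius norm of (matrix - W @ H) for the fitted model.
        errors.append(model.reconstruction_err_)
    return np.array(errors)


errors = reconstruction_errors(tfidf, max_components=5)

# Error reduction gained by each additional component.
drops = -np.diff(errors)
```

If `drops` stays roughly constant, the error curve has no elbow, which is the linear decrease described above. A sharp fall followed by a plateau would instead point to a small number of meaningful components.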