Closed by ariansajina 3 years ago
Note that in https://euroleaks.diem25.org/leaks/mar17ewg/ "erm" is used to transcribe speech disfluencies (like "hmmm"), so it needs to be added to the stopwords. This was initially confusing because the word is very frequent and could be read as the acronym for the Exchange Rate Mechanism (which would make sense if we were talking about Europe in the 90s, but not really Greece in 2015).
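A minimal sketch of the stopword fix, using only the standard library. The snippet text, the `BASE_STOPWORDS` set and the `tokenize` helper are made up for illustration and are not taken from the project code or from the leaks:

```python
import re
from collections import Counter

# Toy transcript snippet (invented for illustration, not from the leaks).
text = "Erm, the programme, erm, needs a primary surplus, hmmm, and erm reforms."

# Hypothetical stopword sets: a few function words plus transcribed fillers.
BASE_STOPWORDS = {"the", "a", "and"}
DISFLUENCIES = {"erm", "hmmm"}
STOPWORDS = BASE_STOPWORDS | DISFLUENCIES

def tokenize(s):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", s.lower())

tokens = tokenize(text)

# Before filtering, the filler dominates the counts.
raw_counts = Counter(tokens)

# After filtering, only content words remain.
filtered_counts = Counter(t for t in tokens if t not in STOPWORDS)

print(raw_counts.most_common(3))
print(filtered_counts.most_common(3))
```

Keeping the disfluencies in their own set makes it easy to see which stopwords are specific to the transcripts rather than to English in general.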
Use a 3D plot with document=date to show the progression of the talks over time.
TF-IDF was successful in producing a heatmap of keywords and in showing differences between Euroleaks and Communiques, but the subsequent attempt to cluster speakers or dates with NNMF did not work: reconstruction error decreases linearly as the number of dimensions increases, which suggests there is no low-dimensional structure. Therefore need to try clustering with word embeddings.
Do this with document=speaker and document=date.
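For reference, the reconstruction-error check mentioned above could look roughly like this. It is a sketch assuming scikit-learn's `TfidfVectorizer` and `NMF`, not necessarily the setup actually used here; the toy corpus is invented, and each string stands in for one "document" (all speech by one speaker, or on one date):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (invented for illustration, not real transcript data).
docs = [
    "primary surplus target debt sustainability programme",
    "pension reform labour market programme review",
    "liquidity banks emergency assistance deposits",
    "debt restructuring maturity extension sustainability",
    "privatisation fund assets reform review",
    "capital controls banks deposits liquidity",
]

# Documents x terms matrix of TF-IDF weights.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Fit NMF for increasing rank and record the Frobenius reconstruction error.
ranks = range(1, 6)
errors = []
for k in ranks:
    model = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=0)
    model.fit(X)
    errors.append(model.reconstruction_err_)

for k, err in zip(ranks, errors):
    print(f"k={k}: reconstruction error = {err:.3f}")
```

If the error curve showed a clear elbow at some small `k`, that would hint at low-dimensional structure worth clustering on; a roughly straight decline, as observed here, gives no such hint.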