Topic modeling is a technique used in Natural Language Processing (NLP) to discover abstract topics or themes within a collection of documents. It's an unsupervised learning approach that aims to identify patterns of word co-occurrence and group words together into topics based on their distribution across documents. One of the most commonly used algorithms for topic modeling is Latent Dirichlet Allocation (LDA), though Non-Negative Matrix Factorization (NMF) is also popular.
Document-Word Matrix: The first step in LDA is to represent the corpus as a document-word matrix, where each row represents a document and each column represents a word in the vocabulary. The entries of this matrix typically represent the frequency of each word in each document.
Initialization: LDA starts by randomly assigning each word in each document to one of K topics, where K is the number of topics chosen by the user in advance. These initial assignments serve as the starting point for the model.
Iterative Inference: LDA iteratively updates its estimates of two sets of latent variables: the topic distribution for each document and the word distribution for each topic. It does this by adjusting the assignments of words to topics based on the observed data and the current estimates of the latent variables.
Gibbs Sampling or Variational Inference: LDA typically employs either Gibbs sampling or variational inference to perform this iterative inference process. These techniques enable LDA to estimate the posterior distribution of the latent variables given the observed data.
Topic Interpretation: Once the model has converged, it assigns each word in the vocabulary a probability of belonging to each topic. These probabilities can be interpreted as the word's relevance to each topic. Additionally, each document is represented as a distribution over topics, indicating the likelihood of each topic's presence in the document.
Topic Extraction: Finally, the user can interpret the resulting topics by examining the most probable words associated with each topic. These words represent the key themes or concepts captured by the topic.
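A minimal sketch of these steps with scikit-learn (its `LatentDirichletAllocation` uses variational inference); the `posts` corpus and the value of K below are toy assumptions:

```python
# Minimal LDA sketch (scikit-learn); `posts` and K are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "rise up against the regime, take the streets",
    "great recipe for chocolate chip cookies",
    "armed struggle begins tomorrow at dawn",
    "my new phone battery lasts two days",
]

# Step 1: document-word matrix (rows = documents, columns = word counts).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)

# Steps 2-4: fit LDA with K topics (scikit-learn uses variational inference).
K = 2  # number of topics, chosen by the user
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic distributions

# Steps 5-6: interpret each topic via its most probable words.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_words}")
print(doc_topics.round(2))  # how strongly each topic is present in each doc
```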
Clustering algorithms like K-means or hierarchical clustering can group similar posts together based on their textual content. Each cluster may represent a different theme or topic present in the posts.
Common Clustering Algorithms:
K-means: A simple and widely used clustering algorithm that partitions the data into K clusters, assigning each point to the nearest cluster centroid (the mean of its members); a minimal sketch follows this list.
Hierarchical Clustering: Builds a hierarchy of clusters by recursively merging or splitting clusters based on their proximity.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups together closely packed points based on density, without requiring a predefined number of clusters.
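A minimal K-means sketch on TF-IDF features; the corpus and cluster count are toy assumptions:

```python
# Minimal K-means clustering sketch; `posts` and n_clusters are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

posts = [
    "join the uprising, overthrow the state",
    "the regime must fall, fight back",
    "lovely weather for a picnic today",
    "my cat slept on the couch all day",
]

X = TfidfVectorizer(stop_words="english").fit_transform(posts)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each post
```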
By clustering posts, we can identify groups of posts that share similar language, rhetoric, or themes related to rebellion or terrorism. These clusters can help in identifying patterns and trends within the dataset, potentially revealing hidden communities or networks of individuals promoting extremist ideologies.
The challenge with document clustering is determining the optimal number of clusters (K) and interpreting the clusters in a meaningful way. Additionally, some clustering algorithms may struggle with high-dimensional and sparse text data.
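One common heuristic for picking K is the silhouette score (higher is better); a sketch, again with a toy corpus:

```python
# Sketch: pick K by silhouette score; `posts` is a toy placeholder corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

posts = [
    "overthrow the government now", "the regime must fall",
    "cute puppy pictures inside", "my dog loves the park",
    "best pasta recipe ever", "how to bake sourdough bread",
]
X = TfidfVectorizer(stop_words="english").fit_transform(posts)

for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```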
Dimensionality reduction techniques are used to reduce the number of features or dimensions in a dataset while preserving important information. In NLP, high-dimensional data such as text documents can be transformed into a lower-dimensional space, making it easier to visualize, analyze, and process.
Principal Component Analysis (PCA): A linear dimensionality reduction technique that identifies orthogonal axes (principal components) along which the variance of the data is maximized.
t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear dimensionality reduction technique that preserves local structure by modeling pairwise similarities between data points.
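A sketch combining the two: PCA to compress dense document embeddings, then t-SNE for a 2-D view. The random matrix stands in for real embeddings (e.g., 300-dimensional document vectors):

```python
# Sketch: PCA then t-SNE to project document embeddings to 2-D.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 300))  # placeholder: 200 docs x 300 dims

# PCA first (linear, fast) to denoise, then t-SNE for local structure.
reduced = PCA(n_components=50, random_state=0).fit_transform(embeddings)
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(reduced)
print(coords.shape)  # (200, 2) -- ready for a scatter plot
```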
Lower-dimensional representations obtained through dimensionality reduction can serve as input features for classification algorithms. By reducing the dimensionality of the text data, we can improve computational efficiency and potentially enhance the performance of classification models.
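A sketch of that pipeline idea. TruncatedSVD is used here as the sparse-friendly analogue of PCA for TF-IDF features (often called LSA); the corpus and labels are toy assumptions:

```python
# Sketch: dimensionality reduction feeding a classifier; toy data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

posts = [
    "rise up against the state", "join the armed struggle",
    "death to the regime", "picnic in the park today",
    "great deal on used bikes", "my cat sleeps all day",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = flagged, 0 = benign (toy labels)

clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),  # tiny corpus -> tiny dim
    LogisticRegression(max_iter=1000),
)
clf.fit(posts, labels)
print(clf.predict(["overthrow the regime"]))
```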
The challenge is choosing the appropriate number of dimensions or components and ensuring that the lower-dimensional representations still capture the relevant information in the text data.
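One common heuristic is to keep enough components to retain a target share of the variance; a sketch with a placeholder embedding matrix:

```python
# Sketch: choose the number of PCA components by cumulative explained variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 300))  # placeholder document vectors

pca = PCA().fit(embeddings)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1  # retain ~95%
print(n_components)
```

scikit-learn can also do this selection directly by passing a float, e.g. `PCA(n_components=0.95)`.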
Anomaly detection (AKA outlier detection) → helps flag rebellious content, unusual hate speech, etc. Common techniques:
- Z-Score: calculate the z-score for each feature (e.g., word frequency) and flag values that lie far from the mean.
- MAD (Median Absolute Deviation): a robust alternative to the z-score that measures deviation from the median.
- Boxplot: visually represents the distribution of a dataset; outliers are data points that fall outside the whiskers (typically 1.5× the interquartile range beyond the quartiles).
- Isolation Forest: an ensemble of decision trees that randomly select features and split values; anomalies are isolated in fewer splits, so they sit closer to the root (see the sketch after this list).
- One-Class SVM (Support Vector Machine): an SVM trained on a single class that learns a decision boundary around normal data; points outside it are flagged.
- Autoencoders: train a network to reconstruct its input; anomalies are detected as instances with high reconstruction error.
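The Isolation Forest sketch referenced in the list; the corpus is a toy assumption and `contamination` (the expected fraction of anomalies) must be set by the user:

```python
# Sketch: flag anomalous posts with an Isolation Forest on TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest

posts = [
    "match highlights from last night", "new album review",
    "weekend hiking photos", "best budget laptops this year",
    "take up arms and burn it all down",  # the odd one out
]
X = TfidfVectorizer(stop_words="english").fit_transform(posts).toarray()

iso = IsolationForest(contamination=0.2, random_state=0).fit(X)
print(iso.predict(X))  # -1 = anomaly, 1 = normal
```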
Anomaly detection can help flag posts containing extremist rhetoric, hate speech, or violent language as potential rebellious or terrorist posts. By identifying anomalies in the dataset, we can prioritize the review and moderation of such posts by human moderators or law enforcement agencies.
The challenge is defining what constitutes an anomaly or outlier in the context of the dataset and selecting appropriate features for detection. Additionally, balancing the trade-off between minimizing false positives (flagging normal posts as anomalous) and false negatives (missing anomalous posts) is crucial.
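When some ground-truth labels exist, that trade-off can be quantified with precision (sensitive to false positives) and recall (sensitive to false negatives); a sketch with toy labels:

```python
# Sketch: precision/recall of an anomaly detector against toy ground truth.
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 1, 0, 1, 1, 0, 0]  # 1 = truly anomalous post
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]  # 1 = flagged by the detector
print(precision_score(y_true, y_pred))  # ~0.67: hurt by false positives
print(recall_score(y_true, y_pred))     # ~0.67: hurt by false negatives
```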
Open questions:
- code: ?
- input: ?
- output: ?
- training scheme: ?
Here is my LDA sample with data from dropped_embedding_300.csv: https://colab.research.google.com/drive/1qYpe5vMyQ32YUtMvFB2Q8Mal4GjS6uOJ?usp=sharing
@minhnn1