We deployed Network Buzz within Tribler in 2010. This is what Reddit had to say about it then, and nothing has changed as of today:

> Thanks to the "network buzz" feature (which you can't turn off) it almost never goes below 10% CPU utilization and sometimes just sits at 50% maxing out one of my 2 cores.
Related prior work is the tag systems we have had since 2011; we lack the user community for this. MusicBrainz, whose volunteers have made over 1 million edit/tagging contributions, is our leading example.
Please do a 10-day prototype @drew2a. You have now done top-down design exploration; time for bottom-up "learn-by-doing". We do not know whether "content bundling" can be done exclusively and perfectly with local heuristics and zero database changes, or whether we need to store and gather rich metadata and offer content enrichment plus database changes. What about the near-duplicates we studied for years (and never could fix)?
Let's leave the anti-spam for future sprints :exclamation: No need to distract @grimadas from deployment and debugging the low-level rendezvous component. Only in 2025 will we re-visit the "Justin Bieber is gay" tag spam problem.
The first attempt to group search results locally doesn't offer much hope, as it tends to group quite random torrents together without organizing them into the same content group.
The developed script:
Expanding on the same idea: what if instead of searching for all similar entries in the search results, we look for entries similar only to the first (most relevant) result?
The developed script:
But the results of the experiment are still far from ideal, let alone a minimum viable product.
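For illustration, the "compare everything to the first result" idea can be sketched roughly as follows (this is not the linked script; the titles and the 0.5 threshold are made up):

```python
# Keep only the titles whose TF-IDF vectors are close to the first (most relevant) result.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "ubuntu-22.04-desktop-amd64.iso",      # the first (most relevant) result
    "ubuntu-22.04-live-server-amd64.iso",
    "debian-12.5.0-amd64-netinst.iso",
    "ubuntu-22.04.3-desktop-amd64.iso",
]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(titles)

# Similarity of every title to the first one
similarities = cosine_similarity(X[0], X).ravel()
group = [title for title, s in zip(titles, similarities) if s >= 0.5]
print(group)
```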
We learned something! I genuinely don't think it's a bad start.
Can you create a tiny example with equal Ubuntu-server.iso filenames and try to get -numeric- clusters? With 18.04 and 18.10 together, plus 22.04 and 22.10. Seems the number signal is thrown away?
I've modified the original script to address the question "Seems number signal is thrown away?" and to print all TF-IDF values for each term.
Indeed, it was discovered that certain digits were being ignored, specifically those consisting of a single character. This occurred because the vectorizer, by default, disregards all terms that are composed of only one character.
In the example below, the number 5 is ignored:
```
Original: ubuntu-12.04.5-desktop-i386.iso
Preprocessed: ubuntu 12 04 5 desktop i386 iso
TF-IDF:
    12: 0.668
    i386: 0.530
    desktop: 0.318
    04: 0.275
    iso: 0.248
    ubuntu: 0.185
```
I've fixed it and also added more output. To discern the terms around which titles were grouped into clusters, I analyzed the centroids of the clusters determined by the K-means algorithm. The centroids represent the "center" or mean vector of each cluster in the feature space, essentially capturing the average importance of each term within the cluster. By examining these centroids, we can identify which terms have the highest TF-IDF values across the documents in a cluster, giving us insight into the thematic essence of each cluster.
This information should provide us with a better understanding of the details involved in grouping items into clusters.
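As an illustration of both changes (a token pattern that keeps single-character tokens, and the centroid inspection), a minimal sketch might look like this; the titles and the number of clusters are made up for the example:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "ubuntu 12 04 5 desktop i386 iso",
    "ubuntu 12 04 4 desktop i386 iso",
    "ubuntu 22 04 1 server amd64 iso",
    "ubuntu 22 04 2 server amd64 iso",
]

# The default token_pattern is r"(?u)\b\w\w+\b", which silently drops 1-character tokens
# such as "5"; this pattern keeps them.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(titles)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()

# For each cluster, print the terms with the highest centroid weights.
for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(cluster_id, [(terms[i], round(centroid[i], 3)) for i in top])
```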
> Can you create a tiny example with equal Ubuntu-server.iso filenames and try to get -numeric- clusters? With 18.04 and 18.10 together, plus 22.04 and 22.10. Seems the number signal is thrown away?
I've slightly modified the token pattern for the TfidfVectorizer from `(?u)\b\w+\b` to `(?u)\b\d+\b` (which means "use only digits") and obtained results quite close to what was requested:
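To make the digit-only setup concrete, here is a minimal sketch of this kind of pipeline (the filenames and the choice of KMeans with two clusters are illustrative, not the original script):

```python
import re
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "ubuntu-18.04-live-server-amd64.iso",
    "ubuntu-18.10-live-server-amd64.iso",
    "ubuntu-22.04-live-server-amd64.iso",
    "ubuntu-22.10-live-server-amd64.iso",
]
preprocessed = [re.sub(r"[^\w]+", " ", t) for t in titles]

# Only digit tokens are kept, so the version numbers drive the clustering.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\d+\b")
X = vectorizer.fit_transform(preprocessed)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for title, label in zip(titles, labels):
    print(label, title)
```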
In the previous example, elements were grouped fairly well, but there was a cluster containing elements close to noise (Cluster 4). To identify such a cluster (and filter it out in the future), we attempted to calculate the intra-cluster dispersion. This approach aimed to quantify the cohesion within each cluster by measuring the average distance of points from their cluster centroid. The rationale behind this method is that a cluster with a higher average distance among its points might be less cohesive and potentially contain more noise, making it a candidate for exclusion from further analysis.
To implement this, we first grouped the indices of the elements belonging to each cluster. Then, for each cluster, we constructed a matrix of its points by vertically stacking the corresponding rows from the TF-IDF matrix X using the indices we had collected. This allowed us to calculate pairwise distances between each point in a cluster and its centroid, using the pairwise_distances function. By computing the mean of these distances, we obtained a measure of intra-cluster dispersion.
The calculated average distances provided a clear metric to assess the tightness of each cluster. Clusters with lower average distances were deemed more cohesive, indicating that their elements were closely related to each other and to the cluster's overall theme. Conversely, clusters with higher average distances were scrutinized for potential exclusion, as their wide dispersion suggested a lack of a unifying theme or the presence of outlier elements. This methodological adjustment offered a systematic way to identify and potentially remove clusters that detract from the clarity and relevance of the clustering outcome, thereby refining the analysis.
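A minimal sketch of this dispersion measure, assuming the TF-IDF/KMeans pipeline used above (the data here is illustrative):

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

titles = ["ubuntu 18 04 server", "ubuntu 18 10 server",
          "ubuntu 22 04 server", "debian 12 5 netinst"]
X = TfidfVectorizer(token_pattern=r"(?u)\b\d+\b").fit_transform(titles)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Mean distance of each cluster's points to its centroid: lower = more cohesive.
for cluster_id in np.unique(kmeans.labels_):
    indices = np.where(kmeans.labels_ == cluster_id)[0]
    points = vstack([X[i] for i in indices])            # rows of the TF-IDF matrix
    centroid = kmeans.cluster_centers_[cluster_id].reshape(1, -1)
    mean_distance = pairwise_distances(points, centroid).mean()
    print(cluster_id, round(mean_distance, 3))
```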
Another metric that can be utilized for filtering clusters is the silhouette coefficient (its values range from -1 to 1). This metric provides insight into the distance between clusters and the cohesion within them. By calculating the silhouette coefficient for each sample within the dataset, we gain the ability to evaluate not just the overall clustering performance but also the individual performance of each cluster. This granular analysis is crucial for identifying clusters that may not be well-defined or might contain elements that are essentially outliers, potentially skewing the overall analysis.
To implement this, we first used the silhouette_samples function from scikit-learn, which computes the silhouette coefficient for each sample, giving us a detailed breakdown of how well each sample fits within its assigned cluster compared to neighboring clusters. By aggregating these scores on a per-cluster basis, we were able to compute an average silhouette score for each cluster. This average score serves as a proxy for the cluster's quality, with higher scores indicating tighter and more distinct clusters, and lower scores suggesting clusters with overlapping or diffuse boundaries.
This approach allowed us to systematically evaluate each cluster's integrity. Clusters with low average silhouette scores were flagged for exclusion from further analysis.
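A sketch of the per-cluster silhouette computation described above (data and clustering setup are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_samples

titles = ["ubuntu 18 04 server", "ubuntu 18 04 desktop",
          "ubuntu 22 10 server", "ubuntu 22 10 desktop",
          "fedora 39 workstation"]
X = TfidfVectorizer(token_pattern=r"(?u)\b\d+\b").fit_transform(titles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# silhouette_samples gives one score per title; averaging per cluster flags
# loosely-knit clusters (low average score) as candidates for exclusion.
scores = silhouette_samples(X, labels)
for cluster_id in np.unique(labels):
    cluster_score = scores[labels == cluster_id].mean()
    print(cluster_id, round(cluster_score, 3))
```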
The best part of any job is the visualization:
Instead of integrating the current algorithm into Tribler, I decided to focus on its improvement and dedicate half of the current week to this task.
I haven't yet focused on measuring the algorithm's performance because I want to first ensure that the clustering results are as accurate as possible. There are two main areas I'm currently working on to improve the quality of the clustering:
Once I'm confident that the algorithm is producing the best possible clustering outcomes, I'll turn my attention to optimizing its performance.
So, the next iteration of the algorithm contains two modifications:
Initially, our algorithm employed KMeans for clustering, which requires specifying the number of clusters a priori. This requirement posed a significant limitation, as determining the optimal number of clusters is not straightforward and can vary significantly depending on the dataset's nature and size. To address this challenge, we transitioned to HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). Unlike KMeans, HDBSCAN does not require pre-specifying the number of clusters. Instead, it dynamically identifies clusters based on data density, offering several advantages:
- Adaptability: HDBSCAN adapts to the inherent structure of the data, leading to more meaningful and natural groupings.
- Noise handling: it effectively identifies and isolates noise, improving the overall quality of the clusters. Unlike KMeans, where every point is assigned to a cluster regardless of how well it fits, HDBSCAN can leave points unassigned (labeled as -1).
- Variable cluster sizes: the algorithm accommodates clusters of varying densities and sizes, aligning more closely with real-world data distributions.
This shift aims to achieve more accurate and representative clustering by leveraging the data's natural structure, potentially enhancing the user experience through more precise content categorization.
The original vectorization approach using TF-IDF (Term Frequency-Inverse Document Frequency) focused on individual terms without considering the order or proximity of words. To capture the contextual nuances and the sequence in which terms appear, we integrated n-grams into our TF-IDF vectorization (`TfidfVectorizer(token_pattern=r'(?u)\b\d+\b', ngram_range=(1, 2))`). N-grams are contiguous sequences of n items from a given sample of text or speech. By incorporating n-grams:
- Contextual awareness: the algorithm can now recognize and give weight to term proximity and order, capturing more nuanced meanings.
- Feature enrichment: including n-grams expands the feature set with phrase-level information, which is particularly beneficial for understanding the context and thematic content.
- Quality improvement: this adjustment is anticipated to significantly enhance the quality of clustering by providing a richer, more contextually informed feature set for analysis.
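A minimal sketch of the two modifications combined, assuming the scikit-learn >= 1.3 HDBSCAN (the standalone `hdbscan` package exposes the same parameters); the titles are illustrative:

```python
import re
from sklearn.cluster import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "ubuntu-20.04.1-desktop-amd64.iso",
    "ubuntu-20.04.2-desktop-amd64.iso",
    "ubuntu-20.04.3-desktop-amd64.iso",
    "ubuntu-22.10-desktop-amd64.iso",
    "ubuntu-22.10-live-server-amd64.iso",
    "random-noise-title.iso",
]
preprocessed = [re.sub(r"[^\w]+", " ", t) for t in titles]

# Digit-only tokens plus 2-grams such as "20 04".
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\d+\b", ngram_range=(1, 2))
X = vectorizer.fit_transform(preprocessed).toarray()

# min_cluster_size / min_samples / cluster_selection_epsilon control how general
# or specific the resulting groups are; label -1 marks noise.
clusterer = HDBSCAN(min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0.0)
labels = clusterer.fit_predict(X)
for title, label in zip(titles, labels):
    print(label, title)
```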
To achieve more specific clustering results, such as differentiating between clusters for "Ubuntu 20.04.X" instead of a more general "Ubuntu 20.04", the following HDBSCAN constructor parameters can be adjusted: `min_samples` and `cluster_selection_epsilon`.
- `min_samples`: This parameter influences how conservatively the algorithm defines what constitutes a dense cluster. By increasing `min_samples`, you can ensure that only groups of data points with a higher density are considered clusters, leading to more specific and tightly-knit clusters. For example, setting `min_samples` to a higher value could help distinguish between different subversions of Ubuntu 20.04 by requiring a denser concentration of points to form a cluster.
- `cluster_selection_epsilon`: Adjusting this parameter affects the algorithm's sensitivity to forming new clusters based on density distance. A lower `cluster_selection_epsilon` value can lead to the formation of more, smaller clusters by preventing the merging of nearby clusters. This can be particularly useful for distinguishing between closely related but distinct versions or configurations, such as different patches of Ubuntu 20.04.
Conversely, to configure HDBSCAN to create more general groups, the same parameters can be adjusted in the opposite direction. Decreasing `min_samples` allows for more lenient cluster formation, potentially grouping various subversions of Ubuntu 20.04 into a single cluster. Similarly, increasing `cluster_selection_epsilon` encourages the merging of nearby clusters into larger, more general groups.
By fine-tuning these parameters, HDBSCAN can be tailored to identify clusters at the desired level of specificity, from highly detailed clusters differentiating between minor variations to broader groups encompassing more general categories.
Below are two examples:
- `min_samples=1` and `cluster_selection_epsilon=0`
- `min_samples=3` and `cluster_selection_epsilon=0.5`
The selected parameter values for the HDBSCAN constructor are not definitive but are intended to illustrate the potential for enhancing clustering quality through careful optimization. The optimal settings for these parameters can significantly vary, underscoring the importance of adjustment based on the specific clustering goals. Intuitively, the choice of these values should align with the user's objectives: broader topic identification might necessitate one set of parameters, while uncovering more detailed, dense information may require a different configuration.
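For reference, the two configurations above correspond to constructor calls like these (again assuming the scikit-learn / hdbscan implementation):

```python
from sklearn.cluster import HDBSCAN

specific = HDBSCAN(min_samples=1, cluster_selection_epsilon=0.0)  # more, smaller clusters
general = HDBSCAN(min_samples=3, cluster_selection_epsilon=0.5)   # fewer, broader groups
```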
Ref:
The next step involved a deeper exploration of vectorization algorithms to determine if there are more advanced options beyond TFIDF that could better suit our needs. This exploration led us to experiment with FastText, an advanced word embedding technique known for capturing the nuances of word semantics and relationships more effectively than traditional TFIDF. FastText, by leveraging neural network models, generates vector representations of words that incorporate the context in which words appear, as well as the morphology of the words themselves.
While the results obtained with FastText were practically identical to those achieved with TF-IDF, a key distinction emerged: FastText analyzes all of the tokens, not just the numeric ones as in the previous TF-IDF version.
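A rough sketch of this FastText setup, assuming the gensim implementation (this is an illustration, not the referenced script; titles and hyperparameters are made up):

```python
import re
import numpy as np
from gensim.models import FastText

titles = [
    "ubuntu-20.04.1-desktop-amd64.iso",
    "ubuntu-20.04.2-desktop-amd64.iso",
    "ubuntu-22.10-live-server-amd64.iso",
]
tokenized = [re.sub(r"[^\w]+", " ", t).split() for t in titles]

# Every token (not only digits) gets an embedding.
model = FastText(sentences=tokenized, vector_size=32, window=3, min_count=1, epochs=50)

# One dense vector per title (mean of its token vectors); these can be fed to
# HDBSCAN in the same way as the TF-IDF rows.
title_vectors = np.array([np.mean([model.wv[token] for token in tokens], axis=0)
                          for tokens in tokenized])
print(title_vectors.shape)  # (3, 32)
```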
Ref:
This endeavor was an attempt to leverage transformers, specifically the BERT model, as a tokenizer in our clustering process. When utilizing BERT as a tokenizer, we observe that it delivers inferior results compared to analyzing either the entire titles or only the extracted digits. Additionally, BERT is noticeably slower and more resource-intensive.
I acknowledge that I haven't delved deeply into BERT's intricacies, as it's a complex model, and gaining a thorough understanding would require a substantial investment of time. My goal was to create a simple, almost out-of-the-box example to get a sense of how it operates.
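For completeness, a minimal "out-of-the-box" sketch of obtaining BERT embeddings for titles, assuming the Hugging Face transformers package and the public bert-base-uncased checkpoint (not the referenced script):

```python
import torch
from transformers import AutoModel, AutoTokenizer

titles = [
    "ubuntu-20.04.1-desktop-amd64.iso",
    "ubuntu-22.10-live-server-amd64.iso",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoded = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded)

# Mean-pool the token embeddings into one vector per title (768 dimensions),
# which can then be clustered like the other representations.
title_vectors = outputs.last_hidden_state.mean(dim=1)
print(title_vectors.shape)  # torch.Size([2, 768])
```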
Ref:
The final improvement in this iteration was an attempt to modify the standard TF-IDF algorithm to account for the position of tokens. This gave the best results of all the experiments conducted (comparable to n-grams with TF-IDF). By incorporating token positioning, we were able to differentiate between identical tokens based on their locations within the text, offering a deeper insight into the document's structure. Though our implementation is somewhat naive and likely not the most efficient in terms of performance, it represents a swift and straightforward prototype.
The trick is:
```python
from sklearn.feature_extraction.text import TfidfVectorizer


class PositionalDigitTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Custom analyzer that extracts digits and their position within the document
        def positional_digit_analyzer(doc):
            # Splitting the document into words
            words = doc.split()
            # Initializing a list to store digits with their position
            positional_digits = []
            # Iterating over words and their indexes in the list
            for index, word in enumerate(words, start=1):  # Indexing starts from 1
                if word.isdigit():  # Checking if the word is a digit
                    # Adding to the list in the format "digit_position_in_text"
                    positional_digits.append(f"{word}_{index}")
                    positional_digits.append(f"{word}")
            return positional_digits

        return positional_digit_analyzer
```
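A small usage sketch for the class above (the titles here are the preprocessed strings, as in the examples that follow):

```python
titles = [
    "ubuntu mate 20 4 4 desktop amd64 iso",
    "ubuntu mate 21 10 desktop amd64 iso",
]

vectorizer = PositionalDigitTfidfVectorizer()
X = vectorizer.fit_transform(titles)

# Print the non-zero terms (plain and positional) for the first title.
for term, score in zip(vectorizer.get_feature_names_out(), X.toarray()[0]):
    if score > 0:
        print(term, round(score, 3))
```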
This modification leads to tokens that look like:
```
Original: ubuntu-mate-20.04.4-desktop-amd64.iso
Preprocessed: ubuntu mate 20 4 4 desktop amd64 iso
TF-IDF:
    4_5: 0.562
    20_3: 0.527
    4_4: 0.393
    4: 0.358
    20: 0.351

Original: ubuntu-mate-21.10-desktop-amd64.iso
Preprocessed: ubuntu mate 21 10 desktop amd64 iso
TF-IDF:
    21_3: 0.624
    10_4: 0.538
    21: 0.488
    10: 0.288
```
For further refinement of the algorithm, the following paper could be used as a reference: "Optimized TF-IDF Algorithm with the Adaptive Weight of Position of Word" available at https://www.atlantis-press.com/proceedings/aiie-16/25866330. This research suggests possible avenues for enhancing the complexity and effectiveness of our approach, indicating more advanced strategies for incorporating positional information into text vectorization.
Disclaimer: The code for the scripts above was generated by ChatGPT.
"Content Bundle" is a strategic feature in Tribler aimed at enhancing the organization and accessibility of digital content. It acts as an aggregation point for Content Items, bundling them together under a single, cohesive unit. This structure allows users to efficiently manage and access groups of related Content Items, simplifying navigation and retrieval. Ideal for categorizing content that shares common themes, attributes, or sources, the Content Bundle provides a streamlined way to handle complex sets of information, making it easier for users to find and interact with a rich array of content within the Tribler network.
The current representation of Content Items can be seen in the following picture:
We want them to have another layer of grouping:
Everything that we need already exists in our Knowledge Database; we can reuse the existing `CONTENT_ITEM` as follows:

Or another structure:

Or:
So it's an open question regarding the structure. Please suggest your ideas.
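Purely as an illustration of the extra grouping layer (all names here are hypothetical and not the actual Knowledge Database schema):

```python
from dataclasses import dataclass, field

@dataclass
class ContentItem:
    title: str                                   # e.g. "Ubuntu 22.04.3"
    infohashes: list[str] = field(default_factory=list)

@dataclass
class ContentBundle:
    title: str                                   # e.g. "Ubuntu"
    items: list[ContentItem] = field(default_factory=list)

bundle = ContentBundle("Ubuntu", [
    ContentItem("Ubuntu 22.04.3"),
    ContentItem("Ubuntu 22.10"),
])
print(len(bundle.items))
```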
To complete this task, we need to:
Related: #6214