Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

KnowledgeCommunity: Content Bundle #7837

Closed: drew2a closed this issue 5 months ago

drew2a commented 8 months ago

"Content Bundle" is a strategic feature in Tribler aimed at enhancing the organization and accessibility of digital content. It acts as an aggregation point for Content Items, bundling them together under a single, cohesive unit. This structure allows users to efficiently manage and access groups of related Content Items, simplifying navigation and retrieval. Ideal for categorizing content that shares common themes, attributes, or sources, the Content Bundle provides a streamlined way to handle complex sets of information, making it easier for users to find and interact with a rich array of content within the Tribler network.

The current representation of Content Items can be seen in the following picture:

image

We want them to have another layer of grouping:

Ubuntu 20
|
├ Ubuntu 20.04
|  ├ infohash 1
|  ├ ...
|  └ infohash N
|
└ Ubuntu 20.10
   ├ infohash K
   ├ ...
   └ infohash M

Everything we need already exists in our Knowledge Database. We can reuse the existing CONTENT_ITEM as follows:

subject_type |   subject    | object_type  |    object    |
===========================================================
TORRENT      |  infohash 1  | CONTENT_ITEM | Ubuntu 20.04 |
TORRENT      |  infohash N  | CONTENT_ITEM | Ubuntu 20.04 |
TORRENT      |  infohash K  | CONTENT_ITEM | Ubuntu 20.10 |
TORRENT      |  infohash M  | CONTENT_ITEM | Ubuntu 20.10 |
CONTENT_ITEM | Ubuntu 20.04 | CONTENT_ITEM |   Ubuntu 20  |
CONTENT_ITEM | Ubuntu 20.10 | CONTENT_ITEM |   Ubuntu 20  |
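
To make the first proposal concrete, here is a minimal sketch of how the two-level grouping could be read back from such statements. The tuple layout mirrors the table columns; `build_bundles` and the in-memory list are hypothetical names for illustration only, not part of the Knowledge Database API.

```python
from collections import defaultdict

# Each statement is (subject_type, subject, object_type, object),
# mirroring the columns of the table above.
STATEMENTS = [
    ("TORRENT", "infohash 1", "CONTENT_ITEM", "Ubuntu 20.04"),
    ("TORRENT", "infohash N", "CONTENT_ITEM", "Ubuntu 20.04"),
    ("TORRENT", "infohash K", "CONTENT_ITEM", "Ubuntu 20.10"),
    ("TORRENT", "infohash M", "CONTENT_ITEM", "Ubuntu 20.10"),
    ("CONTENT_ITEM", "Ubuntu 20.04", "CONTENT_ITEM", "Ubuntu 20"),
    ("CONTENT_ITEM", "Ubuntu 20.10", "CONTENT_ITEM", "Ubuntu 20"),
]


def build_bundles(statements):
    """Return {bundle: {content_item: [infohash, ...]}} (hypothetical helper)."""
    torrents = defaultdict(list)  # content item -> infohashes attached to it
    parents = {}                  # content item -> bundle it belongs to
    for subject_type, subject, object_type, obj in statements:
        if object_type != "CONTENT_ITEM":
            continue
        if subject_type == "TORRENT":
            torrents[obj].append(subject)
        elif subject_type == "CONTENT_ITEM":
            parents[subject] = obj

    bundles = defaultdict(dict)
    for item, bundle in parents.items():
        bundles[bundle][item] = torrents.get(item, [])
    return dict(bundles)


bundles = build_bundles(STATEMENTS)
```

With this layout the bundle is just a CONTENT_ITEM whose subjects are other CONTENT_ITEMs, so no new object type is needed.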

Or another structure:


subject_type |         subject              | object_type  |    object    |
===========================================================================
TORRENT      |        infohash 1            | CONTENT_ITEM | Ubuntu 20    |
TORRENT      |        infohash N            | CONTENT_ITEM | Ubuntu 20    |
TORRENT      |        infohash K            | CONTENT_ITEM | Ubuntu 20    |
TORRENT      |        infohash M            | CONTENT_ITEM | Ubuntu 20    |
CONTENT_ITEM | HASH(infohash 1 + Ubuntu 20) | CONTENT_ITEM |      04      |
CONTENT_ITEM | HASH(infohash N + Ubuntu 20) | CONTENT_ITEM |      04      |
CONTENT_ITEM | HASH(infohash K + Ubuntu 20) | CONTENT_ITEM |      10      |
CONTENT_ITEM | HASH(infohash M + Ubuntu 20) | CONTENT_ITEM |      10      |

Or:

subject_type |         subject              | object_type  |    object    |
===========================================================================
TORRENT      |        infohash 1            | CONTENT_ITEM |      04      |
TORRENT      |        infohash N            | CONTENT_ITEM |      04      |
TORRENT      |        infohash K            | CONTENT_ITEM |      10      |
TORRENT      |        infohash M            | CONTENT_ITEM |      10      |
CONTENT_ITEM |    HASH(infohash 1 + 04)     | CONTENT_ITEM | Ubuntu 22    |
CONTENT_ITEM |    HASH(infohash N + 04)     | CONTENT_ITEM | Ubuntu 22    |
CONTENT_ITEM |    HASH(infohash K + 10)     | CONTENT_ITEM | Ubuntu 22    |
CONTENT_ITEM |    HASH(infohash M + 10)     | CONTENT_ITEM | Ubuntu 22    |
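
For the two hash-based variants, the subject of the second-level statement has to be derived from the torrent and its first-level item. The sketch below shows one way that could look. The issue does not say which hash function `HASH(...)` stands for, so SHA-1 over the two parts joined by a separator is purely an assumption, and `link_id` and `statements_for` are hypothetical names.

```python
import hashlib


def link_id(infohash: str, content_item: str) -> str:
    """Hypothetical HASH(infohash + content_item).

    Assumption: SHA-1 over both parts, joined by a NUL separator so that
    ("ab", "c") and ("a", "bc") cannot collide by plain concatenation.
    """
    data = f"{infohash}\x00{content_item}".encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def statements_for(infohash: str, first_level: str, second_level: str):
    """Build the pair of rows that one torrent contributes to the table."""
    return [
        ("TORRENT", infohash, "CONTENT_ITEM", first_level),
        ("CONTENT_ITEM", link_id(infohash, first_level), "CONTENT_ITEM", second_level),
    ]


rows = statements_for("infohash 1", "Ubuntu 20", "04")
```

The trade-off compared with the first proposal is that the second-level statement is tied to one specific torrent, so every torrent needs its own pair of rows.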

The structure is therefore still an open question. Please suggest your ideas.

To complete this task, we need to:

Related:

synctext commented 7 months ago

We deployed Network Buzz within Tribler in 2010. This is what Reddit had to say about it at the time:

> Nothing has changed as of today. Thanks to the "network buzz" feature (which you can't turn off) it almost never goes below 10% CPU utilization and sometimes just sits at 50% maxing out one of my 2 cores.

Related prior work is the tag systems we have had since 2011; we lack the user community for this. MusicBrainz has volunteers with over 1 million edit/tagging contributions; that is our leading example.

Please do a 10-day prototype, @drew2a. You have now done the top-down design exploration; it is time for bottom-up "learn-by-doing". We do not know whether "content bundling" can be done exclusively and perfectly with local heuristics and zero database changes, or whether we need to store and gather rich metadata and offer content enrichment plus database changes. What about the near-duplicates we studied for years (and never could fix)?

Let's leave the anti-spam for future sprints :exclamation: No need to distract @grimadas from deploying and debugging the low-level rendezvous component. Only in 2025 will we revisit the "Justin Bieber is gay" tag-spam problem.

drew2a commented 7 months ago

The first attempt at grouping search results locally doesn't offer much hope: it tends to group fairly random torrents together rather than organizing them into coherent content groups.

The developed script:

  1. Load Titles: Loads titles from a specified text file.
  2. Download NLTK Resources: Downloads necessary NLTK resources like stopwords and WordNet for lemmatization.
  3. Preprocess Text: Includes removing text inside parentheses and brackets, converting to lowercase, removing punctuation and stopwords, and lemmatizing.
  4. Vectorize Text: Uses TF-IDF to convert the preprocessed titles into numerical vectors.
  5. Cluster Titles: Applies K-means clustering to the vectorized titles.
  6. Output Results: Groups titles by their cluster and prints them.

```python
import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file.
# Note: set() removes duplicates but does not preserve the file order.
results = list(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
)
print('Results:')
for r in results[:50]:
    print(f'\t{r}')

# Preprocess each title
print('\nPreprocessed results:')
preprocessed_results = [preprocess_text(title) for title in results]
for r in preprocessed_results:
    print(f'\t{r}')

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_results)

print("Clustering...")
# Cluster using K-means (n_clusters is left at the scikit-learn default of 8)
kmeans = KMeans(random_state=42)
kmeans.fit(X)

# Output clustering results
labels = kmeans.labels_
clusters = defaultdict(list)

# Group titles by their clusters
for i, label in enumerate(labels):
    clusters[label].append(results[i])

# Print clustering results by cluster
print("Clustering results by cluster:")
for cluster, titles in clusters.items():
    print(f"\nCluster {cluster}:")
    for title in titles:
        print(f"- {title}")
```

Results for ubuntu:

```
Results: Ubuntu Linux основы администрирования ubuntu-14.04.5-server-amd64.iso ubuntu-mate-20.04.3-desktop-amd64.iso
ubuntu-18.04-live-server-amd64.iso Ubuntu 20.04.2.0 Desktop (64-bit) ubuntu-22.04-live-server-amd64.iso Ubuntu 20.04.3 (AMD64) (Server) ubuntu-17.04-server-amd64.iso ubuntu-20.04.4-desktop-amd64.iso ubuntu-20.04.4-live-server-amd64.iso ubuntu-22.04.1-desktop-amd64.iso Ubuntu reducido ubuntu ubuntu-11.04-desktop-amd64.iso ubuntu-11.04-alternate-i386.iso Ubuntu 12.10 Desktop (i386) ubuntu-20.04.1-desktop-amd64.iso ubuntu-22.04-desktop-amd64.iso Ubuntu 20.04.1 Desktop.iso ubuntu-22.10-desktop-amd64.iso ubuntu-18.10-desktop-amd64.iso ubuntu-14.10-desktop-i386.iso [Ubuntu] Anonymous OS 0.1 Ubuntu 9.10 ubuntu-15.04-desktop-i386.iso ubuntu-14.04-desktop-i386.iso ubuntu-11.10-dvd-amd64.iso ubuntu-15.04-desktop-amd64.iso ubuntu-18.04.3-live-server-amd64.iso ubuntu-21.04-desktop-amd64.iso Ubuntu 16.10 ubuntu-23.04-live-server-amd64.iso ubuntu-18.10-server-amd64.iso ubuntu-18.04.1-desktop-amd64.iso ubuntu-14.04.4-desktop-amd64.iso ubuntu-14.04-server-amd64.ova ubuntu-16.04-desktop-i386.iso ubuntu-16.04.6-server-amd64.iso ubuntu-10.10-xenon-beta5 ubuntu-20.04.3-desktop-amd64.iso Ubuntu ubuntu-20.04.2-desktop-amd64.iso Ubuntu Facile 01 2014.pdf ubuntu-12.04.5-dvd-i386.iso ubuntu-17.10-desktop-amd64.iso ubuntu-mate-20.04.4-desktop-amd64.iso Ubuntu Unleashed 2019 Edition ubuntu-20.04-live-server-amd64.iso ubuntu-11.10-desktop-i386.iso Ubuntu Ultimate Edition 1.9 Preprocessed results: ubuntu linux основы администрирования ubuntu 14 04 5 server amd64 iso ubuntu mate 20 04 3 desktop amd64 iso ubuntu 18 04 live server amd64 iso ubuntu 20 04 2 0 desktop ubuntu 22 04 live server amd64 iso ubuntu 20 04 3 ubuntu 17 04 server amd64 iso ubuntu 20 04 4 desktop amd64 iso ubuntu 20 04 4 live server amd64 iso ubuntu 22 04 1 desktop amd64 iso ubuntu reducido ubuntu ubuntu 11 04 desktop amd64 iso ubuntu 11 04 alternate i386 iso ubuntu 12 10 desktop ubuntu 20 04 1 desktop amd64 iso ubuntu 22 04 desktop amd64 iso ubuntu 20 04 1 desktop iso ubuntu 22 10 desktop amd64 iso ubuntu 18 10 desktop amd64 
iso ubuntu 14 10 desktop i386 iso anonymous o 0 1 ubuntu 9 10 ubuntu 15 04 desktop i386 iso ubuntu 14 04 desktop i386 iso ubuntu 11 10 dvd amd64 iso ubuntu 15 04 desktop amd64 iso ubuntu 18 04 3 live server amd64 iso ubuntu 21 04 desktop amd64 iso ubuntu 16 10 ubuntu 23 04 live server amd64 iso ubuntu 18 10 server amd64 iso ubuntu 18 04 1 desktop amd64 iso ubuntu 14 04 4 desktop amd64 iso ubuntu 14 04 server amd64 ovum ubuntu 16 04 desktop i386 iso ubuntu 16 04 6 server amd64 iso ubuntu 10 10 xenon beta5 ubuntu 20 04 3 desktop amd64 iso ubuntu ubuntu 20 04 2 desktop amd64 iso ubuntu facile 01 2014 pdf ubuntu 12 04 5 dvd i386 iso ubuntu 17 10 desktop amd64 iso ubuntu mate 20 04 4 desktop amd64 iso ubuntu unleashed 2019 edition ubuntu 20 04 live server amd64 iso ubuntu 11 10 desktop i386 iso ubuntu ultimate edition 1 9 ubuntu netbook remix ubuntu 21 10 desktop amd64 iso ubuntu budgie 22 04 3 desktop amd64 iso ubuntu 16 04 3 server amd64 iso ubuntu 16 04 5 ubuntu 14 04 6 desktop amd64 mac iso ubuntu 12 10 desktop i386 iso ubuntu 9 10 пользовательская сборка ubuntu 20 04 2 0 desktop amd64 iso ubuntu 16 10 server arm64 iso ubuntu satanic edition 666 4 ubuntu 23 10 beta desktop amd64 iso ubuntu mate 19 10 desktop amd64 iso ubuntu 21 04 live server amd64 iso ubuntu 21 10 beta pack ubuntu 20 04 desktop amd64 iso ubuntu 12 04 4 desktop amd64 mac iso ubuntu 16 10 desktop i386 iso ubuntu linux ebook pack ubuntu 14 10 desktop amd64 iso ubuntu 13 04 desktop i386 iso ubuntu 12 04 5 desktop i386 iso ubuntu 18 04 ubuntu 16 04 7 server amd64 iso ubuntu 12 04 server i386 iso ubuntu 22 04 3 live server amd64 iso ubuntu server essential 6685 ubuntu 12 04 5 desktop amd64 iso ubuntu 14 10 server amd64 iso ubuntu 19 04 desktop amd64 iso ubuntu book ru djvu ubuntu 16 04 5 desktop amd64 iso ubuntu 15 04 server amd64 iso ubuntu unity 22 10 desktop amd64 iso ubuntu 11 10 oneiric ocelot ubuntu mate 21 10 desktop amd64 iso ubuntu 18 04 6 desktop amd64 iso ubuntu 20 10 desktop amd64 iso ubuntu 
facile aprile 2015 pdf ubuntu 18 04 desktop amd64 iso ubuntu server 20 04 2 lts ubuntu 18 04 4 desktop amd64 iso ubuntu pack 16 04 unity ubuntu 10 04 netbook ubuntu 14 04 1 server amd64 iso ubuntu facile marzo 2015 pdf ubuntu ultimate 1 4 dvd ubuntu 14 04 server i386 iso ubuntu 14 04 6 desktop i386 iso ubuntu 20 04 x64 untouched david1893 ubuntu 18 04 5 live server amd64 iso ubuntu 16 04 6 server i386 iso ubuntu facile 04 2014 pdf ubuntu 19 04 server amd64 iso ubuntu 16 04 7 desktop amd64 iso ubuntu 19 10 desktop amd64 iso ubuntu 22 04 2 desktop amd64 iso ubuntu 16 04 6 desktop i386 iso ubuntu 16 10 desktop amd64 iso ubuntu 19 10 live server amd64 iso ubuntu 14 04 desktop amd64 iso Clustering... Clustering results by cluster: Cluster 0: - Ubuntu Linux основы администрирования - ubuntu-mate-20.04.3-desktop-amd64.iso - Ubuntu 20.04.2.0 Desktop (64-bit) - Ubuntu 20.04.3 (AMD64) (Server) - ubuntu-20.04.4-desktop-amd64.iso - ubuntu-20.04.4-live-server-amd64.iso - ubuntu-22.04.1-desktop-amd64.iso - Ubuntu reducido - ubuntu - ubuntu-20.04.1-desktop-amd64.iso - ubuntu-22.04-desktop-amd64.iso - Ubuntu 20.04.1 Desktop.iso - ubuntu-15.04-desktop-amd64.iso - ubuntu-21.04-desktop-amd64.iso - ubuntu-20.04.3-desktop-amd64.iso - Ubuntu - ubuntu-20.04.2-desktop-amd64.iso - ubuntu-mate-20.04.4-desktop-amd64.iso - ubuntu-20.04-live-server-amd64.iso - Ubuntu Netbook Remix - ubuntu-budgie-22.04.3-desktop-amd64.iso - ubuntu-20.04.2.0-desktop-amd64.iso - ubuntu-20.04-desktop-amd64.iso - ubuntu-12.04.4-desktop-amd64+mac.iso - Ubuntu Linux ebook pack - ubuntu-12.04.5-desktop-amd64.iso - ubuntu-19.04-desktop-amd64.iso - Ubuntu-Book_RU.djvu - ubuntu-20.10-desktop-amd64.iso - Ubuntu Server 20.04.2 LTS - Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 - ubuntu-22.04.2-desktop-amd64.iso Cluster 1: - ubuntu-14.04.5-server-amd64.iso - ubuntu-14.10-desktop-i386.iso - ubuntu-14.04-desktop-i386.iso - ubuntu-14.04.4-desktop-amd64.iso - ubuntu-14.04-server-amd64.ova - ubuntu-14.04.6-desktop-amd64+mac.iso 
- ubuntu-14.10-desktop-amd64.iso - ubuntu-14.10-server-amd64.iso - ubuntu-14.04.1-server-amd64.iso - ubuntu-14.04-server-i386.iso - ubuntu-14.04.6-desktop-i386.iso - ubuntu-14.04-desktop-amd64.iso Cluster 2: - ubuntu-18.04-live-server-amd64.iso - Ubuntu 12.10 Desktop (i386) - ubuntu-22.10-desktop-amd64.iso - ubuntu-18.10-desktop-amd64.iso - Ubuntu 9.10 - ubuntu-18.04.3-live-server-amd64.iso - ubuntu-18.10-server-amd64.iso - ubuntu-18.04.1-desktop-amd64.iso - ubuntu-10.10-xenon-beta5 - ubuntu-17.10-desktop-amd64.iso - ubuntu-21.10-desktop-amd64.iso - Ubuntu 9.10 Пользовательская сборка - ubuntu-23.10-beta-desktop-amd64.iso - ubuntu-mate-19.10-desktop-amd64.iso - ubuntu-21.10-beta-pack - Ubuntu-18.04 - ubuntu-unity-22.10-desktop-amd64.iso - ubuntu-mate-21.10-desktop-amd64.iso - ubuntu-18.04.6-desktop-amd64.iso - ubuntu-18.04-desktop-amd64.iso - ubuntu-18.04.4-desktop-amd64.iso - Ubuntu 10.04 Netbook - ubuntu-18.04.5-live-server-amd64.iso - ubuntu-19.10-desktop-amd64.iso Cluster 5: - ubuntu-22.04-live-server-amd64.iso - ubuntu-17.04-server-amd64.iso - Ubuntu 16.10 - ubuntu-23.04-live-server-amd64.iso - ubuntu-16.04.6-server-amd64.iso - ubuntu-16.04.3-server-amd64.iso - Ubuntu-16.04.5 - ubuntu-16.10-server-arm64.iso - ubuntu-21.04-live-server-amd64.iso - ubuntu-16.04.7-server-amd64.iso - ubuntu-22.04.3-live-server-amd64.iso - Ubuntu Server Essentials - 6685 [ECLiPSE] - ubuntu-16.04.5-desktop-amd64.iso - ubuntu-15.04-server-amd64.iso - ubuntu-pack-16.04-unity - ubuntu-19.04-server-amd64.iso - ubuntu-16.04.7-desktop-amd64.iso - ubuntu-16.10-desktop-amd64.iso - ubuntu-19.10-live-server-amd64.iso Cluster 4: - ubuntu-11.04-desktop-amd64.iso - ubuntu-11.04-alternate-i386.iso - ubuntu-11.10-dvd-amd64.iso - ubuntu-11.10-desktop-i386.iso - Ubuntu 11.10 Oneiric Ocelot - ubuntu-ultimate-1.4-dvd Cluster 3: - [Ubuntu] Anonymous OS 0.1 - ubuntu-15.04-desktop-i386.iso - ubuntu-16.04-desktop-i386.iso - ubuntu-12.04.5-dvd-i386.iso - ubuntu-12.10-desktop-i386.iso - 
ubuntu-16.10-desktop-i386.iso - ubuntu-13.04-desktop-i386.iso - ubuntu-12.04.5-desktop-i386.iso - ubuntu-12.04-server-i386.iso - ubuntu-16.04.6-server-i386.iso - ubuntu-16.04.6-desktop-i386.iso Cluster 7: - Ubuntu Facile 01 2014.pdf - Ubuntu Facile - Aprile 2015.pdf - Ubuntu Facile Marzo 2015.pdf - Ubuntu Facile 04 2014.pdf Cluster 6: - Ubuntu Unleashed 2019 Edition - Ubuntu Ultimate Edition 1.9 - Ubuntu Satanic Edition 666.4 ```
drew2a commented 7 months ago

Expanding on the same idea: what if instead of searching for all similar entries in the search results, we look for entries similar only to the first (most relevant) result?

The developed script:

  1. Load Titles: Reads titles from a specified text file, which are then prepared for processing. This step is essential for acquiring the raw data that will be analyzed and clustered based on similarity.
  2. Download NLTK Resources: Utilizes the Natural Language Toolkit (NLTK) to download necessary resources such as stopwords and the WordNet lemmatizer. These resources are critical for the text preprocessing stage, enabling the removal of common words that offer little value to the analysis and the conversion of words to their base or root form.
  3. Preprocess Text: Employs several preprocessing techniques to clean and standardize the titles. This includes removing text inside parentheses and brackets with regular expressions, converting all text to lowercase to ensure uniformity, eliminating punctuation to reduce noise, removing stopwords to focus on meaningful words, and lemmatizing words to their base form to consolidate different forms of the same word.
  4. Vectorize Text Using TF-IDF: Transforms the preprocessed text into numerical vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) method. This vectorization reflects the importance of words within the titles relative to the dataset, allowing for the quantitative comparison of text.
  5. Calculate Cosine Similarity: After vectorization, calculates the cosine similarity between the vector of the first title and the vectors of all subsequent titles. This similarity measurement is pivotal in identifying titles that are most similar to the first, presumed most relevant title.
  6. Cluster Based on Similarity: Instead of applying a traditional clustering algorithm like K-means, titles are grouped based on their cosine similarity to the first title. This method clusters titles by directly comparing their similarity scores, allowing for dynamic cluster formation based on a predefined similarity threshold.
  7. Output Results by Similarity: Outputs the titles, organized by their similarity to the first title. This step highlights the effectiveness of using the search engine's ranking to prioritize and group results, showcasing the clustered titles in a structured and understandable format.
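
For reference, the weighting and similarity used in steps 4 and 5 can be written out as follows. This assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF, L2 normalization), which is what the script relies on; here $n$ is the number of titles, $\mathrm{df}(t)$ the number of titles containing term $t$, and $d_1$ the first title.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}},
\qquad
\cos(d_1, d_j) = \frac{\sum_{t} w_{t,d_1}\, w_{t,d_j}}
                      {\lVert w_{d_1} \rVert \, \lVert w_{d_j} \rVert}
               = \sum_{t} w_{t,d_1}\, w_{t,d_j}
```

The last equality holds because each vector already has unit length. A term such as `ubuntu`, which occurs in almost every title, gets a small IDF and therefore contributes little to the similarity.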

But the results of the experiment are still far from ideal, and even from a minimum viable product.

```python
import re
import string
from pathlib import Path

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file.
# Note: set() removes duplicates but does not preserve the file order,
# so results[0] is not guaranteed to be the top-ranked search result.
results = list(
    r for r in set(Path('/ubuntu.txt').read_text().split('\n')) if r
)
print('Results:')
for r in results[:50]:
    print(f'\t{r}')

first_title = results[0]

# Preprocess each title
preprocessed_results = [preprocess_text(title) for title in results]
for r in preprocessed_results:
    print(f'\t{r}')

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_results)

# Calculate cosine similarity between the first title and all titles
similarity_matrix = cosine_similarity(X[0:1], X)

# Define a similarity threshold
similarity_threshold = 0.4  # Adjust this threshold as needed

# Filter indices by similarity threshold and keep similarity values
filtered_indices_and_similarity = [
    (i, similarity_matrix[0, i])
    for i in range(similarity_matrix.shape[1])
    if similarity_matrix[0, i] >= similarity_threshold
]

# Sort filtered indices by similarity with the first title, keeping similarity values
sorted_filtered_indices_and_similarity = sorted(
    filtered_indices_and_similarity, key=lambda x: -x[1]
)

# Gather similar titles with their similarity values
similar_titles_with_similarity = [
    (results[i], similarity)
    for i, similarity in sorted_filtered_indices_and_similarity
]

# Print similar titles with similarity values
print(f"\nTitles similar to the first title ({first_title}):")
for title, similarity in similar_titles_with_similarity:
    print(f"- {title} (Similarity: {similarity:.3f})")
```

Output:

```
Results: ubuntu-16.04.6-server-amd64.iso ubuntu-20.04-live-server-amd64.iso ubuntu-18.04.3-live-server-amd64.iso ubuntu-19.10-desktop-amd64.iso ubuntu-18.04.6-desktop-amd64.iso ubuntu-23.10-beta-desktop-amd64.iso ubuntu-18.04.5-live-server-amd64.iso ubuntu-14.04-desktop-i386.iso Ubuntu 16.10 ubuntu-18.10-server-amd64.iso ubuntu-16.10-desktop-i386.iso ubuntu-14.04.1-server-amd64.iso Ubuntu Satanic Edition 666.4 ubuntu-mate-19.10-desktop-amd64.iso ubuntu-22.04.1-desktop-amd64.iso ubuntu-16.04.6-server-i386.iso ubuntu-16.04.3-server-amd64.iso ubuntu-mate-21.10-desktop-amd64.iso ubuntu-22.10-desktop-amd64.iso ubuntu-22.04.2-desktop-amd64.iso ubuntu-19.04-server-amd64.iso ubuntu-16.10-server-arm64.iso Ubuntu Netbook Remix ubuntu-14.04.6-desktop-amd64+mac.iso ubuntu-21.04-desktop-amd64.iso ubuntu-17.04-server-amd64.iso ubuntu-pack-16.04-unity Ubuntu 20.04.2.0 Desktop (64-bit) ubuntu-14.04-server-amd64.ova Ubuntu-16.04.5 ubuntu-16.04.5-desktop-amd64.iso ubuntu-15.04-desktop-i386.iso Ubuntu 12.10 Desktop (i386) ubuntu-12.04.4-desktop-amd64+mac.iso ubuntu-11.04-desktop-amd64.iso Ubuntu Server 20.04.2 LTS ubuntu-20.04.1-desktop-amd64.iso ubuntu-15.04-server-amd64.iso ubuntu-16.04-desktop-i386.iso Ubuntu reducido ubuntu-14.04-server-i386.iso ubuntu-19.04-desktop-amd64.iso ubuntu-12.04.5-desktop-amd64.iso ubuntu-14.04.6-desktop-i386.iso ubuntu-12.10-desktop-i386.iso ubuntu-21.10-beta-pack ubuntu-ultimate-1.4-dvd ubuntu-mate-20.04.4-desktop-amd64.iso ubuntu-18.04-live-server-amd64.iso Ubuntu 10.04 Netbook Preprocessed results: ubuntu 16 04 6 server amd64 iso ubuntu 20 04 live server amd64 iso ubuntu 18 04 3 live server amd64 iso ubuntu 19 10 desktop amd64 iso
ubuntu 18 04 6 desktop amd64 iso ubuntu 23 10 beta desktop amd64 iso ubuntu 18 04 5 live server amd64 iso ubuntu 14 04 desktop i386 iso ubuntu 16 10 ubuntu 18 10 server amd64 iso ubuntu 16 10 desktop i386 iso ubuntu 14 04 1 server amd64 iso ubuntu satanic edition 666 4 ubuntu mate 19 10 desktop amd64 iso ubuntu 22 04 1 desktop amd64 iso ubuntu 16 04 6 server i386 iso ubuntu 16 04 3 server amd64 iso ubuntu mate 21 10 desktop amd64 iso ubuntu 22 10 desktop amd64 iso ubuntu 22 04 2 desktop amd64 iso ubuntu 19 04 server amd64 iso ubuntu 16 10 server arm64 iso ubuntu netbook remix ubuntu 14 04 6 desktop amd64 mac iso ubuntu 21 04 desktop amd64 iso ubuntu 17 04 server amd64 iso ubuntu pack 16 04 unity ubuntu 20 04 2 0 desktop ubuntu 14 04 server amd64 ovum ubuntu 16 04 5 ubuntu 16 04 5 desktop amd64 iso ubuntu 15 04 desktop i386 iso ubuntu 12 10 desktop ubuntu 12 04 4 desktop amd64 mac iso ubuntu 11 04 desktop amd64 iso ubuntu server 20 04 2 lts ubuntu 20 04 1 desktop amd64 iso ubuntu 15 04 server amd64 iso ubuntu 16 04 desktop i386 iso ubuntu reducido ubuntu 14 04 server i386 iso ubuntu 19 04 desktop amd64 iso ubuntu 12 04 5 desktop amd64 iso ubuntu 14 04 6 desktop i386 iso ubuntu 12 10 desktop i386 iso ubuntu 21 10 beta pack ubuntu ultimate 1 4 dvd ubuntu mate 20 04 4 desktop amd64 iso ubuntu 18 04 live server amd64 iso ubuntu 10 04 netbook ubuntu 14 04 4 desktop amd64 iso ubuntu 20 04 1 desktop iso ubuntu 14 04 5 server amd64 iso ubuntu 12 04 5 desktop i386 iso ubuntu linux ebook pack anonymous o 0 1 ubuntu 20 04 desktop amd64 iso ubuntu 11 10 desktop i386 iso ubuntu book ru djvu ubuntu 9 10 ubuntu 23 04 live server amd64 iso ubuntu 17 10 desktop amd64 iso ubuntu 16 04 6 desktop i386 iso ubuntu mate 20 04 3 desktop amd64 iso ubuntu ubuntu linux основы администрирования ubuntu 14 04 desktop amd64 iso ubuntu 20 04 2 0 desktop amd64 iso ubuntu 21 10 desktop amd64 iso ubuntu 20 04 x64 untouched david1893 ubuntu 12 04 5 dvd i386 iso ubuntu 12 04 server i386 iso ubuntu 
ubuntu 20 04 2 desktop amd64 iso ubuntu 14 10 desktop i386 iso ubuntu 16 04 7 server amd64 iso ubuntu 11 10 dvd amd64 iso ubuntu facile 04 2014 pdf ubuntu 14 10 desktop amd64 iso ubuntu 18 10 desktop amd64 iso ubuntu 18 04 1 desktop amd64 iso ubuntu 16 04 7 desktop amd64 iso ubuntu 15 04 desktop amd64 iso ubuntu 22 04 3 live server amd64 iso ubuntu 19 10 live server amd64 iso ubuntu unleashed 2019 edition ubuntu server essential 6685 ubuntu 20 10 desktop amd64 iso ubuntu budgie 22 04 3 desktop amd64 iso ubuntu 18 04 desktop amd64 iso ubuntu 22 04 desktop amd64 iso ubuntu 14 10 server amd64 iso ubuntu 20 04 3 ubuntu 13 04 desktop i386 iso ubuntu 20 04 4 live server amd64 iso ubuntu 18 04 ubuntu 20 04 4 desktop amd64 iso ubuntu 18 04 4 desktop amd64 iso ubuntu 11 10 oneiric ocelot ubuntu 16 10 desktop amd64 iso ubuntu facile marzo 2015 pdf ubuntu 9 10 пользовательская сборка ubuntu 10 10 xenon beta5 ubuntu 20 04 3 desktop amd64 iso ubuntu 11 04 alternate i386 iso ubuntu facile aprile 2015 pdf ubuntu unity 22 10 desktop amd64 iso ubuntu facile 01 2014 pdf ubuntu ultimate edition 1 9 ubuntu 21 04 live server amd64 iso ubuntu 22 04 live server amd64 iso Titles similar to the first title (ubuntu-16.04.6-server-amd64.iso): ubuntu-16.04.6-server-amd64.iso (Similarity: 1.000) ubuntu-16.04.3-server-amd64.iso (Similarity: 1.000) ubuntu-16.04.7-server-amd64.iso (Similarity: 1.000) ubuntu-16.04.5-desktop-amd64.iso (Similarity: 0.795) ubuntu-16.04.7-desktop-amd64.iso (Similarity: 0.795) ubuntu-16.04.6-server-i386.iso (Similarity: 0.790) Ubuntu-16.04.5 (Similarity: 0.742) ubuntu-16.10-desktop-amd64.iso (Similarity: 0.639) ubuntu-16.04-desktop-i386.iso (Similarity: 0.593) ubuntu-16.04.6-desktop-i386.iso (Similarity: 0.593) ubuntu-14.04.1-server-amd64.iso (Similarity: 0.584) ubuntu-14.04.5-server-amd64.iso (Similarity: 0.584) Ubuntu 16.10 (Similarity: 0.542) ubuntu-16.10-server-arm64.iso (Similarity: 0.536) ubuntu-19.04-server-amd64.iso (Similarity: 0.525) 
ubuntu-15.04-server-amd64.iso (Similarity: 0.497) ubuntu-20.04-live-server-amd64.iso (Similarity: 0.492) ubuntu-20.04.4-live-server-amd64.iso (Similarity: 0.492) ubuntu-17.04-server-amd64.iso (Similarity: 0.478) ubuntu-18.04.3-live-server-amd64.iso (Similarity: 0.473) ubuntu-18.04.5-live-server-amd64.iso (Similarity: 0.473) ubuntu-18.04-live-server-amd64.iso (Similarity: 0.473) ubuntu-16.10-desktop-i386.iso (Similarity: 0.471) ubuntu-22.04.3-live-server-amd64.iso (Similarity: 0.464) ubuntu-22.04-live-server-amd64.iso (Similarity: 0.464) ubuntu-14.10-server-amd64.iso (Similarity: 0.455) ubuntu-18.10-server-amd64.iso (Similarity: 0.446) ubuntu-21.04-live-server-amd64.iso (Similarity: 0.446) ubuntu-14.04-server-i386.iso (Similarity: 0.423) ubuntu-23.04-live-server-amd64.iso (Similarity: 0.416) ubuntu-12.04-server-i386.iso (Similarity: 0.401) ```
synctext commented 7 months ago

We learned something! I genuinely don't find it a bad start.

Can you create a tiny example with identical Ubuntu-server.iso filenames and try to get numeric clusters, with 18.04 and 18.10 together, plus 22.04 and 22.10? Seems number signal is thrown away?
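
One way to picture such numeric clusters, independent of TF-IDF, is to pull the release number out of each title and group by its first component. This is only a sketch of a possible local heuristic, not something from the scripts in this thread; `VERSION_RE`, `release_of` and `group_by_major` are hypothetical names, and the pattern assumes Ubuntu-style `YY.MM` release numbers.

```python
import re
from collections import defaultdict

# Assumption: releases look like "18.04", "22.10" or "20.04.2.0".
# The lookbehind stops the match from starting in the middle of a longer number.
VERSION_RE = re.compile(r"(?<!\d)(\d{1,2})\.(\d{2})(?:\.\d+)*")


def release_of(title):
    """Return (major, minor) such as (18, '04'), or None if no release is found."""
    match = VERSION_RE.search(title)
    return (int(match.group(1)), match.group(2)) if match else None


def group_by_major(titles):
    """Group titles by the major release number; the rest go to 'unversioned'."""
    groups = defaultdict(list)
    for title in titles:
        release = release_of(title)
        key = f"Ubuntu {release[0]}" if release else "unversioned"
        groups[key].append(title)
    return dict(groups)


titles = [
    "ubuntu-18.04-live-server-amd64.iso",
    "ubuntu-18.10-server-amd64.iso",
    "ubuntu-22.04-live-server-amd64.iso",
    "ubuntu-22.10-desktop-amd64.iso",
    "Ubuntu Netbook Remix",
]
groups = group_by_major(titles)
```

Here 18.04 and 18.10 end up together, as do 22.04 and 22.10, while titles without a recognizable release fall into a separate group.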

drew2a commented 7 months ago

I've modified the original script to address the question "Seems number signal is thrown away?" and to print all TF-IDF values for each term.

Indeed, it turned out that some numbers were being ignored, specifically those consisting of a single digit. This happened because the vectorizer, by default, disregards all terms composed of only one character.

In the example below, the number 5 is ignored:

    Original:     ubuntu-12.04.5-desktop-i386.iso
    Preprocessed: ubuntu 12 04 5 desktop i386 iso
    TF-IDF:
            12: 0.668
            i386: 0.530
            desktop: 0.318
            04: 0.275
            iso: 0.248
            ubuntu: 0.185
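
The effect can be reproduced without scikit-learn by applying the vectorizer's default token pattern directly. The first pattern below is the documented default `token_pattern` of `TfidfVectorizer`; the second is one possible replacement that keeps single-character tokens. Whether the fix in this thread used exactly that pattern is an assumption.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer:
# a token needs at least two word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Possible replacement (assumption): accept tokens of one character or more.
SINGLE_CHAR_TOKEN_PATTERN = r"(?u)\b\w+\b"

text = "ubuntu 12 04 5 desktop i386 iso"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)
all_tokens = re.findall(SINGLE_CHAR_TOKEN_PATTERN, text)

print(default_tokens)  # the lone "5" is missing
print(all_tokens)      # the lone "5" is kept
```

Passing such a pattern as `TfidfVectorizer(token_pattern=...)` would keep point-release digits like `5` in the vocabulary.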

I've fixed it and also added more output to show the terms around which titles were grouped into clusters. To do this, I analyzed the centroids of the clusters determined by the K-means algorithm. The centroids represent the "center" or "mean" vector of each cluster in the feature space, essentially capturing the average importance of each term within the cluster. By examining these centroids, we can identify which terms have the highest TF-IDF values across the documents in a cluster, giving us insight into the thematic essence of each cluster.

This information should provide us with a better understanding of the details involved in grouping items into clusters.
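
As a rough illustration of the centroid analysis, the sketch below averages a few sparse term-weight vectors and lists the heaviest terms. The weights are invented for the example and the helper names are hypothetical. With scikit-learn, the same information would come from `kmeans.cluster_centers_` together with `vectorizer.get_feature_names_out()`.

```python
def centroid(vectors):
    """Mean of sparse {term: weight} vectors; missing terms count as 0."""
    totals = {}
    for vector in vectors:
        for term, weight in vector.items():
            totals[term] = totals.get(term, 0.0) + weight
    return {term: total / len(vectors) for term, total in totals.items()}


def top_terms(vectors, n=3):
    """Terms with the highest average weight, ties broken alphabetically."""
    center = centroid(vectors)
    return sorted(center, key=lambda term: (-center[term], term))[:n]


# Invented TF-IDF weights for two titles assumed to be in the same cluster.
cluster = [
    {"14": 0.7, "desktop": 0.4, "ubuntu": 0.2},
    {"14": 0.6, "server": 0.5, "ubuntu": 0.2},
]
print(top_terms(cluster, n=2))
```

In this toy cluster the term `14` dominates the centroid, which is the kind of signal that explains why a cluster formed around a particular release number.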

``` Features: ['0' '1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '2014' '2015' '2019' '21' '22' '23' '3' '4' '5' '6' '666' '6685' '7' '9' 'alternate' 'amd64' 'anonymous' 'aprile' 'arm64' 'beta' 'beta5' 'book' 'budgie' 'david1893' 'desktop' 'djvu' 'dvd' 'ebook' 'edition' 'essential' 'facile' 'i386' 'iso' 'linux' 'live' 'lts' 'mac' 'marzo' 'mate' 'netbook' 'o' 'ocelot' 'oneiric' 'ovum' 'pack' 'pdf' 'reducido' 'remix' 'ru' 'satanic' 'server' 'ubuntu' 'ultimate' 'unity' 'unleashed' 'untouched' 'x64' 'xenon' 'администрирования' 'основы' 'пользовательская' 'сборка'] Original and preprocessed titles with their TF-IDF vectors: Original: Ubuntu-16.04.5 Preprocessed: ubuntu 16 4 5 TF-IDF: 5: 0.721 16: 0.596 4: 0.291 ubuntu: 0.200 Original: ubuntu-mate-21.10-desktop-amd64.iso Preprocessed: ubuntu mate 21 10 desktop amd64 iso TF-IDF: mate: 0.606 21: 0.579 10: 0.342 desktop: 0.255 amd64: 0.235 iso: 0.199 ubuntu: 0.149 Original: Ubuntu 9.10 Пользовательская сборка Preprocessed: ubuntu 9 10 пользовательская сборка TF-IDF: сборка: 0.578 пользовательская: 0.578 9: 0.498 10: 0.266 ubuntu: 0.116 Original: Ubuntu Linux ebook pack Preprocessed: ubuntu linux ebook pack TF-IDF: ebook: 0.617 linux: 0.567 pack: 0.532 ubuntu: 0.124 Original: ubuntu-23.04-live-server-amd64.iso Preprocessed: ubuntu 23 4 live server amd64 iso TF-IDF: 23: 0.684 live: 0.492 server: 0.353 amd64: 0.236 4: 0.218 iso: 0.200 ubuntu: 0.149 Original: ubuntu-12.04.4-desktop-amd64+mac.iso Preprocessed: ubuntu 12 4 4 desktop amd64 mac iso TF-IDF: mac: 0.643 12: 0.507 4: 0.409 desktop: 0.241 amd64: 0.222 iso: 0.188 ubuntu: 0.140 Original: ubuntu-11.04-alternate-i386.iso Preprocessed: ubuntu 11 4 alternate i386 iso TF-IDF: alternate: 0.684 11: 0.534 i386: 0.393 4: 0.200 iso: 0.184 ubuntu: 0.137 Original: ubuntu-22.04.1-desktop-amd64.iso Preprocessed: ubuntu 22 4 1 desktop amd64 iso TF-IDF: 22: 0.599 1: 0.581 desktop: 0.294 amd64: 0.271 4: 0.250 iso: 0.229 ubuntu: 0.172 Original: 
ubuntu-18.04.1-desktop-amd64.iso Preprocessed: ubuntu 18 4 1 desktop amd64 iso TF-IDF: 1: 0.593 18: 0.576 desktop: 0.300 amd64: 0.276 4: 0.255 iso: 0.234 ubuntu: 0.175 Original: Ubuntu 20.04.1 Desktop.iso Preprocessed: ubuntu 20 4 1 desktop iso TF-IDF: 1: 0.646 20: 0.545 desktop: 0.327 4: 0.278 iso: 0.255 ubuntu: 0.191 Original: ubuntu-16.04-desktop-i386.iso Preprocessed: ubuntu 16 4 desktop i386 iso TF-IDF: 16: 0.598 i386: 0.573 desktop: 0.343 4: 0.292 iso: 0.268 ubuntu: 0.200 Original: ubuntu-17.10-desktop-amd64.iso Preprocessed: ubuntu 17 10 desktop amd64 iso TF-IDF: 17: 0.780 10: 0.391 desktop: 0.292 amd64: 0.269 iso: 0.228 ubuntu: 0.170 Original: ubuntu-23.10-beta-desktop-amd64.iso Preprocessed: ubuntu 23 10 beta desktop amd64 iso TF-IDF: beta: 0.615 23: 0.615 10: 0.309 desktop: 0.230 amd64: 0.212 iso: 0.180 ubuntu: 0.134 Original: ubuntu-20.04.2.0-desktop-amd64.iso Preprocessed: ubuntu 20 4 2 0 desktop amd64 iso TF-IDF: 0: 0.595 2: 0.539 20: 0.396 desktop: 0.237 amd64: 0.219 4: 0.202 iso: 0.185 ubuntu: 0.139 Original: ubuntu-12.04-server-i386.iso Preprocessed: ubuntu 12 4 server i386 iso TF-IDF: 12: 0.641 i386: 0.508 server: 0.420 4: 0.259 iso: 0.238 ubuntu: 0.178 Original: Ubuntu Unleashed 2019 Edition Preprocessed: ubuntu unleashed 2019 edition TF-IDF: 2019: 0.599 unleashed: 0.599 edition: 0.517 ubuntu: 0.120 Original: Ubuntu Linux основы администрирования Preprocessed: ubuntu linux основы администрирования TF-IDF: администрирования: 0.589 основы: 0.589 linux: 0.541 ubuntu: 0.118 Original: ubuntu-22.10-desktop-amd64.iso Preprocessed: ubuntu 22 10 desktop amd64 iso TF-IDF: 22: 0.689 10: 0.453 desktop: 0.338 amd64: 0.311 iso: 0.264 ubuntu: 0.197 Original: ubuntu-ultimate-1.4-dvd Preprocessed: ubuntu ultimate 1 4 dvd TF-IDF: ultimate: 0.623 dvd: 0.584 1: 0.461 4: 0.198 ubuntu: 0.136 Original: ubuntu-19.04-server-amd64.iso Preprocessed: ubuntu 19 4 server amd64 iso TF-IDF: 19: 0.734 server: 0.446 amd64: 0.297 4: 0.275 iso: 0.252 ubuntu: 0.189 Original: 
ubuntu-15.04-server-amd64.iso Preprocessed: ubuntu 15 4 server amd64 iso TF-IDF: 15: 0.766 server: 0.422 amd64: 0.281 4: 0.260 iso: 0.239 ubuntu: 0.178 Original: ubuntu-budgie-22.04.3-desktop-amd64.iso Preprocessed: ubuntu budgie 22 4 3 desktop amd64 iso TF-IDF: budgie: 0.641 3: 0.464 22: 0.449 desktop: 0.221 amd64: 0.203 4: 0.188 iso: 0.172 ubuntu: 0.129 Original: ubuntu-16.10-desktop-i386.iso Preprocessed: ubuntu 16 10 desktop i386 iso TF-IDF: 16: 0.563 i386: 0.540 10: 0.433 desktop: 0.323 iso: 0.252 ubuntu: 0.189 Original: ubuntu-18.04.6-desktop-amd64.iso Preprocessed: ubuntu 18 4 6 desktop amd64 iso TF-IDF: 6: 0.631 18: 0.555 desktop: 0.289 amd64: 0.266 4: 0.246 iso: 0.226 ubuntu: 0.169 Original: ubuntu-16.04.6-desktop-i386.iso Preprocessed: ubuntu 16 4 6 desktop i386 iso TF-IDF: 6: 0.599 16: 0.478 i386: 0.458 desktop: 0.275 4: 0.234 iso: 0.214 ubuntu: 0.160 Original: ubuntu-18.04.4-desktop-amd64.iso Preprocessed: ubuntu 18 4 4 desktop amd64 iso TF-IDF: 18: 0.627 4: 0.555 desktop: 0.327 amd64: 0.301 iso: 0.255 ubuntu: 0.191 Original: ubuntu-18.04-desktop-amd64.iso Preprocessed: ubuntu 18 4 desktop amd64 iso TF-IDF: 18: 0.715 desktop: 0.373 amd64: 0.343 4: 0.317 iso: 0.291 ubuntu: 0.217 Original: ubuntu-18.10-desktop-amd64.iso Preprocessed: ubuntu 18 10 desktop amd64 iso TF-IDF: 18: 0.667 10: 0.466 desktop: 0.348 amd64: 0.320 iso: 0.271 ubuntu: 0.203 Original: ubuntu-14.04-desktop-i386.iso Preprocessed: ubuntu 14 4 desktop i386 iso TF-IDF: 14: 0.615 i386: 0.563 desktop: 0.338 4: 0.287 iso: 0.263 ubuntu: 0.197 Original: ubuntu Preprocessed: ubuntu TF-IDF: ubuntu: 1.000 Original: ubuntu-12.04.5-desktop-amd64.iso Preprocessed: ubuntu 12 4 5 desktop amd64 iso TF-IDF: 12: 0.598 5: 0.598 desktop: 0.284 amd64: 0.262 4: 0.242 iso: 0.222 ubuntu: 0.166 Original: ubuntu-22.04.2-desktop-amd64.iso Preprocessed: ubuntu 22 4 2 desktop amd64 iso TF-IDF: 2: 0.634 22: 0.569 desktop: 0.279 amd64: 0.257 4: 0.237 iso: 0.218 ubuntu: 0.163 Original: Ubuntu 11.10 Oneiric Ocelot 
Preprocessed: ubuntu 11 10 oneiric ocelot TF-IDF: ocelot: 0.591 oneiric: 0.591 11: 0.462 10: 0.273 ubuntu: 0.119 Original: ubuntu-22.04.3-live-server-amd64.iso Preprocessed: ubuntu 22 4 3 live server amd64 iso TF-IDF: 3: 0.515 22: 0.499 live: 0.470 server: 0.338 amd64: 0.225 4: 0.208 iso: 0.191 ubuntu: 0.143 Original: ubuntu-11.10-dvd-amd64.iso Preprocessed: ubuntu 11 10 dvd amd64 iso TF-IDF: dvd: 0.646 11: 0.586 10: 0.346 amd64: 0.237 iso: 0.201 ubuntu: 0.151 Original: ubuntu-14.04.6-desktop-i386.iso Preprocessed: ubuntu 14 4 6 desktop i386 iso TF-IDF: 6: 0.593 14: 0.496 i386: 0.453 desktop: 0.272 4: 0.231 iso: 0.212 ubuntu: 0.159 Original: ubuntu-20.04.1-desktop-amd64.iso Preprocessed: ubuntu 20 4 1 desktop amd64 iso TF-IDF: 1: 0.618 20: 0.522 desktop: 0.313 amd64: 0.288 4: 0.266 iso: 0.244 ubuntu: 0.183 Original: Ubuntu 9.10 Preprocessed: ubuntu 9 10 TF-IDF: 9: 0.864 10: 0.462 ubuntu: 0.201 Original: ubuntu-14.04.5-server-amd64.iso Preprocessed: ubuntu 14 4 5 server amd64 iso TF-IDF: 5: 0.603 14: 0.523 server: 0.395 amd64: 0.264 4: 0.244 iso: 0.224 ubuntu: 0.167 Original: ubuntu-14.04-desktop-amd64.iso Preprocessed: ubuntu 14 4 desktop amd64 iso TF-IDF: 14: 0.697 desktop: 0.382 amd64: 0.352 4: 0.325 iso: 0.298 ubuntu: 0.223 Original: ubuntu-12.04.5-desktop-i386.iso Preprocessed: ubuntu 12 4 5 desktop i386 iso TF-IDF: 12: 0.556 5: 0.556 i386: 0.441 desktop: 0.264 4: 0.225 iso: 0.206 ubuntu: 0.154 Original: ubuntu-10.10-xenon-beta5 Preprocessed: ubuntu 10 10 xenon beta5 TF-IDF: beta5: 0.588 xenon: 0.588 10: 0.542 ubuntu: 0.118 Original: Ubuntu Server Essentials - 6685 [ECLiPSE] Preprocessed: ubuntu server essential 6685 TF-IDF: 6685: 0.664 essential: 0.664 server: 0.315 ubuntu: 0.133 Original: ubuntu-14.04.6-desktop-amd64+mac.iso Preprocessed: ubuntu 14 4 6 desktop amd64 mac iso TF-IDF: mac: 0.617 6: 0.504 14: 0.421 desktop: 0.231 amd64: 0.213 4: 0.196 iso: 0.180 ubuntu: 0.135 Original: ubuntu-pack-16.04-unity Preprocessed: ubuntu pack 16 4 unity TF-IDF: unity: 
0.638 pack: 0.599 16: 0.416 4: 0.203 ubuntu: 0.139 Original: ubuntu-19.10-desktop-amd64.iso Preprocessed: ubuntu 19 10 desktop amd64 iso TF-IDF: 19: 0.727 10: 0.429 desktop: 0.320 amd64: 0.295 iso: 0.250 ubuntu: 0.187 Original: ubuntu-14.04-server-i386.iso Preprocessed: ubuntu 14 4 server i386 iso TF-IDF: 14: 0.586 i386: 0.536 server: 0.443 4: 0.273 iso: 0.251 ubuntu: 0.187 Original: ubuntu-20.04.3-desktop-amd64.iso Preprocessed: ubuntu 20 4 3 desktop amd64 iso TF-IDF: 3: 0.642 20: 0.509 desktop: 0.305 amd64: 0.281 4: 0.259 iso: 0.238 ubuntu: 0.178 Original: ubuntu-18.04.3-live-server-amd64.iso Preprocessed: ubuntu 18 4 3 live server amd64 iso TF-IDF: 3: 0.522 18: 0.477 live: 0.477 server: 0.343 amd64: 0.228 4: 0.211 iso: 0.194 ubuntu: 0.145 Original: Ubuntu Server 20.04.2 LTS Preprocessed: ubuntu server 20 4 2 lts TF-IDF: lts: 0.661 2: 0.516 20: 0.379 server: 0.314 4: 0.193 ubuntu: 0.133 Original: Ubuntu 20.04.2.0 Desktop (64-bit) Preprocessed: ubuntu 20 4 2 0 desktop TF-IDF: 0: 0.621 2: 0.563 20: 0.414 desktop: 0.248 4: 0.211 ubuntu: 0.145 Original: ubuntu-21.10-desktop-amd64.iso Preprocessed: ubuntu 21 10 desktop amd64 iso TF-IDF: 21: 0.727 10: 0.429 desktop: 0.320 amd64: 0.295 iso: 0.250 ubuntu: 0.187 Original: Ubuntu Ultimate Edition 1.9 Preprocessed: ubuntu ultimate edition 1 9 TF-IDF: ultimate: 0.546 edition: 0.512 9: 0.512 1: 0.404 ubuntu: 0.119 Original: Ubuntu Facile 01 2014.pdf Preprocessed: ubuntu facile 1 2014 pdf TF-IDF: 2014: 0.561 pdf: 0.499 facile: 0.499 1: 0.415 ubuntu: 0.123 Original: Ubuntu Netbook Remix Preprocessed: ubuntu netbook remix TF-IDF: remix: 0.728 netbook: 0.670 ubuntu: 0.146 Original: Ubuntu 12.10 Desktop (i386) Preprocessed: ubuntu 12 10 desktop TF-IDF: 12: 0.765 10: 0.487 desktop: 0.364 ubuntu: 0.212 Original: ubuntu-22.04-live-server-amd64.iso Preprocessed: ubuntu 22 4 live server amd64 iso TF-IDF: 22: 0.582 live: 0.548 server: 0.394 amd64: 0.263 4: 0.243 iso: 0.223 ubuntu: 0.167 Original: ubuntu-16.04.6-server-amd64.iso 
Preprocessed: ubuntu 16 4 6 server amd64 iso TF-IDF: 6: 0.624 16: 0.498 server: 0.395 amd64: 0.263 4: 0.243 iso: 0.223 ubuntu: 0.167 Original: [Ubuntu] Anonymous OS 0.1 Preprocessed: anonymous o 0 1 TF-IDF: o: 0.559 anonymous: 0.559 0: 0.482 1: 0.380 Original: ubuntu-14.04-server-amd64.ova Preprocessed: ubuntu 14 4 server amd64 ovum TF-IDF: ovum: 0.736 14: 0.462 server: 0.350 amd64: 0.233 4: 0.215 ubuntu: 0.148 Original: ubuntu-20.04.2-desktop-amd64.iso Preprocessed: ubuntu 20 4 2 desktop amd64 iso TF-IDF: 2: 0.671 20: 0.493 desktop: 0.295 amd64: 0.272 4: 0.251 iso: 0.230 ubuntu: 0.172 Original: Ubuntu Facile - Aprile 2015.pdf Preprocessed: ubuntu facile aprile 2015 pdf TF-IDF: aprile: 0.557 2015: 0.512 pdf: 0.455 facile: 0.455 ubuntu: 0.112 Original: ubuntu-16.04.3-server-amd64.iso Preprocessed: ubuntu 16 4 3 server amd64 iso TF-IDF: 3: 0.610 16: 0.505 server: 0.400 amd64: 0.267 4: 0.247 iso: 0.226 ubuntu: 0.169 Original: ubuntu-20.04-desktop-amd64.iso Preprocessed: ubuntu 20 4 desktop amd64 iso TF-IDF: 20: 0.665 desktop: 0.398 amd64: 0.367 4: 0.339 iso: 0.311 ubuntu: 0.232 Original: ubuntu-16.04.7-server-amd64.iso Preprocessed: ubuntu 16 4 7 server amd64 iso TF-IDF: 7: 0.699 16: 0.456 server: 0.361 amd64: 0.241 4: 0.223 iso: 0.204 ubuntu: 0.153 Original: ubuntu-16.04.5-desktop-amd64.iso Preprocessed: ubuntu 16 4 5 desktop amd64 iso TF-IDF: 5: 0.635 16: 0.525 desktop: 0.302 amd64: 0.278 4: 0.257 iso: 0.235 ubuntu: 0.176 Original: Ubuntu Preprocessed: ubuntu TF-IDF: ubuntu: 1.000 Original: Ubuntu 20.04.3 (AMD64) (Server) Preprocessed: ubuntu 20 4 3 TF-IDF: 3: 0.732 20: 0.580 4: 0.296 ubuntu: 0.203 Original: ubuntu-20.04.4-desktop-amd64.iso Preprocessed: ubuntu 20 4 4 desktop amd64 iso TF-IDF: 4: 0.584 20: 0.573 desktop: 0.344 amd64: 0.316 iso: 0.268 ubuntu: 0.200 Original: ubuntu-14.04.4-desktop-amd64.iso Preprocessed: ubuntu 14 4 4 desktop amd64 iso TF-IDF: 14: 0.607 4: 0.566 desktop: 0.333 amd64: 0.307 iso: 0.260 ubuntu: 0.194 Original: 
ubuntu-11.04-desktop-amd64.iso Preprocessed: ubuntu 11 4 desktop amd64 iso TF-IDF: 11: 0.771 desktop: 0.340 amd64: 0.312 4: 0.289 iso: 0.265 ubuntu: 0.198 Original: ubuntu-14.10-desktop-i386.iso Preprocessed: ubuntu 14 10 desktop i386 iso TF-IDF: 14: 0.581 i386: 0.532 10: 0.427 desktop: 0.319 iso: 0.249 ubuntu: 0.186 Original: ubuntu-16.04.7-desktop-amd64.iso Preprocessed: ubuntu 16 4 7 desktop amd64 iso TF-IDF: 7: 0.722 16: 0.471 desktop: 0.270 amd64: 0.249 4: 0.230 iso: 0.211 ubuntu: 0.158 Original: ubuntu-19.10-live-server-amd64.iso Preprocessed: ubuntu 19 10 live server amd64 iso TF-IDF: 19: 0.600 live: 0.507 server: 0.364 10: 0.354 amd64: 0.243 iso: 0.206 ubuntu: 0.154 Original: ubuntu-unity-22.10-desktop-amd64.iso Preprocessed: ubuntu unity 22 10 desktop amd64 iso TF-IDF: unity: 0.671 22: 0.511 10: 0.336 desktop: 0.251 amd64: 0.231 iso: 0.196 ubuntu: 0.146 Original: ubuntu-mate-19.10-desktop-amd64.iso Preprocessed: ubuntu mate 19 10 desktop amd64 iso TF-IDF: mate: 0.606 19: 0.579 10: 0.342 desktop: 0.255 amd64: 0.235 iso: 0.199 ubuntu: 0.149 Original: ubuntu-13.04-desktop-i386.iso Preprocessed: ubuntu 13 4 desktop i386 iso TF-IDF: 13: 0.779 i386: 0.448 desktop: 0.268 4: 0.228 iso: 0.209 ubuntu: 0.156 Original: ubuntu-17.04-server-amd64.iso Preprocessed: ubuntu 17 4 server amd64 iso TF-IDF: 17: 0.786 server: 0.406 amd64: 0.271 4: 0.250 iso: 0.229 ubuntu: 0.172 Original: Ubuntu reducido Preprocessed: ubuntu reducido TF-IDF: reducido: 0.980 ubuntu: 0.197 Original: ubuntu-11.10-desktop-i386.iso Preprocessed: ubuntu 11 10 desktop i386 iso TF-IDF: 11: 0.664 i386: 0.488 10: 0.392 desktop: 0.293 iso: 0.228 ubuntu: 0.171 Original: ubuntu-16.04.6-server-i386.iso Preprocessed: ubuntu 16 4 6 server i386 iso TF-IDF: 6: 0.580 16: 0.463 i386: 0.444 server: 0.367 4: 0.226 iso: 0.207 ubuntu: 0.155 Original: Ubuntu-Book_RU.djvu Preprocessed: ubuntu book ru djvu TF-IDF: djvu: 0.574 ru: 0.574 book: 0.574 ubuntu: 0.115 Original: ubuntu-20.04.4-live-server-amd64.iso Preprocessed: 
ubuntu 20 4 4 live server amd64 iso TF-IDF: live: 0.531 4: 0.470 20: 0.462 server: 0.382 amd64: 0.255 iso: 0.216 ubuntu: 0.161 Original: ubuntu-18.04-live-server-amd64.iso Preprocessed: ubuntu 18 4 live server amd64 iso TF-IDF: 18: 0.559 live: 0.559 server: 0.402 amd64: 0.268 4: 0.247 iso: 0.227 ubuntu: 0.170 Original: ubuntu-15.04-desktop-i386.iso Preprocessed: ubuntu 15 4 desktop i386 iso TF-IDF: 15: 0.731 i386: 0.487 desktop: 0.292 4: 0.248 iso: 0.228 ubuntu: 0.170 Original: ubuntu-15.04-desktop-amd64.iso Preprocessed: ubuntu 15 4 desktop amd64 iso TF-IDF: 15: 0.800 desktop: 0.320 amd64: 0.294 4: 0.272 iso: 0.249 ubuntu: 0.186 Original: ubuntu-mate-20.04.4-desktop-amd64.iso Preprocessed: ubuntu mate 20 4 4 desktop amd64 iso TF-IDF: mate: 0.632 4: 0.452 20: 0.444 desktop: 0.266 amd64: 0.245 iso: 0.208 ubuntu: 0.155 Original: ubuntu-mate-20.04.3-desktop-amd64.iso Preprocessed: ubuntu mate 20 4 3 desktop amd64 iso TF-IDF: mate: 0.587 3: 0.520 20: 0.412 desktop: 0.247 amd64: 0.227 4: 0.210 iso: 0.193 ubuntu: 0.144 Original: ubuntu-14.10-server-amd64.iso Preprocessed: ubuntu 14 10 server amd64 iso TF-IDF: 14: 0.614 server: 0.465 10: 0.451 amd64: 0.310 iso: 0.263 ubuntu: 0.196 Original: ubuntu-21.04-live-server-amd64.iso Preprocessed: ubuntu 21 4 live server amd64 iso TF-IDF: 21: 0.623 live: 0.527 server: 0.379 amd64: 0.253 4: 0.233 iso: 0.214 ubuntu: 0.160 Original: ubuntu-18.10-server-amd64.iso Preprocessed: ubuntu 18 10 server amd64 iso TF-IDF: 18: 0.634 server: 0.455 10: 0.442 amd64: 0.304 iso: 0.257 ubuntu: 0.193 Original: Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 Preprocessed: ubuntu 20 4 x64 untouched david1893 TF-IDF: david1893: 0.538 untouched: 0.538 x64: 0.538 20: 0.309 4: 0.157 ubuntu: 0.108 Original: ubuntu-16.10-desktop-amd64.iso Preprocessed: ubuntu 16 10 desktop amd64 iso TF-IDF: 16: 0.631 10: 0.485 desktop: 0.362 amd64: 0.333 iso: 0.283 ubuntu: 0.211 Original: ubuntu-22.04-desktop-amd64.iso Preprocessed: ubuntu 22 4 desktop amd64 iso TF-IDF: 22: 
0.735 desktop: 0.361 amd64: 0.332 4: 0.307 iso: 0.282 ubuntu: 0.211 Original: Ubuntu-18.04 Preprocessed: ubuntu 18 4 TF-IDF: 18: 0.881 4: 0.390 ubuntu: 0.268 Original: Ubuntu 10.04 Netbook Preprocessed: ubuntu 10 4 netbook TF-IDF: netbook: 0.845 10: 0.424 4: 0.269 ubuntu: 0.185 Original: ubuntu-20.04-live-server-amd64.iso Preprocessed: ubuntu 20 4 live server amd64 iso TF-IDF: live: 0.582 20: 0.506 server: 0.418 amd64: 0.279 4: 0.258 iso: 0.236 ubuntu: 0.177 Original: ubuntu-21.04-desktop-amd64.iso Preprocessed: ubuntu 21 4 desktop amd64 iso TF-IDF: 21: 0.771 desktop: 0.340 amd64: 0.312 4: 0.289 iso: 0.265 ubuntu: 0.198 Original: Ubuntu Facile Marzo 2015.pdf Preprocessed: ubuntu facile marzo 2015 pdf TF-IDF: marzo: 0.557 2015: 0.512 pdf: 0.455 facile: 0.455 ubuntu: 0.112 Original: ubuntu-20.10-desktop-amd64.iso Preprocessed: ubuntu 20 10 desktop amd64 iso TF-IDF: 20: 0.614 10: 0.493 desktop: 0.368 amd64: 0.339 iso: 0.287 ubuntu: 0.215 Original: ubuntu-14.10-desktop-amd64.iso Preprocessed: ubuntu 14 10 desktop amd64 iso TF-IDF: 14: 0.648 10: 0.476 desktop: 0.355 amd64: 0.327 iso: 0.277 ubuntu: 0.207 Original: Ubuntu 16.10 Preprocessed: ubuntu 16 10 TF-IDF: 16: 0.766 10: 0.590 ubuntu: 0.257 Original: ubuntu-12.04.5-dvd-i386.iso Preprocessed: ubuntu 12 4 5 dvd i386 iso TF-IDF: dvd: 0.566 12: 0.475 5: 0.475 i386: 0.377 4: 0.192 iso: 0.176 ubuntu: 0.132 Original: ubuntu-14.04.1-server-amd64.iso Preprocessed: ubuntu 14 4 1 server amd64 iso TF-IDF: 1: 0.579 14: 0.534 server: 0.404 amd64: 0.270 4: 0.249 iso: 0.229 ubuntu: 0.171 Original: ubuntu-21.10-beta-pack Preprocessed: ubuntu 21 10 beta pack TF-IDF: beta: 0.587 pack: 0.551 21: 0.499 10: 0.294 ubuntu: 0.128 Original: ubuntu-16.10-server-arm64.iso Preprocessed: ubuntu 16 10 server arm64 iso TF-IDF: arm64: 0.724 16: 0.434 server: 0.344 10: 0.334 iso: 0.194 ubuntu: 0.145 Original: Ubuntu Satanic Edition 666.4 Preprocessed: ubuntu satanic edition 666 4 TF-IDF: 666: 0.590 satanic: 0.590 edition: 0.509 4: 0.173 ubuntu: 0.119 
    Original: ubuntu-12.10-desktop-i386.iso
    Preprocessed: ubuntu 12 10 desktop i386 iso
    TF-IDF:
            12: 0.636
            i386: 0.504
            10: 0.405
            desktop: 0.302
            iso: 0.236
            ubuntu: 0.176

    Original: ubuntu-19.04-desktop-amd64.iso
    Preprocessed: ubuntu 19 4 desktop amd64 iso
    TF-IDF:
            19: 0.771
            desktop: 0.340
            amd64: 0.312
            4: 0.289
            iso: 0.265
            ubuntu: 0.198

    Original: ubuntu-18.04.5-live-server-amd64.iso
    Preprocessed: ubuntu 18 4 5 live server amd64 iso
    TF-IDF:
            5: 0.522
            18: 0.477
            live: 0.477
            server: 0.343
            amd64: 0.228
            4: 0.211
            iso: 0.194
            ubuntu: 0.145

    Original: Ubuntu Facile 04 2014.pdf
    Preprocessed: ubuntu facile 4 2014 pdf
    TF-IDF:
            2014: 0.605
            pdf: 0.538
            facile: 0.538
            4: 0.193
            ubuntu: 0.132

Clustering...

Clustering results by cluster, including top features and their weights:

Cluster 0 (Top Features: 10 (0.323), netbook (0.216), 9 (0.195), ubuntu (0.145), remix (0.104)):
    Ubuntu 9.10 Пользовательская сборка (Distance to Centroid: 0.835)
    Ubuntu 11.10 Oneiric Ocelot (Distance to Centroid: 0.903)
    Ubuntu 9.10 (Distance to Centroid: 0.771)
    ubuntu-10.10-xenon-beta5 (Distance to Centroid: 0.839)
    Ubuntu Netbook Remix (Distance to Centroid: 0.896)
    Ubuntu 10.04 Netbook (Distance to Centroid: 0.757)
    ubuntu-21.10-beta-pack (Distance to Centroid: 0.896)

Cluster 1 (Top Features: 16 (0.529), iso (0.180), ubuntu (0.177), 4 (0.175), i386 (0.144)):
    Ubuntu-16.04.5 (Distance to Centroid: 0.750)
    ubuntu-16.04-desktop-i386.iso (Distance to Centroid: 0.590)
    ubuntu-16.10-desktop-i386.iso (Distance to Centroid: 0.630)
    ubuntu-16.04.6-desktop-i386.iso (Distance to Centroid: 0.652)
    ubuntu-pack-16.04-unity (Distance to Centroid: 0.914)
    ubuntu-16.04.6-server-amd64.iso (Distance to Centroid: 0.654)
    ubuntu-16.04.3-server-amd64.iso (Distance to Centroid: 0.723)
    ubuntu-16.04.7-server-amd64.iso (Distance to Centroid: 0.724)
    ubuntu-16.04.5-desktop-amd64.iso (Distance to Centroid: 0.667)
    ubuntu-16.04.7-desktop-amd64.iso (Distance to Centroid: 0.721)
    ubuntu-16.04.6-server-i386.iso (Distance to Centroid: 0.658)
    ubuntu-16.10-desktop-amd64.iso (Distance to Centroid: 0.600)
    Ubuntu 16.10 (Distance to Centroid: 0.671)
    ubuntu-16.10-server-arm64.iso (Distance to Centroid: 0.820)

Cluster 2 (Top Features: live (0.518), server (0.372), 4 (0.256), amd64 (0.248), iso (0.210)):
    ubuntu-23.04-live-server-amd64.iso (Distance to Centroid: 0.670)
    ubuntu-22.04.3-live-server-amd64.iso (Distance to Centroid: 0.603)
    ubuntu-18.04.3-live-server-amd64.iso (Distance to Centroid: 0.553)
    ubuntu-22.04-live-server-amd64.iso (Distance to Centroid: 0.531)
    ubuntu-20.04.4-live-server-amd64.iso (Distance to Centroid: 0.492)
    ubuntu-18.04-live-server-amd64.iso (Distance to Centroid: 0.458)
    ubuntu-21.04-live-server-amd64.iso (Distance to Centroid: 0.620)
    ubuntu-20.04-live-server-amd64.iso (Distance to Centroid: 0.486)
    ubuntu-18.04.5-live-server-amd64.iso (Distance to Centroid: 0.605)

Cluster 3 (Top Features: 10 (0.239), desktop (0.233), 12 (0.209), i386 (0.209), iso (0.208)):
    ubuntu-mate-21.10-desktop-amd64.iso (Distance to Centroid: 0.865)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Distance to Centroid: 0.840)
    ubuntu-11.04-alternate-i386.iso (Distance to Centroid: 0.931)
    ubuntu-17.10-desktop-amd64.iso (Distance to Centroid: 0.855)
    ubuntu-23.10-beta-desktop-amd64.iso (Distance to Centroid: 0.911)
    ubuntu-12.04-server-i386.iso (Distance to Centroid: 0.793)
    ubuntu-22.10-desktop-amd64.iso (Distance to Centroid: 0.789)
    ubuntu-18.10-desktop-amd64.iso (Distance to Centroid: 0.802)
    ubuntu-12.04.5-desktop-amd64.iso (Distance to Centroid: 0.769)
    ubuntu-11.10-dvd-amd64.iso (Distance to Centroid: 0.890)
    ubuntu-12.04.5-desktop-i386.iso (Distance to Centroid: 0.723)
    ubuntu-21.10-desktop-amd64.iso (Distance to Centroid: 0.802)
    Ubuntu 12.10 Desktop (i386) (Distance to Centroid: 0.737)
    ubuntu-14.10-desktop-i386.iso (Distance to Centroid: 0.750)
    ubuntu-unity-22.10-desktop-amd64.iso (Distance to Centroid: 0.873)
    ubuntu-13.04-desktop-i386.iso (Distance to Centroid: 0.882)
    ubuntu-11.10-desktop-i386.iso (Distance to Centroid: 0.733)
    ubuntu-15.04-desktop-i386.iso (Distance to Centroid: 0.861)
    ubuntu-12.04.5-dvd-i386.iso (Distance to Centroid: 0.823)
    ubuntu-12.10-desktop-i386.iso (Distance to Centroid: 0.607)

Cluster 4 (Top Features: pdf (0.278), facile (0.278), 1 (0.237), ultimate (0.167), 2014 (0.167)):
    ubuntu-ultimate-1.4-dvd (Distance to Centroid: 0.880)
    Ubuntu Ultimate Edition 1.9 (Distance to Centroid: 0.895)
    Ubuntu Facile 01 2014.pdf (Distance to Centroid: 0.621)
    [Ubuntu] Anonymous OS 0.1 (Distance to Centroid: 0.962)
    Ubuntu Facile - Aprile 2015.pdf (Distance to Centroid: 0.762)
    Ubuntu Facile Marzo 2015.pdf (Distance to Centroid: 0.762)
    Ubuntu Facile 04 2014.pdf (Distance to Centroid: 0.707)

Cluster 5 (Top Features: 20 (0.490), 4 (0.264), desktop (0.239), amd64 (0.182), iso (0.173)):
    Ubuntu 20.04.1 Desktop.iso (Distance to Centroid: 0.655)
    ubuntu-20.04.2.0-desktop-amd64.iso (Distance to Centroid: 0.676)
    ubuntu-20.04.1-desktop-amd64.iso (Distance to Centroid: 0.609)
    ubuntu-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.578)
    Ubuntu Server 20.04.2 LTS (Distance to Centroid: 0.879)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Distance to Centroid: 0.748)
    ubuntu-20.04.2-desktop-amd64.iso (Distance to Centroid: 0.567)
    ubuntu-20.04-desktop-amd64.iso (Distance to Centroid: 0.441)
    Ubuntu 20.04.3 (AMD64) (Server) (Distance to Centroid: 0.738)
    ubuntu-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.474)
    ubuntu-mate-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.640)
    ubuntu-mate-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.679)
    Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Distance to Centroid: 0.995)
    ubuntu-20.10-desktop-amd64.iso (Distance to Centroid: 0.653)

Cluster 6 (Top Features: ubuntu (0.228), 4 (0.205), 14 (0.200), amd64 (0.176), desktop (0.173)):
    Ubuntu Linux ebook pack (Distance to Centroid: 1.061)
    ubuntu-22.04.1-desktop-amd64.iso (Distance to Centroid: 0.810)
    ubuntu-18.04.1-desktop-amd64.iso (Distance to Centroid: 0.782)
    Ubuntu Unleashed 2019 Edition (Distance to Centroid: 1.063)
    Ubuntu Linux основы администрирования (Distance to Centroid: 1.062)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Distance to Centroid: 0.904)
    ubuntu-18.04.6-desktop-amd64.iso (Distance to Centroid: 0.793)
    ubuntu-18.04.4-desktop-amd64.iso (Distance to Centroid: 0.716)
    ubuntu-18.04-desktop-amd64.iso (Distance to Centroid: 0.732)
    ubuntu-14.04-desktop-i386.iso (Distance to Centroid: 0.743)
    ubuntu (Distance to Centroid: 0.900)
    ubuntu-22.04.2-desktop-amd64.iso (Distance to Centroid: 0.850)
    ubuntu-14.04.6-desktop-i386.iso (Distance to Centroid: 0.790)
    ubuntu-14.04.5-server-amd64.iso (Distance to Centroid: 0.798)
    ubuntu-14.04-desktop-amd64.iso (Distance to Centroid: 0.630)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Distance to Centroid: 0.812)
    ubuntu-14.04-server-i386.iso (Distance to Centroid: 0.799)
    ubuntu-14.04-server-amd64.ova (Distance to Centroid: 0.873)
    Ubuntu (Distance to Centroid: 0.900)
    ubuntu-14.04.4-desktop-amd64.iso (Distance to Centroid: 0.627)
    ubuntu-11.04-desktop-amd64.iso (Distance to Centroid: 0.839)
    Ubuntu reducido (Distance to Centroid: 1.055)
    Ubuntu-Book_RU.djvu (Distance to Centroid: 1.072)
    ubuntu-15.04-desktop-amd64.iso (Distance to Centroid: 0.855)
    ubuntu-14.10-server-amd64.iso (Distance to Centroid: 0.803)
    ubuntu-22.04-desktop-amd64.iso (Distance to Centroid: 0.773)
    Ubuntu-18.04 (Distance to Centroid: 0.890)
    ubuntu-21.04-desktop-amd64.iso (Distance to Centroid: 0.839)
    ubuntu-14.10-desktop-amd64.iso (Distance to Centroid: 0.744)
    ubuntu-14.04.1-server-amd64.iso (Distance to Centroid: 0.763)
    Ubuntu Satanic Edition 666.4 (Distance to Centroid: 1.030)

Cluster 7 (Top Features: 19 (0.379), server (0.268), amd64 (0.249), iso (0.211), 10 (0.174)):
    ubuntu-19.04-server-amd64.iso (Distance to Centroid: 0.515)
    ubuntu-15.04-server-amd64.iso (Distance to Centroid: 0.851)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Distance to Centroid: 1.017)
    ubuntu-19.10-desktop-amd64.iso (Distance to Centroid: 0.602)
    ubuntu-19.10-live-server-amd64.iso (Distance to Centroid: 0.595)
    ubuntu-mate-19.10-desktop-amd64.iso (Distance to Centroid: 0.708)
    ubuntu-17.04-server-amd64.iso (Distance to Centroid: 0.860)
    ubuntu-18.10-server-amd64.iso (Distance to Centroid: 0.795)
    ubuntu-19.04-desktop-amd64.iso (Distance to Centroid: 0.622)
```

The script:

```python
from collections import defaultdict
from pathlib import Path

import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
)

first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern to include single-character tokens (including single-digit numbers).
# The token_pattern r'(?u)\b\w+\b' matches any word of one or more alphanumeric characters, allowing the inclusion of single-letter words and digits in the analysis.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]

    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]

    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]

    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)

    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])

    # Print sorted TF-IDF values
    print(f'\tOriginal: {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")

# Cluster using K-means
kmeans = KMeans(random_state=42)
kmeans.fit(X)

# Getting cluster centroids
centroids = kmeans.cluster_centers_

# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
    sorted_feature_indices = centroid.argsort()[::-1]
    top_n = 5  # Number of key words
    top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
    cluster_top_features_with_weights[i] = top_features_with_weights

# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)

# Grouping titles by their clusters
for i, label in enumerate(labels):
    clusters[label].append(results[i])

# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)

# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
    top_features_str = ', '.join(
        [f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]])
    print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
    for title in clusters[cluster]:
        # Find the index of the current title
        title_index = results.index(title)
        # Calculate "fit" metric as the distance to the centroid of its cluster
        # The distance itself is used as a metric of fit
        fit_metric = distances_to_centroids[title_index, cluster]
        print(f"\t{title} (Distance to Centroid: {fit_metric:.3f})")
```
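
A side note on why the custom `token_pattern` matters here. scikit-learn's default pattern is `(?u)\b\w\w+\b`, which requires at least two word characters, so single-digit version parts such as `4` or `5` would never become features. The sketch below only approximates the vectorizer's tokenization step with the standard `re` module and does not call scikit-learn. The `tokens` helper is a hypothetical name introduced for illustration.

```python
import re

# scikit-learn's documented default token_pattern: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern used in the script above: one or more word characters.
CUSTOM_PATTERN = r"(?u)\b\w+\b"


def tokens(pattern, text):
    """Hypothetical helper: approximate the tokenization step with re.findall."""
    return re.findall(pattern, text)


# Preprocessed form of "Ubuntu-16.04.5", as shown in the output above.
title = "ubuntu 16 4 5"

default_tokens = tokens(DEFAULT_PATTERN, title)
custom_tokens = tokens(CUSTOM_PATTERN, title)

print(default_tokens)  # ['ubuntu', '16']
print(custom_tokens)   # ['ubuntu', '16', '4', '5']
```

With the default pattern, `Ubuntu-16.04.5` and `Ubuntu-16.10` would both reduce to `ubuntu 16`, which is why the script keeps single-character tokens.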
drew2a commented 7 months ago

> Can you create a tiny example with equal Ubuntu-server.iso filename and try get -numeric- clusters? With 18.04 and 18.10 together plus 22.04 and 22.10. Seems number signal is thrown away?

I've slightly modified the `token_pattern` passed to `TfidfVectorizer`, from `(?u)\b\w+\b` to `(?u)\b\d+\b` (which keeps only purely numeric tokens, so `amd64` and `i386` are dropped along with the words), and obtained results quite close to what was requested:

```
Cluster 0 (Top Features: 17 (0.923), 10 (0.224), 4 (0.152), 9 (0.000), 2 (0.000)):
    ubuntu-17.10-desktop-amd64.iso (Distance to Centroid: 0.272)
    ubuntu-17.04-server-amd64.iso (Distance to Centroid: 0.272)

Cluster 1 (Top Features: 14 (0.773), 4 (0.295), 10 (0.148), 6 (0.123), 5 (0.060)):
    ubuntu-14.04-server-i386.iso (Distance to Centroid: 0.279)
    ubuntu-14.04-server-amd64.ova (Distance to Centroid: 0.279)
    ubuntu-14.04-desktop-i386.iso (Distance to Centroid: 0.279)
    ubuntu-14.04-desktop-amd64.iso (Distance to Centroid: 0.279)
    ubuntu-14.04.4-desktop-amd64.iso (Distance to Centroid: 0.442)
    ubuntu-14.10-desktop-i386.iso (Distance to Centroid: 0.554)
    ubuntu-14.10-desktop-amd64.iso (Distance to Centroid: 0.554)
    ubuntu-14.10-server-amd64.iso (Distance to Centroid: 0.554)
    ubuntu-14.04.6-desktop-i386.iso (Distance to Centroid: 0.655)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Distance to Centroid: 0.655)
    ubuntu-14.04.1-server-amd64.iso (Distance to Centroid: 0.685)
    ubuntu-14.04.5-server-amd64.iso (Distance to Centroid: 0.708)

Cluster 2 (Top Features: 20 (0.660), 4 (0.378), 2 (0.170), 3 (0.140), 1 (0.091)):
    ubuntu-20.04-desktop-amd64.iso (Distance to Centroid: 0.352)
    Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Distance to Centroid: 0.352)
    ubuntu-20.04-live-server-amd64.iso (Distance to Centroid: 0.352)
    ubuntu-20.04.4-live-server-amd64.iso (Distance to Centroid: 0.423)
    ubuntu-mate-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.423)
    ubuntu-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.423)
    Ubuntu Server 20.04.2 LTS (Distance to Centroid: 0.644)
    ubuntu-20.04.2-desktop-amd64.iso (Distance to Centroid: 0.644)
    ubuntu-mate-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.651)
    ubuntu-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.651)
    Ubuntu 20.04.3 (AMD64) (Server) (Distance to Centroid: 0.651)
    ubuntu-20.04.1-desktop-amd64.iso (Distance to Centroid: 0.683)
    Ubuntu 20.04.1 Desktop.iso (Distance to Centroid: 0.683)
    ubuntu-20.10-desktop-amd64.iso (Distance to Centroid: 0.752)
    ubuntu-20.04.2.0-desktop-amd64.iso (Distance to Centroid: 0.776)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Distance to Centroid: 0.776)

Cluster 3 (Top Features: 21 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
    ubuntu-21.10-desktop-amd64.iso (Distance to Centroid: 0.249)
    ubuntu-mate-21.10-desktop-amd64.iso (Distance to Centroid: 0.249)
    ubuntu-21.10-beta-pack (Distance to Centroid: 0.249)
    ubuntu-21.04-desktop-amd64.iso (Distance to Centroid: 0.373)
    ubuntu-21.04-live-server-amd64.iso (Distance to Centroid: 0.373)

Cluster 4 (Top Features: 10 (0.167), 4 (0.143), 12 (0.123), 19 (0.101), 11 (0.101)):
    Ubuntu Linux ebook pack (Distance to Centroid: 0.325)
    Ubuntu (Distance to Centroid: 0.325)
    ubuntu (Distance to Centroid: 0.325)
    Ubuntu-Book_RU.djvu (Distance to Centroid: 0.325)
    Ubuntu reducido (Distance to Centroid: 0.325)
    Ubuntu Linux основы администрирования (Distance to Centroid: 0.325)
    Ubuntu Netbook Remix (Distance to Centroid: 0.325)
    Ubuntu 10.04 Netbook (Distance to Centroid: 0.818)
    ubuntu-12.10-desktop-i386.iso (Distance to Centroid: 0.847)
    Ubuntu 12.10 Desktop (i386) (Distance to Centroid: 0.847)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Distance to Centroid: 0.857)
    ubuntu-19.10-live-server-amd64.iso (Distance to Centroid: 0.872)
    ubuntu-mate-19.10-desktop-amd64.iso (Distance to Centroid: 0.872)
    Ubuntu 11.10 Oneiric Ocelot (Distance to Centroid: 0.872)
    ubuntu-19.10-desktop-amd64.iso (Distance to Centroid: 0.872)
    ubuntu-11.10-desktop-i386.iso (Distance to Centroid: 0.872)
    ubuntu-11.10-dvd-amd64.iso (Distance to Centroid: 0.872)
    ubuntu-12.04-server-i386.iso (Distance to Centroid: 0.877)
    ubuntu-10.10-xenon-beta5 (Distance to Centroid: 0.878)
    ubuntu-12.04.5-desktop-i386.iso (Distance to Centroid: 0.892)
    ubuntu-12.04.5-dvd-i386.iso (Distance to Centroid: 0.892)
    ubuntu-12.04.5-desktop-amd64.iso (Distance to Centroid: 0.892)
    ubuntu-11.04-desktop-amd64.iso (Distance to Centroid: 0.903)
    ubuntu-19.04-desktop-amd64.iso (Distance to Centroid: 0.903)
    ubuntu-11.04-alternate-i386.iso (Distance to Centroid: 0.903)
    ubuntu-19.04-server-amd64.iso (Distance to Centroid: 0.903)
    Ubuntu 9.10 Пользовательская сборка (Distance to Centroid: 0.920)
    Ubuntu 9.10 (Distance to Centroid: 0.920)
    ubuntu-ultimate-1.4-dvd (Distance to Centroid: 0.937)
    ubuntu-23.10-beta-desktop-amd64.iso (Distance to Centroid: 0.938)
    ubuntu-15.04-desktop-amd64.iso (Distance to Centroid: 0.944)
    ubuntu-15.04-server-amd64.iso (Distance to Centroid: 0.944)
    ubuntu-15.04-desktop-i386.iso (Distance to Centroid: 0.944)
    Ubuntu Ultimate Edition 1.9 (Distance to Centroid: 0.968)
    ubuntu-23.04-live-server-amd64.iso (Distance to Centroid: 0.969)
    Ubuntu Facile 04 2014.pdf (Distance to Centroid: 0.971)
    Ubuntu Facile 01 2014.pdf (Distance to Centroid: 0.983)
    ubuntu-13.04-desktop-i386.iso (Distance to Centroid: 0.992)
    Ubuntu Satanic Edition 666.4 (Distance to Centroid: 0.992)
    [Ubuntu] Anonymous OS 0.1 (Distance to Centroid: 1.000)
    Ubuntu Facile Marzo 2015.pdf (Distance to Centroid: 1.007)
    Ubuntu Facile - Aprile 2015.pdf (Distance to Centroid: 1.007)
    Ubuntu Unleashed 2019 Edition (Distance to Centroid: 1.030)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Distance to Centroid: 1.030)

Cluster 5 (Top Features: 16 (0.688), 4 (0.226), 10 (0.174), 6 (0.160), 7 (0.116)):
    ubuntu-pack-16.04-unity (Distance to Centroid: 0.416)
    ubuntu-16.04-desktop-i386.iso (Distance to Centroid: 0.416)
    ubuntu-16.10-server-arm64.iso (Distance to Centroid: 0.552)
    ubuntu-16.10-desktop-i386.iso (Distance to Centroid: 0.552)
    Ubuntu 16.10 (Distance to Centroid: 0.552)
    ubuntu-16.10-desktop-amd64.iso (Distance to Centroid: 0.552)
    ubuntu-16.04.6-desktop-i386.iso (Distance to Centroid: 0.645)
    ubuntu-16.04.6-server-amd64.iso (Distance to Centroid: 0.645)
    ubuntu-16.04.6-server-i386.iso (Distance to Centroid: 0.645)
    ubuntu-16.04.5-desktop-amd64.iso (Distance to Centroid: 0.694)
    Ubuntu-16.04.5 (Distance to Centroid: 0.694)
    ubuntu-16.04.3-server-amd64.iso (Distance to Centroid: 0.747)
    ubuntu-16.04.7-server-amd64.iso (Distance to Centroid: 0.760)
    ubuntu-16.04.7-desktop-amd64.iso (Distance to Centroid: 0.760)

Cluster 6 (Top Features: 18 (0.772), 4 (0.302), 10 (0.114), 6 (0.072), 5 (0.071)):
    Ubuntu-18.04 (Distance to Centroid: 0.252)
    ubuntu-18.04-live-server-amd64.iso (Distance to Centroid: 0.252)
    ubuntu-18.04-desktop-amd64.iso (Distance to Centroid: 0.252)
    ubuntu-18.04.4-desktop-amd64.iso (Distance to Centroid: 0.404)
    ubuntu-18.10-desktop-amd64.iso (Distance to Centroid: 0.569)
    ubuntu-18.10-server-amd64.iso (Distance to Centroid: 0.569)
    ubuntu-18.04.1-desktop-amd64.iso (Distance to Centroid: 0.648)
    ubuntu-18.04.5-live-server-amd64.iso (Distance to Centroid: 0.671)
    ubuntu-18.04.3-live-server-amd64.iso (Distance to Centroid: 0.671)
    ubuntu-18.04.6-desktop-amd64.iso (Distance to Centroid: 0.684)

Cluster 7 (Top Features: 22 (0.773), 4 (0.235), 3 (0.173), 10 (0.137), 2 (0.090)):
    ubuntu-22.04-live-server-amd64.iso (Distance to Centroid: 0.330)
    ubuntu-22.04-desktop-amd64.iso (Distance to Centroid: 0.330)
    ubuntu-22.10-desktop-amd64.iso (Distance to Centroid: 0.524)
    ubuntu-unity-22.10-desktop-amd64.iso (Distance to Centroid: 0.524)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Distance to Centroid: 0.561)
    ubuntu-22.04.3-live-server-amd64.iso (Distance to Centroid: 0.561)
    ubuntu-22.04.1-desktop-amd64.iso (Distance to Centroid: 0.638)
    ubuntu-22.04.2-desktop-amd64.iso (Distance to Centroid: 0.684)
```

The script:

```python
from collections import defaultdict
from pathlib import Path

import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text) # Remove leading zeros text = re.sub(r'\b0+(\d+)\b', r'\1', text) # Remove stopwords stop_words = set(stopwords.words('english')) words = text.split() words = [word for word in words if word and word not in stop_words] # Lemmatize lemmatizer = WordNetLemmatizer() words = [lemmatizer.lemmatize(word) for word in words] return ' '.join(words) # Load titles from a text file results = list(sorted( r for r in set(Path('/ubuntu.txt').read_text().split('\n')) if r )) first_title = results[0] # Preprocess titles preprocessed_results = [preprocess_text(title) for title in results] # Initialize TfidfVectorizer with a custom token pattern to include single-character tokens (including single-digit numbers). # The token_pattern r'(?u)\b\w+\b' matches any word of one or more alphanumeric characters, allowing the inclusion of single-letter words and digits in the analysis. vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b') X = vectorizer.fit_transform(preprocessed_results) # Get feature names (words) used by the TF-IDF vectorizer feature_names = vectorizer.get_feature_names_out() print(f'Features: \n{feature_names}') # Output original and preprocessed titles and their TF-IDF vectors print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n") for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)): # Accessing the i-th TF-IDF vector in sparse format directly tfidf_vector = X[i] # Extracting indices of non-zero elements (words that are actually present in the document) non_zero_indices = tfidf_vector.nonzero()[1] # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices] # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True) # Formatting 
the sorted TF-IDF values into a string for easy display sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples]) # Print sorted TF-IDF values print(f'\tOriginal: {original}') print(f'\tPreprocessed: {preprocessed}') print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n') print("Clustering...") # Cluster using K-means kmeans = KMeans(random_state=42) kmeans.fit(X) # Getting cluster centroids centroids = kmeans.cluster_centers_ # Identifying key words for each cluster and storing them in a dictionary feature_names = vectorizer.get_feature_names_out() cluster_top_features_with_weights = {} for i, centroid in enumerate(centroids): sorted_feature_indices = centroid.argsort()[::-1] top_n = 5 # Number of key words top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]] cluster_top_features_with_weights[i] = top_features_with_weights # Output clustering results by cluster, including top features labels = kmeans.labels_ clusters = defaultdict(list) # Grouping titles by their clusters for i, label in enumerate(labels): clusters[label].append(results[i]) # Calculate distances of each point to cluster centroids distances_to_centroids = kmeans.transform(X) # Printing clustering results by cluster, including top features for each cluster print("\nClustering results by cluster, including top features and their weights:") for cluster in sorted(clusters.keys()): top_features_str = ', '.join( f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster] ) print(f"\nCluster {cluster} (Top Features: {top_features_str}):") # Prepare a list to hold titles and their distances titles_and_distances = [] for title in clusters[cluster]: # Find the index of the current title title_index = results.index(title) # Calculate "fit" metric as the distance to the centroid of its cluster fit_metric = distances_to_centroids[title_index, cluster] # Add title and its distance to the 
list titles_and_distances.append((title, fit_metric)) # Sort titles within the cluster by their distance to the centroid (ascending order) sorted_titles_and_distances = sorted(titles_and_distances, key=lambda x: x[1]) # Print sorted titles by their distance to centroid for title, distance in sorted_titles_and_distances: print(f"\t{title} (Distance to Centroid: {distance:.3f})") ```
drew2a commented 7 months ago

In the previous example, elements were grouped fairly well, but one cluster (Cluster 4) contained elements close to noise. To identify such a cluster (and filter it out in the future), we calculated the intra-cluster dispersion: the average distance of a cluster's points from its centroid, which quantifies how cohesive the cluster is. The rationale is that a cluster whose points lie far from the centroid on average is less cohesive and likely contains more noise, making it a candidate for exclusion from further analysis.

To implement this, we first grouped the indices of the elements belonging to each cluster. Then, for each cluster, we built a matrix of its points by vertically stacking the corresponding rows of the TF-IDF matrix `X`. The `pairwise_distances` function then gave the Euclidean distance from each point in the cluster to the cluster's centroid, and the mean of these distances served as the measure of intra-cluster dispersion.

The average distance gives a simple measure of how tight each cluster is. Clusters with lower values are more cohesive: their elements are close to each other and to the cluster's overall theme. Clusters with higher values lack a unifying theme or contain outliers, so they are candidates for exclusion. In the results below, the noisy cluster (Cluster 1) has the highest intra-cluster distance, 0.827, while the other clusters range from 0.299 to 0.627.
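
The dispersion measure described above can be sketched on toy data without any third-party dependencies. This is only an illustration, not the script used for the results: the function name `intra_cluster_dispersion`, the 2-D points, and the `THRESHOLD` value are all hypothetical, and plain tuples stand in for TF-IDF rows.

```python
import math
from collections import defaultdict


def intra_cluster_dispersion(points, labels, centroids):
    """Return {cluster: mean Euclidean distance of its points to its centroid}."""
    distances = defaultdict(list)
    for point, label in zip(points, labels):
        distances[label].append(math.dist(point, centroids[label]))
    return {label: sum(d) / len(d) for label, d in distances.items()}


# Hypothetical toy data: cluster 0 is tight, cluster 1 is spread out.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 8.0)]
labels = [0, 0, 1, 1]
centroids = {0: (0.0, 1.0), 1: (10.0, 4.0)}

dispersion = intra_cluster_dispersion(points, labels, centroids)

# Illustrative cut-off (an assumption): clusters above it are treated as noisy.
THRESHOLD = 2.0
noisy = sorted(c for c, d in dispersion.items() if d > THRESHOLD)

print(dispersion)  # {0: 1.0, 1: 4.0}
print(noisy)       # [1]
```

A fixed threshold is the simplest choice; whether a fixed or a relative cut-off suits real torrent titles better is left open here.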

```
Cluster 0 (Top Features: 14 (0.773), 4 (0.295), 10 (0.148), 6 (0.123), 5 (0.060)):
Intra-cluster distance: 0.494
    ubuntu-14.04-desktop-amd64.iso (Distance to Centroid: 0.279)
    ubuntu-14.04-desktop-i386.iso (Distance to Centroid: 0.279)
    ubuntu-14.04-server-amd64.ova (Distance to Centroid: 0.279)
    ubuntu-14.04-server-i386.iso (Distance to Centroid: 0.279)
    ubuntu-14.04.4-desktop-amd64.iso (Distance to Centroid: 0.442)
    ubuntu-14.10-desktop-amd64.iso (Distance to Centroid: 0.554)
    ubuntu-14.10-desktop-i386.iso (Distance to Centroid: 0.554)
    ubuntu-14.10-server-amd64.iso (Distance to Centroid: 0.554)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Distance to Centroid: 0.655)
    ubuntu-14.04.6-desktop-i386.iso (Distance to Centroid: 0.655)
    ubuntu-14.04.1-server-amd64.iso (Distance to Centroid: 0.685)
    ubuntu-14.04.5-server-amd64.iso (Distance to Centroid: 0.708)

Cluster 1 (Top Features: 4 (0.174), 10 (0.162), 18 (0.153), 22 (0.119), 19 (0.097)):
Intra-cluster distance: 0.827
    Ubuntu (Distance to Centroid: 0.345)
    Ubuntu Linux ebook pack (Distance to Centroid: 0.345)
    Ubuntu Linux основы администрирования (Distance to Centroid: 0.345)
    Ubuntu Netbook Remix (Distance to Centroid: 0.345)
    Ubuntu reducido (Distance to Centroid: 0.345)
    Ubuntu-Book_RU.djvu (Distance to Centroid: 0.345)
    ubuntu (Distance to Centroid: 0.345)
    ubuntu-18.04.4-desktop-amd64.iso (Distance to Centroid: 0.812)
    Ubuntu 10.04 Netbook (Distance to Centroid: 0.812)
    ubuntu-18.10-desktop-amd64.iso (Distance to Centroid: 0.826)
    ubuntu-18.10-server-amd64.iso (Distance to Centroid: 0.826)
    Ubuntu-18.04 (Distance to Centroid: 0.835)
    ubuntu-18.04-desktop-amd64.iso (Distance to Centroid: 0.835)
    ubuntu-18.04-live-server-amd64.iso (Distance to Centroid: 0.835)
    ubuntu-22.10-desktop-amd64.iso (Distance to Centroid: 0.861)
    ubuntu-unity-22.10-desktop-amd64.iso (Distance to Centroid: 0.861)
    ubuntu-18.04.3-live-server-amd64.iso (Distance to Centroid: 0.870)
    ubuntu-22.04-desktop-amd64.iso (Distance to Centroid: 0.874)
    ubuntu-22.04-live-server-amd64.iso (Distance to Centroid: 0.874)
    ubuntu-19.10-desktop-amd64.iso (Distance to Centroid: 0.887)
    ubuntu-19.10-live-server-amd64.iso (Distance to Centroid: 0.887)
    ubuntu-mate-19.10-desktop-amd64.iso (Distance to Centroid: 0.887)
    ubuntu-10.10-xenon-beta5 (Distance to Centroid: 0.892)
    ubuntu-18.04.5-live-server-amd64.iso (Distance to Centroid: 0.894)
    ubuntu-22.04.3-live-server-amd64.iso (Distance to Centroid: 0.894)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Distance to Centroid: 0.894)
    ubuntu-18.04.6-desktop-amd64.iso (Distance to Centroid: 0.897)
    ubuntu-19.04-desktop-amd64.iso (Distance to Centroid: 0.903)
    ubuntu-19.04-server-amd64.iso (Distance to Centroid: 0.903)
    ubuntu-22.04.2-desktop-amd64.iso (Distance to Centroid: 0.922)
    ubuntu-15.04-desktop-amd64.iso (Distance to Centroid: 0.944)
    ubuntu-15.04-desktop-i386.iso (Distance to Centroid: 0.944)
    ubuntu-15.04-server-amd64.iso (Distance to Centroid: 0.944)
    Ubuntu 9.10 (Distance to Centroid: 0.948)
    Ubuntu 9.10 Пользовательская сборка (Distance to Centroid: 0.948)
    ubuntu-17.10-desktop-amd64.iso (Distance to Centroid: 0.950)
    ubuntu-23.10-beta-desktop-amd64.iso (Distance to Centroid: 0.950)
    ubuntu-17.04-server-amd64.iso (Distance to Centroid: 0.968)
    ubuntu-23.04-live-server-amd64.iso (Distance to Centroid: 0.968)
    Ubuntu Facile 04 2014.pdf (Distance to Centroid: 0.987)
    Ubuntu Satanic Edition 666.4 (Distance to Centroid: 0.991)
    ubuntu-13.04-desktop-i386.iso (Distance to Centroid: 0.991)
    Ubuntu Facile - Aprile 2015.pdf (Distance to Centroid: 1.016)
    Ubuntu Facile Marzo 2015.pdf (Distance to Centroid: 1.016)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Distance to Centroid: 1.037)
    Ubuntu Unleashed 2019 Edition (Distance to Centroid: 1.037)

Cluster 2 (Top Features: 1 (0.694), 4 (0.200), 20 (0.153), 2014 (0.101), 9 (0.098)):
Intra-cluster distance: 0.627
    ubuntu-ultimate-1.4-dvd (Distance to Centroid: 0.394)
    Ubuntu 20.04.1 Desktop.iso (Distance to Centroid: 0.518)
    ubuntu-20.04.1-desktop-amd64.iso (Distance to Centroid: 0.518)
    ubuntu-18.04.1-desktop-amd64.iso (Distance to Centroid: 0.639)
    ubuntu-22.04.1-desktop-amd64.iso (Distance to Centroid: 0.656)
    Ubuntu Ultimate Edition 1.9 (Distance to Centroid: 0.759)
    [Ubuntu] Anonymous OS 0.1 (Distance to Centroid: 0.759)
    Ubuntu Facile 01 2014.pdf (Distance to Centroid: 0.776)

Cluster 3 (Top Features: 21 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
Intra-cluster distance: 0.299
    ubuntu-21.10-beta-pack (Distance to Centroid: 0.249)
    ubuntu-21.10-desktop-amd64.iso (Distance to Centroid: 0.249)
    ubuntu-mate-21.10-desktop-amd64.iso (Distance to Centroid: 0.249)
    ubuntu-21.04-desktop-amd64.iso (Distance to Centroid: 0.373)
    ubuntu-21.04-live-server-amd64.iso (Distance to Centroid: 0.373)

Cluster 4 (Top Features: 12 (0.776), 5 (0.291), 4 (0.261), 10 (0.153), 9 (0.000)):
Intra-cluster distance: 0.466
    ubuntu-12.04-server-i386.iso (Distance to Centroid: 0.380)
    ubuntu-12.04.5-desktop-amd64.iso (Distance to Centroid: 0.429)
    ubuntu-12.04.5-desktop-i386.iso (Distance to Centroid: 0.429)
    ubuntu-12.04.5-dvd-i386.iso (Distance to Centroid: 0.429)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Distance to Centroid: 0.493)
    Ubuntu 12.10 Desktop (i386) (Distance to Centroid: 0.552)
    ubuntu-12.10-desktop-i386.iso (Distance to Centroid: 0.552)

Cluster 5 (Top Features: 16 (0.688), 4 (0.226), 10 (0.174), 6 (0.160), 7 (0.116)):
Intra-cluster distance: 0.616
    ubuntu-16.04-desktop-i386.iso (Distance to Centroid: 0.416)
    ubuntu-pack-16.04-unity (Distance to Centroid: 0.416)
    Ubuntu 16.10 (Distance to Centroid: 0.552)
    ubuntu-16.10-desktop-amd64.iso (Distance to Centroid: 0.552)
    ubuntu-16.10-desktop-i386.iso (Distance to Centroid: 0.552)
    ubuntu-16.10-server-arm64.iso (Distance to Centroid: 0.552)
    ubuntu-16.04.6-desktop-i386.iso (Distance to Centroid: 0.645)
    ubuntu-16.04.6-server-amd64.iso (Distance to Centroid: 0.645)
    ubuntu-16.04.6-server-i386.iso (Distance to Centroid: 0.645)
    Ubuntu-16.04.5 (Distance to Centroid: 0.694)
    ubuntu-16.04.5-desktop-amd64.iso (Distance to Centroid: 0.694)
    ubuntu-16.04.3-server-amd64.iso (Distance to Centroid: 0.747)
    ubuntu-16.04.7-desktop-amd64.iso (Distance to Centroid: 0.760)
    ubuntu-16.04.7-server-amd64.iso (Distance to Centroid: 0.760)

Cluster 6 (Top Features: 11 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
Intra-cluster distance: 0.299
    Ubuntu 11.10 Oneiric Ocelot (Distance to Centroid: 0.249)
    ubuntu-11.10-desktop-i386.iso (Distance to Centroid: 0.249)
    ubuntu-11.10-dvd-amd64.iso (Distance to Centroid: 0.249)
    ubuntu-11.04-alternate-i386.iso (Distance to Centroid: 0.373)
    ubuntu-11.04-desktop-amd64.iso (Distance to Centroid: 0.373)

Cluster 7 (Top Features: 20 (0.666), 4 (0.388), 2 (0.194), 3 (0.160), 0 (0.093)):
Intra-cluster distance: 0.556
    Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Distance to Centroid: 0.359)
    ubuntu-20.04-desktop-amd64.iso (Distance to Centroid: 0.359)
    ubuntu-20.04-live-server-amd64.iso (Distance to Centroid: 0.359)
    ubuntu-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.426)
    ubuntu-20.04.4-live-server-amd64.iso (Distance to Centroid: 0.426)
    ubuntu-mate-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.426)
    Ubuntu Server 20.04.2 LTS (Distance to Centroid: 0.624)
    ubuntu-20.04.2-desktop-amd64.iso (Distance to Centroid: 0.624)
    Ubuntu 20.04.3 (AMD64) (Server) (Distance to Centroid: 0.637)
    ubuntu-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.637)
    ubuntu-mate-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.637)
    ubuntu-20.10-desktop-amd64.iso (Distance to Centroid: 0.757)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Distance to Centroid: 0.758)
    ubuntu-20.04.2.0-desktop-amd64.iso (Distance to Centroid: 0.758)
```

The script:

```python
import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy.sparse import vstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))
first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens.
# The token_pattern r'(?u)\b\d+\b' matches runs of one or more digits (including single digits),
# so words are ignored and only version-like numbers become features.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (tokens) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (tokens that are actually present in the title)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant tokens on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
    # Print sorted TF-IDF values
    print(f'\tOriginal: {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")
# Cluster using K-means (n_clusters is left at the scikit-learn default of 8)
kmeans = KMeans(random_state=42)
kmeans.fit(X)

# Getting cluster centroids
centroids = kmeans.cluster_centers_

# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
intra_cluster_distances = defaultdict(list)

# Grouping titles (and their row indices in X) by their clusters
for i, label in enumerate(labels):
    clusters[label].append(results[i])
    clusters_indices[label].append(i)

# Intra-cluster dispersion: mean Euclidean distance from a cluster's points to its centroid
for cluster, indices in clusters_indices.items():
    if indices:
        points_matrix = vstack([X.getrow(i) for i in indices])
        distances = pairwise_distances(points_matrix, centroids[[cluster]], metric='euclidean')
        intra_cluster_distance = np.mean(distances)
        intra_cluster_distances[cluster] = intra_cluster_distance

# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
    sorted_feature_indices = centroid.argsort()[::-1]
    top_n = 5  # Number of key words
    top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
    cluster_top_features_with_weights[i] = top_features_with_weights

# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)

# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
    top_features_str = ', '.join(
        f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
    )
    intra_cluster_distance = intra_cluster_distances[cluster]
    print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
    print(f'Intra-cluster distance: {intra_cluster_distance:.3f}')

    # Prepare a list to hold titles and their distances
    titles_and_distances = []
    for title in clusters[cluster]:
        # Find the index of the current title
        title_index = results.index(title)
        # Calculate "fit" metric as the distance to the centroid of its cluster
        fit_metric = distances_to_centroids[title_index, cluster]
        # Add title and its distance to the list
        titles_and_distances.append((title, fit_metric))

    # Sort titles within the cluster by their distance to the centroid (ascending order)
    sorted_titles_and_distances = sorted(titles_and_distances, key=lambda x: x[1])

    # Print sorted titles by their distance to centroid
    for title, distance in sorted_titles_and_distances:
        print(f"\t{title} (Distance to Centroid: {distance:.3f})")
```
drew2a commented 7 months ago

Another metric that can be used for filtering clusters is the silhouette coefficient, whose values range from -1 to 1. It reflects both the separation between clusters and the cohesion within them. Because the coefficient is calculated for each sample, it can be used to evaluate not just the overall clustering but also each individual cluster. This makes it possible to identify clusters that are poorly defined or consist largely of outliers, which would otherwise skew the analysis.
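
For reference, the standard definition of the silhouette coefficient for a sample $i$ is given below, where $a(i)$ is the mean distance from $i$ to the other members of its own cluster and $b(i)$ is the smallest mean distance from $i$ to the members of any other cluster (the symbols are introduced here for illustration and do not appear in the script):

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values near 1 mean the sample sits well inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it would fit a neighboring cluster better.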

To implement this, we used the `silhouette_samples` function from scikit-learn, which computes the silhouette coefficient for each sample, showing how well the sample fits its assigned cluster compared to the nearest neighboring cluster. Averaging these scores per cluster gives an average silhouette score for each cluster. This average serves as a proxy for cluster quality: higher scores indicate tighter, more distinct clusters, and lower scores indicate clusters with overlapping or diffuse boundaries.

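This approach allowed us to evaluate each cluster systematically. Clusters with low average silhouette scores were flagged as candidates for exclusion. In the results below, the noisy cluster (Cluster 1) has the lowest average silhouette score, 0.056, while the other clusters score between 0.217 and 0.712.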
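
The per-cluster aggregation and flagging step can be sketched as follows. To keep the example dependency-free, it re-implements the silhouette definition in plain Python instead of calling scikit-learn's `silhouette_samples`. The helper names (`silhouette_per_sample`, `average_by_cluster`), the 1-D toy points, and the `MIN_AVERAGE` cut-off are assumptions made for illustration, and at least two clusters are assumed.

```python
import math
from collections import defaultdict


def silhouette_per_sample(points, labels):
    """Silhouette coefficient of each sample (Euclidean distance, >= 2 clusters)."""
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_label, other in members.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


def average_by_cluster(scores, labels):
    """Return {cluster: mean silhouette score of its samples}."""
    grouped = defaultdict(list)
    for score, label in zip(scores, labels):
        grouped[label].append(score)
    return {label: sum(s) / len(s) for label, s in grouped.items()}


# Hypothetical toy data: cluster 0 is compact, cluster 1 is stretched out.
points = [(0.0,), (1.0,), (4.0,), (9.0,)]
labels = [0, 0, 1, 1]

scores = silhouette_per_sample(points, labels)
averages = average_by_cluster(scores, labels)

# Illustrative cut-off (an assumption): clusters below it are flagged.
MIN_AVERAGE = 0.2
flagged = sorted(c for c, avg in averages.items() if avg < MIN_AVERAGE)
print(flagged)  # [1]
```

The sample at 4.0 gets a negative score because it is closer on average to cluster 0 than to its own cluster, which drags the average of cluster 1 below the cut-off.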

```
Cluster 0 (Top Features: 14 (0.773), 4 (0.295), 10 (0.148), 6 (0.123), 5 (0.060)):
Intra-cluster distance: 0.494 (the less the better)
Average Silhouette Score = 0.429 (the higher the better)
    ubuntu-14.04-desktop-amd64.iso (Distance to Centroid: 0.279, Silhouette Score: 0.601)
    ubuntu-14.04-desktop-i386.iso (Distance to Centroid: 0.279, Silhouette Score: 0.601)
    ubuntu-14.04-server-amd64.ova (Distance to Centroid: 0.279, Silhouette Score: 0.601)
    ubuntu-14.04-server-i386.iso (Distance to Centroid: 0.279, Silhouette Score: 0.601)
    ubuntu-14.04.4-desktop-amd64.iso (Distance to Centroid: 0.442, Silhouette Score: 0.451)
    ubuntu-14.10-desktop-amd64.iso (Distance to Centroid: 0.554, Silhouette Score: 0.442)
    ubuntu-14.10-desktop-i386.iso (Distance to Centroid: 0.554, Silhouette Score: 0.442)
    ubuntu-14.10-server-amd64.iso (Distance to Centroid: 0.554, Silhouette Score: 0.442)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Distance to Centroid: 0.655, Silhouette Score: 0.343)
    ubuntu-14.04.6-desktop-i386.iso (Distance to Centroid: 0.655, Silhouette Score: 0.343)
    ubuntu-14.04.1-server-amd64.iso (Distance to Centroid: 0.685, Silhouette Score: 0.056)
    ubuntu-14.04.5-server-amd64.iso (Distance to Centroid: 0.708, Silhouette Score: 0.224)

Cluster 1 (Top Features: 4 (0.174), 10 (0.162), 18 (0.153), 22 (0.119), 19 (0.097)):
Intra-cluster distance: 0.827 (the less the better)
Average Silhouette Score = 0.056 (the higher the better)
    Ubuntu (Distance to Centroid: 0.345, Silhouette Score: 0.133)
    Ubuntu Linux ebook pack (Distance to Centroid: 0.345, Silhouette Score: 0.133)
    Ubuntu Linux основы администрирования (Distance to Centroid: 0.345, Silhouette Score: 0.133)
    Ubuntu Netbook Remix (Distance to Centroid: 0.345, Silhouette Score: 0.133)
    Ubuntu reducido (Distance to Centroid: 0.345, Silhouette Score: 0.133)
    Ubuntu-Book_RU.djvu (Distance to Centroid: 0.345, Silhouette Score: 0.133)
    ubuntu (Distance to Centroid: 0.345, Silhouette Score: 0.133)
    ubuntu-18.04.4-desktop-amd64.iso (Distance to Centroid: 0.812, Silhouette Score: 0.050)
    Ubuntu 10.04 Netbook (Distance to Centroid: 0.812, Silhouette Score: -0.024)
    ubuntu-18.10-desktop-amd64.iso (Distance to Centroid: 0.826, Silhouette Score: 0.089)
    ubuntu-18.10-server-amd64.iso (Distance to Centroid: 0.826, Silhouette Score: 0.089)
    Ubuntu-18.04 (Distance to Centroid: 0.835, Silhouette Score: 0.100)
    ubuntu-18.04-desktop-amd64.iso (Distance to Centroid: 0.835, Silhouette Score: 0.100)
    ubuntu-18.04-live-server-amd64.iso (Distance to Centroid: 0.835, Silhouette Score: 0.100)
    ubuntu-22.10-desktop-amd64.iso (Distance to Centroid: 0.861, Silhouette Score: 0.072)
    ubuntu-unity-22.10-desktop-amd64.iso (Distance to Centroid: 0.861, Silhouette Score: 0.072)
    ubuntu-18.04.3-live-server-amd64.iso (Distance to Centroid: 0.870, Silhouette Score: 0.016)
    ubuntu-22.04-desktop-amd64.iso (Distance to Centroid: 0.874, Silhouette Score: 0.062)
    ubuntu-22.04-live-server-amd64.iso (Distance to Centroid: 0.874, Silhouette Score: 0.062)
    ubuntu-19.10-desktop-amd64.iso (Distance to Centroid: 0.887, Silhouette Score: 0.075)
    ubuntu-19.10-live-server-amd64.iso (Distance to Centroid: 0.887, Silhouette Score: 0.075)
    ubuntu-mate-19.10-desktop-amd64.iso (Distance to Centroid: 0.887, Silhouette Score: 0.075)
    ubuntu-10.10-xenon-beta5 (Distance to Centroid: 0.892, Silhouette Score: -0.054)
    ubuntu-18.04.5-live-server-amd64.iso (Distance to Centroid: 0.894, Silhouette Score: -0.041)
    ubuntu-22.04.3-live-server-amd64.iso (Distance to Centroid: 0.894, Silhouette Score: 0.014)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Distance to Centroid: 0.894, Silhouette Score: 0.014)
    ubuntu-18.04.6-desktop-amd64.iso (Distance to Centroid: 0.897, Silhouette Score: 0.025)
    ubuntu-19.04-desktop-amd64.iso (Distance to Centroid: 0.903, Silhouette Score: 0.068)
    ubuntu-19.04-server-amd64.iso (Distance to Centroid: 0.903, Silhouette Score: 0.068)
    ubuntu-22.04.2-desktop-amd64.iso (Distance to Centroid: 0.922, Silhouette Score: -0.031)
    ubuntu-15.04-desktop-amd64.iso (Distance to Centroid: 0.944, Silhouette Score: 0.054)
    ubuntu-15.04-desktop-i386.iso (Distance to Centroid: 0.944, Silhouette Score: 0.054)
    ubuntu-15.04-server-amd64.iso (Distance to Centroid: 0.944, Silhouette Score: 0.054)
    Ubuntu 9.10 (Distance to Centroid: 0.948, Silhouette Score: 0.031)
    Ubuntu 9.10 Пользовательская сборка (Distance to Centroid: 0.948, Silhouette Score: 0.031)
    ubuntu-17.10-desktop-amd64.iso (Distance to Centroid: 0.950, Silhouette Score: 0.026)
    ubuntu-23.10-beta-desktop-amd64.iso (Distance to Centroid: 0.950, Silhouette Score: 0.026)
    ubuntu-17.04-server-amd64.iso (Distance to Centroid: 0.968, Silhouette Score: 0.026)
    ubuntu-23.04-live-server-amd64.iso (Distance to Centroid: 0.968, Silhouette Score: 0.026)
    Ubuntu Facile 04 2014.pdf (Distance to Centroid: 0.987, Silhouette Score: -0.025)
    Ubuntu Satanic Edition 666.4 (Distance to Centroid: 0.991, Silhouette Score: 0.015)
    ubuntu-13.04-desktop-i386.iso (Distance to Centroid: 0.991, Silhouette Score: 0.015)
    Ubuntu Facile - Aprile 2015.pdf (Distance to Centroid: 1.016, Silhouette Score: 0.068)
    Ubuntu Facile Marzo 2015.pdf (Distance to Centroid: 1.016, Silhouette Score: 0.068)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Distance to Centroid: 1.037, Silhouette Score: 0.046)
    Ubuntu Unleashed 2019 Edition (Distance to Centroid: 1.037, Silhouette Score: 0.046)

Cluster 2 (Top Features: 1 (0.694), 4 (0.200), 20 (0.153), 2014 (0.101), 9 (0.098)):
Intra-cluster distance: 0.627 (the less the better)
Average Silhouette Score = 0.217 (the higher the better)
    ubuntu-ultimate-1.4-dvd (Distance to Centroid: 0.394, Silhouette Score: 0.377)
    Ubuntu 20.04.1 Desktop.iso (Distance to Centroid: 0.518, Silhouette Score: 0.161)
    ubuntu-20.04.1-desktop-amd64.iso (Distance to Centroid: 0.518, Silhouette Score: 0.161)
    ubuntu-18.04.1-desktop-amd64.iso (Distance to Centroid: 0.639, Silhouette Score: 0.214)
    ubuntu-22.04.1-desktop-amd64.iso (Distance to Centroid: 0.656, Silhouette Score: 0.216)
    Ubuntu Ultimate Edition 1.9 (Distance to Centroid: 0.759, Silhouette Score: 0.199)
    [Ubuntu] Anonymous OS 0.1 (Distance to Centroid: 0.759, Silhouette Score: 0.216)
    Ubuntu Facile 01 2014.pdf (Distance to Centroid: 0.776, Silhouette Score: 0.195)

Cluster 3 (Top Features: 21 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
Intra-cluster distance: 0.299 (the less the better)
Average Silhouette Score = 0.712 (the higher the better)
    ubuntu-21.10-beta-pack (Distance to Centroid: 0.249, Silhouette Score: 0.758)
    ubuntu-21.10-desktop-amd64.iso (Distance to Centroid: 0.249, Silhouette Score: 0.758)
    ubuntu-mate-21.10-desktop-amd64.iso (Distance to Centroid: 0.249, Silhouette Score: 0.758)
    ubuntu-21.04-desktop-amd64.iso (Distance to Centroid: 0.373, Silhouette Score: 0.643)
    ubuntu-21.04-live-server-amd64.iso (Distance to Centroid: 0.373, Silhouette Score: 0.643)

Cluster 4 (Top Features: 12 (0.776), 5 (0.291), 4 (0.261), 10 (0.153), 9 (0.000)):
Intra-cluster distance: 0.466 (the less the better)
Average Silhouette Score = 0.506 (the higher the better)
    ubuntu-12.04-server-i386.iso (Distance to Centroid: 0.380, Silhouette Score: 0.513)
    ubuntu-12.04.5-desktop-amd64.iso (Distance to Centroid: 0.429, Silhouette Score: 0.573)
    ubuntu-12.04.5-desktop-i386.iso (Distance to Centroid: 0.429, Silhouette Score: 0.573)
    ubuntu-12.04.5-dvd-i386.iso (Distance to Centroid: 0.429, Silhouette Score: 0.573)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Distance to Centroid: 0.493, Silhouette Score: 0.419)
    Ubuntu 12.10 Desktop (i386) (Distance to Centroid: 0.552, Silhouette Score: 0.447)
    ubuntu-12.10-desktop-i386.iso (Distance to Centroid: 0.552, Silhouette Score: 0.447)

Cluster 5 (Top Features: 16 (0.688), 4 (0.226), 10 (0.174), 6 (0.160), 7 (0.116)):
Intra-cluster distance: 0.616 (the less the better)
Average Silhouette Score = 0.325 (the higher the better)
    ubuntu-16.04-desktop-i386.iso (Distance to Centroid: 0.416, Silhouette Score: 0.419)
    ubuntu-pack-16.04-unity (Distance to Centroid: 0.416, Silhouette Score: 0.419)
    Ubuntu 16.10 (Distance to Centroid: 0.552, Silhouette Score: 0.406)
    ubuntu-16.10-desktop-amd64.iso (Distance to Centroid: 0.552, Silhouette Score: 0.406)
    ubuntu-16.10-desktop-i386.iso (Distance to Centroid: 0.552, Silhouette Score: 0.406)
    ubuntu-16.10-server-arm64.iso (Distance to Centroid: 0.552, Silhouette Score: 0.406)
    ubuntu-16.04.6-desktop-i386.iso (Distance to Centroid: 0.645, Silhouette Score: 0.327)
    ubuntu-16.04.6-server-amd64.iso (Distance to Centroid: 0.645, Silhouette Score: 0.327)
    ubuntu-16.04.6-server-i386.iso (Distance to Centroid: 0.645, Silhouette Score: 0.327)
    Ubuntu-16.04.5 (Distance to Centroid: 0.694, Silhouette Score: 0.206)
    ubuntu-16.04.5-desktop-amd64.iso (Distance to Centroid: 0.694, Silhouette Score: 0.206)
    ubuntu-16.04.3-server-amd64.iso (Distance to Centroid: 0.747, Silhouette Score: 0.176)
    ubuntu-16.04.7-desktop-amd64.iso (Distance to Centroid: 0.760, Silhouette Score: 0.257)
    ubuntu-16.04.7-server-amd64.iso (Distance to Centroid: 0.760, Silhouette Score: 0.257)

Cluster 6 (Top Features: 11 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
Intra-cluster distance: 0.299 (the less the better)
Average Silhouette Score = 0.712 (the higher the better)
    Ubuntu 11.10 Oneiric Ocelot (Distance to Centroid: 0.249, Silhouette Score: 0.758)
    ubuntu-11.10-desktop-i386.iso (Distance to Centroid: 0.249, Silhouette Score: 0.758)
    ubuntu-11.10-dvd-amd64.iso (Distance to Centroid: 0.249, Silhouette Score: 0.758)
    ubuntu-11.04-alternate-i386.iso (Distance to Centroid: 0.373, Silhouette Score: 0.643)
    ubuntu-11.04-desktop-amd64.iso (Distance to Centroid: 0.373, Silhouette Score: 0.643)

Cluster 7 (Top Features: 20 (0.666), 4 (0.388), 2 (0.194), 3 (0.160), 0 (0.093)):
Intra-cluster distance: 0.556 (the less the better)
Average Silhouette Score = 0.390 (the higher the better)
    Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Distance to Centroid: 0.359, Silhouette Score: 0.498)
    ubuntu-20.04-desktop-amd64.iso (Distance to Centroid: 0.359, Silhouette Score: 0.498)
    ubuntu-20.04-live-server-amd64.iso (Distance to Centroid: 0.359, Silhouette Score: 0.498)
    ubuntu-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.426, Silhouette Score: 0.466)
    ubuntu-20.04.4-live-server-amd64.iso (Distance to Centroid: 0.426, Silhouette Score: 0.466)
    ubuntu-mate-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.426, Silhouette Score: 0.466)
    Ubuntu Server 20.04.2 LTS (Distance to Centroid: 0.624, Silhouette Score: 0.356)
    ubuntu-20.04.2-desktop-amd64.iso (Distance to Centroid: 0.624, Silhouette Score: 0.356)
    Ubuntu 20.04.3 (AMD64) (Server) (Distance to Centroid: 0.637, Silhouette Score: 0.367)
    ubuntu-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.637, Silhouette Score: 0.367)
    ubuntu-mate-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.637, Silhouette Score: 0.367)
    ubuntu-20.10-desktop-amd64.iso (Distance to Centroid: 0.757, Silhouette Score: 0.227)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Distance to Centroid: 0.758, Silhouette Score: 0.266)
    ubuntu-20.04.2.0-desktop-amd64.iso (Distance to Centroid: 0.758, Silhouette Score: 0.266)
```

The script:

```python
import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy.sparse import vstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances, silhouette_samples

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))
first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens.
# The token_pattern r'(?u)\b\d+\b' matches runs of one or more digits (including single digits),
# so words are ignored and only version-like numbers become features.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (tokens) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (tokens that are actually present in the title)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant tokens on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
    # Print sorted TF-IDF values
    print(f'\tOriginal: {original}')
print(f'\tPreprocessed: {preprocessed}') print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n') print("Clustering...") # Cluster using K-means kmeans = KMeans(random_state=42) kmeans.fit(X) # Getting cluster centroids centroids = kmeans.cluster_centers_ # Output clustering results by cluster, including top features labels = kmeans.labels_ clusters = defaultdict(list) clusters_indices = defaultdict(list) intra_cluster_distances = defaultdict(list) silhouette_vals = silhouette_samples(X, labels, metric='euclidean') cluster_silhouette_scores = defaultdict(list) # Grouping titles by their clusters for i, label in enumerate(labels): clusters[label].append(results[i]) clusters_indices[label].append(i) cluster_silhouette_scores[label].append(silhouette_vals[i]) for cluster, indices in clusters_indices.items(): if indices: points_matrix = vstack([X.getrow(i) for i in indices]) distances = pairwise_distances(points_matrix, centroids[[cluster]], metric='euclidean') intra_cluster_distance = np.mean(distances) intra_cluster_distances[cluster] = intra_cluster_distance # Identifying key words for each cluster and storing them in a dictionary feature_names = vectorizer.get_feature_names_out() cluster_top_features_with_weights = {} for i, centroid in enumerate(centroids): sorted_feature_indices = centroid.argsort()[::-1] top_n = 5 # Number of key words top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]] cluster_top_features_with_weights[i] = top_features_with_weights # Calculate distances of each point to cluster centroids distances_to_centroids = kmeans.transform(X) # Printing clustering results by cluster, including top features for each cluster print("\nClustering results by cluster, including top features and their weights:") for cluster in sorted(clusters.keys()): top_features_str = ', '.join( f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster] ) intra_cluster_distance = 
intra_cluster_distances[cluster] print(f"\nCluster {cluster} (Top Features: {top_features_str}):") average_score = np.mean(cluster_silhouette_scores[cluster]) print(f'Intra-cluster distance: {intra_cluster_distance:.3f} (the less the better)') print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)") # Prepare a list to hold titles, their distances, and silhouette scores titles_distances_scores = [] for i, title_index in enumerate(clusters_indices[cluster]): title = results[title_index] fit_metric = distances_to_centroids[title_index, cluster] silhouette_score = silhouette_vals[title_index] titles_distances_scores.append((title, fit_metric, silhouette_score)) # Sort titles within the cluster by their distance to the centroid (ascending order) sorted_titles_distances_scores = sorted(titles_distances_scores, key=lambda x: x[1]) # Print sorted titles by their distance to centroid and include silhouette score for title, distance, silhouette_score in sorted_titles_distances_scores: print(f"\t{title} (Distance to Centroid: {distance:.3f}, Silhouette Score: {silhouette_score:.3f})") ```
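To see why the cluster features in the output are bare numbers such as `20` and `4`, it can help to trace a single title through the regex steps of `preprocess_text` and the digit-only token pattern. The sketch below is only an illustration: `normalize_title` and `digit_tokens` are hypothetical helpers that are not part of the script, and the stopword and lemmatization steps are left out so that it runs without NLTK downloads.

```python
import re
import string


def normalize_title(title):
    """Regex-only subset of preprocess_text (hypothetical helper, no NLTK steps)."""
    # Remove text inside parentheses and brackets
    title = re.sub(r'\[.*?\]|\(.*?\)', '', title)
    # Convert to lowercase
    title = title.lower()
    # Replace punctuation (including hyphens and dots) with spaces
    title = re.sub('[' + re.escape(string.punctuation) + ']', ' ', title)
    # Remove leading zeros, so "04" becomes "4"
    title = re.sub(r'\b0+(\d+)\b', r'\1', title)
    # Collapse repeated whitespace
    return ' '.join(title.split())


def digit_tokens(text):
    """Tokens that TfidfVectorizer(token_pattern=r'(?u)\\b\\d+\\b') would keep."""
    return re.findall(r'(?u)\b\d+\b', text)


for raw in ['ubuntu-20.04.4-desktop-amd64.iso', 'Ubuntu 20.04.2.0 Desktop (64-bit)']:
    normalized = normalize_title(raw)
    print(f'{raw!r} -> {normalized!r} -> {digit_tokens(normalized)}')
```

Note that `amd64` contributes no token: there is no word boundary between `d` and `6`, so the pattern only picks up standalone numbers such as the version components.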
drew2a commented 7 months ago

The best part of any job is the visualization:

3d_all_clusters

3d_each_cluster

2d_each_cluster

The script:

```python
# This script performs cluster analysis using the K-Means algorithm, applied to a multi-dimensional dataset.
# It includes steps for fitting the K-Means model, calculating and interpreting key metrics such as intra-cluster
# distances and silhouette scores, and estimating the dimensional characteristics of each cluster. The aim is to
# evaluate the cohesion and separation of clusters, identify the top features defining each cluster, and approximate
# the "size" or spread of clusters through a novel approach based on calculating the volume of an orthogonal figure
# formed by the furthest points in each cluster. The script provides a comprehensive overview of the clustering results,
# offering insights into the data structure and the effectiveness of the clustering.
import math
import re
import string
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import euclidean_distances, pairwise_distances, silhouette_samples

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))

first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps digit-only tokens, including single digits.
# The token_pattern r'(?u)\b\d+\b' matches runs of one or more digits, so only the numeric parts of a title
# (such as version numbers) become features.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])

    # Print sorted TF-IDF values
    print(f'\tOriginal: {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")
# Cluster using K-means

# Initialize KMeans clustering with a fixed random state to ensure reproducibility
kmeans = KMeans(random_state=42)
# Fit the model to the data X to perform the clustering
kmeans.fit(X)

# Retrieve the centroids of the clusters formed by KMeans
centroids = kmeans.cluster_centers_

# Retrieve the labels assigned to each data point in X, indicating their cluster membership
labels = kmeans.labels_

# Initialize dictionaries to store clustering results and distances
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
intra_cluster_distances = defaultdict(list)

# Calculate the silhouette scores for each data point in X based on their cluster assignment
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')
# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)

# Loop through each data point and its cluster label
for i, label in enumerate(labels):
    # Group data points by their cluster label, storing both the original titles and their indices
    clusters[label].append(results[i])
    clusters_indices[label].append(i)
    # Accumulate silhouette scores by cluster
    cluster_silhouette_scores[label].append(silhouette_vals[i])

# For each cluster, calculate the average distance of points to the cluster's centroid
for cluster, indices in clusters_indices.items():
    if indices:
        # Convert the subset of X corresponding to the current cluster to a dense format
        points_matrix = X[indices, :].toarray()
        # Calculate pairwise Euclidean distances between points in the cluster and the cluster's centroid
        distances = pairwise_distances(points_matrix, centroids[[cluster]], metric='euclidean')
        # Calculate and store the average intra-cluster distance as a measure of cluster cohesion
        intra_cluster_distance = np.mean(distances)
        intra_cluster_distances[cluster] = intra_cluster_distance

# Identify the top features (words) for each cluster based on the centroids' coordinates
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
    # Sort features in descending order of importance for the cluster
    sorted_feature_indices = centroid.argsort()[::-1]
    # Select the top N features for the cluster
    top_n = 5
    top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
    # Store the top features and their weights for each cluster
    cluster_top_features_with_weights[i] = top_features_with_weights

# Calculate distances from each data point to its cluster centroid
distances_to_centroids = kmeans.transform(X)

# Print summary information for each cluster, including its top features, average intra-cluster distance,
# and silhouette score
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
    # Join the top features with their weights into a string for printing
    top_features_str = ', '.join(
        f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
    )
    # Retrieve the average intra-cluster distance for the current cluster
    intra_cluster_distance = intra_cluster_distances[cluster]
    print(f"\nCluster {cluster} (Top Features: {top_features_str}):")

    # Calculate and print the average silhouette score for the cluster
    average_score = np.mean(cluster_silhouette_scores[cluster])
    print(f'Intra-cluster distance: {intra_cluster_distance:.3f} (the less the better)')
    print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")

    # Initialize a list to hold titles, their distances from the centroid, and silhouette scores
    titles_distances_scores = []
    for i, title_index in enumerate(clusters_indices[cluster]):
        # Retrieve the title and its metrics
        title = results[title_index]
        fit_metric = distances_to_centroids[title_index, cluster]  # The distance of the title from its cluster centroid
        silhouette_score = silhouette_vals[title_index]  # The silhouette score of the title
        # Append the title and its metrics to the list
        titles_distances_scores.append((title, fit_metric, silhouette_score))

    # Sort titles within the cluster by their distance to the centroid in ascending order
    sorted_titles_distances_scores = sorted(titles_distances_scores, key=lambda x: x[1])

    # Print each title with its distance to the centroid and silhouette score
    for title, distance, silhouette_score in sorted_titles_distances_scores:
        print(f"\t{title} (Distance to Centroid: {distance:.3f}, Silhouette Score: {silhouette_score:.3f})")

# PLOT 1

# Assuming X, centroids, clusters_indices are already defined
fig, axs = plt.subplots(len(clusters.keys()), figsize=(10, 5 * len(clusters.keys())))

# Convert axs to a list if there's only one subplot to standardize the iteration
if len(clusters.keys()) == 1:
    axs = [axs]

# Setting the universal scale for X-axis from 0 to 1
x_min, x_max = 0, 1

for idx, cluster in enumerate(sorted(clusters.keys())):
    indices = clusters_indices[cluster]
    if indices:
        # Convert cluster points to dense format if necessary
        cluster_points = X[indices, :].toarray()
        # Calculate distances from each point in the cluster to its centroid
        distances = euclidean_distances(cluster_points, centroids[[cluster]]).flatten()

        # Plot the histogram of distances with a uniform X-axis scale
        axs[idx].hist(distances, bins=20, alpha=0.7, label=f'Cluster {cluster}', range=(x_min, x_max))
        axs[idx].set_title(f'Distance to Centroid Distribution for Cluster {cluster}')
        axs[idx].set_xlabel('Distance to Centroid')
        axs[idx].set_ylabel('Number of Points')
        axs[idx].legend()

plt.tight_layout()

# PLOT 2
x_limits = (-0.6, 0.6)
y_limits = (-0.6, 0.6)
z_limits = (-0.6, 0.6)

# Convert data to dense format and apply PCA to reduce dimensionality to 3
pca = PCA(n_components=3)
X_pca_3d = pca.fit_transform(X.toarray())

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.set_xlim(x_limits)
ax.set_ylim(y_limits)
ax.set_zlim(z_limits)

# Get the color map for visualizing different clusters
# (plt.get_cmap is used instead of plt.cm.get_cmap, which newer Matplotlib releases no longer provide)
colors = plt.get_cmap('tab10', len(clusters.keys()))

for cluster in sorted(clusters.keys()):
    cluster_indices = clusters_indices[cluster]
    cluster_points = X_pca_3d[cluster_indices, :]
    clr = colors(cluster)
    ax.scatter(cluster_points[:, 0], cluster_points[:, 1], cluster_points[:, 2], color=clr,
               label=f'Cluster {cluster}', alpha=0.6)

# Apply PCA to centroids to get their coordinates in 3D space
centroids_pca_3d = pca.transform(centroids)

# Draw centroids with corresponding colors
for i, centroid in enumerate(centroids_pca_3d):
    ax.scatter(centroid[0], centroid[1], centroid[2], color=colors(i), marker='x', s=100, edgecolor='k', linewidths=2)

ax.set_title('3D Cluster Visualization with PCA')
ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_zlabel('PCA Component 3')
ax.view_init(elev=20, azim=-35)

plt.legend()

# PLOT 3

# Determining the number of clusters
n_clusters = len(clusters.keys())

# Calculating the optimal number of rows and columns for subplots
rows = math.ceil(math.sqrt(n_clusters))
cols = math.ceil(n_clusters / rows)

# Creating a figure for subplots
fig = plt.figure(figsize=(cols * 6, rows * 5))

# Drawing each cluster in its own subplot
for idx, cluster in enumerate(sorted(clusters.keys())):
    ax = fig.add_subplot(rows, cols, idx + 1, projection='3d')
    cluster_indices = clusters_indices[cluster]
    cluster_points = X_pca_3d[cluster_indices, :]
    centroids_pca_3d = pca.transform([centroids[cluster]])
    ax.set_xlim(x_limits)
    ax.set_ylim(y_limits)
    ax.set_zlim(z_limits)

    # Visualizing cluster points
    ax.scatter(cluster_points[:, 0], cluster_points[:, 1], cluster_points[:, 2], label=f'Cluster {cluster}', alpha=0.6)

    # Visualizing the centroid
    ax.scatter(centroids_pca_3d[:, 0], centroids_pca_3d[:, 1], centroids_pca_3d[:, 2], color='black', marker='x', s=100,
               label='Centroid')

    ax.set_title(f'Cluster {cluster}')
    ax.set_xlabel('PCA Component 1')
    ax.set_ylabel('PCA Component 2')
    ax.set_zlabel('PCA Component 3')
    ax.view_init(elev=20, azim=-35)

plt.tight_layout()
plt.show()
```
drew2a commented 6 months ago

Instead of integrating the current algorithm into Tribler, I decided to focus on its improvement and dedicate half of the current week to this task.

I haven't yet focused on measuring the algorithm's performance because I want to first ensure that the clustering results are as accurate as possible. There are two main areas I'm currently working on to improve the quality of the clustering:

  1. Figuring out how to determine the optimal number of clusters, which is crucial for accurately grouping the data.
  2. Incorporating the position of words within the text into the algorithm, which I believe will greatly enhance the quality of the results.
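For the first item, one common baseline (not necessarily the approach taken in this thread) is to fit KMeans for several candidate values of k and keep the one with the highest mean silhouette score. The sketch below assumes toy 2-D points rather than the real TF-IDF matrix; `best_k_by_silhouette` and `X_toy` are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values):
    """Return the k with the highest mean silhouette score, plus all scores."""
    scores = {}
    for k in k_values:
        # n_init and random_state are fixed so the sweep is reproducible
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


# Toy data: three tight, well-separated groups of four points each
offsets = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([center + offsets for center in centers])

best_k, scores = best_k_by_silhouette(X_toy, range(2, 6))
print(f'Best k: {best_k}')
for k, score in scores.items():
    print(f'  k={k}: mean silhouette = {score:.3f}')
```

A sweep like this needs one full KMeans fit per candidate k, and the candidate range still has to be chosen by hand, which is part of why a density-based method that infers the number of clusters is attractive.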

Once I'm confident that the algorithm is producing the best possible clustering outcomes, I'll turn my attention to optimizing its performance.

So, the next iteration of the algorithm contains two modifications:

Transition from KMeans to HDBSCAN for Clustering

Initially, our algorithm employed KMeans for clustering, which requires specifying the number of clusters in advance. This requirement posed a significant limitation, as determining the optimal number of clusters is not straightforward and can vary significantly depending on the dataset's nature and size. To address this challenge, we transitioned to HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). Unlike KMeans, HDBSCAN does not require the number of clusters to be specified up front. Instead, it identifies clusters based on data density, offering several advantages:

- Adaptability: HDBSCAN adapts to the inherent structure of the data, leading to more meaningful and natural groupings.
- Noise Handling: It identifies and isolates noise, improving the overall quality of the clusters. Unlike KMeans, where every point is assigned to a cluster regardless of how well it fits, HDBSCAN can leave points unassigned (labeled as -1).
- Variable Cluster Sizes: The algorithm accommodates clusters of varying densities and sizes, aligning more closely with real-world data distributions.

This shift aims to achieve more accurate and representative clustering by leveraging the data's natural structure, potentially enhancing the user experience through more precise content categorization.

Incorporating N-Grams into TFIDF Vectorization

The original vectorization approach using TF-IDF (Term Frequency-Inverse Document Frequency) focused on individual terms without considering the order or proximity of words. To capture the sequence in which terms appear, we integrated n-grams into our TF-IDF vectorization: `TfidfVectorizer(token_pattern=r'(?u)\b\d+\b', ngram_range=(1, 2))`. N-grams are contiguous sequences of n items from a given sample of text or speech; with `ngram_range=(1, 2)`, the features are single tokens plus pairs of adjacent tokens. By incorporating n-grams:

- Contextual Awareness: The algorithm can now recognize and give weight to term proximity and order, capturing more nuanced meanings.
- Feature Enrichment: Including n-grams expands the feature set with phrase-level information, which is particularly beneficial for understanding the context and thematic content.
- Quality Improvement: This adjustment is anticipated to significantly enhance the quality of clustering by providing a richer, more contextually informed feature set for analysis.

``` Clustering results by cluster: Cluster 14 (features: ): Average Silhouette Score = 1.000 (the higher the better) Ubuntu (Silhouette Score: 1.000) Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 1.000) Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 1.000) Ubuntu Linux ebook pack (Silhouette Score: 1.000) Ubuntu Linux основы администрирования (Silhouette Score: 1.000) Ubuntu Netbook Remix (Silhouette Score: 1.000) Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: 1.000) Ubuntu Unleashed 2019 Edition (Silhouette Score: 1.000) Ubuntu reducido (Silhouette Score: 1.000) Ubuntu-Book_RU.djvu (Silhouette Score: 1.000) ubuntu (Silhouette Score: 1.000) Cluster 31 (features: 20 4: 3.000): Average Silhouette Score = 1.000 (the higher the better) Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: 1.000) ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-20.04-live-server-amd64.iso (Silhouette Score: 1.000) Cluster 13 (features: 11 10: 3.000): Average Silhouette Score = 1.000 (the higher the better) Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: 1.000) ubuntu-11.10-desktop-i386.iso (Silhouette Score: 1.000) ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 1.000) Cluster 12 (features: 12 10: 2.000): Average Silhouette Score = 1.000 (the higher the better) Ubuntu 12.10 Desktop (i386) (Silhouette Score: 1.000) ubuntu-12.10-desktop-i386.iso (Silhouette Score: 1.000) Cluster 11 (features: 16 10: 4.000): Average Silhouette Score = 1.000 (the higher the better) Ubuntu 16.10 (Silhouette Score: 1.000) ubuntu-16.10-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-16.10-desktop-i386.iso (Silhouette Score: 1.000) ubuntu-16.10-server-arm64.iso (Silhouette Score: 1.000) Cluster 18 (features: 4 1: 1.600, 20 4: 1.200): Average Silhouette Score = 1.000 (the higher the better) Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 1.000) ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 26 (features: 2 0: 1.371, 4 2: 1.165, 20 4: 0.874): Average 
Silhouette Score = 1.000 (the higher the better) Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 1.000) ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 32 (features: 4 3: 2.332, 20 4: 1.888): Average Silhouette Score = 1.000 (the higher the better) Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: 1.000) ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-mate-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 10 (features: 9 10: 2.000): Average Silhouette Score = 1.000 (the higher the better) Ubuntu 9.10 (Silhouette Score: 1.000) Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 1.000) Cluster 25 (features: 4 2: 1.600, 20 4: 1.200): Average Silhouette Score = 1.000 (the higher the better) Ubuntu Server 20.04.2 LTS (Silhouette Score: 1.000) ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 23 (features: 4 5: 1.477, 16 4: 1.348): Average Silhouette Score = 1.000 (the higher the better) Ubuntu-16.04.5 (Silhouette Score: 1.000) ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 27 (features: 18 4: 3.000): Average Silhouette Score = 1.000 (the higher the better) Ubuntu-18.04 (Silhouette Score: 1.000) ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 1.000) Cluster 2 (features: 11 4: 2.000): Average Silhouette Score = 1.000 (the higher the better) ubuntu-11.04-alternate-i386.iso (Silhouette Score: 1.000) ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 1 (features: 14 10: 3.000): Average Silhouette Score = 1.000 (the higher the better) ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-14.10-desktop-i386.iso (Silhouette Score: 1.000) ubuntu-14.10-server-amd64.iso (Silhouette Score: 1.000) Cluster 22 (features: 16 4: 2.000): Average Silhouette Score = 1.000 (the higher the better) ubuntu-16.04-desktop-i386.iso (Silhouette Score: 1.000) ubuntu-pack-16.04-unity (Silhouette Score: 1.000) Cluster 17 
(features: 4 6: 2.252, 16 4: 1.982): Average Silhouette Score = 1.000 (the higher the better) ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000) ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 1.000) ubuntu-16.04.6-server-i386.iso (Silhouette Score: 1.000) Cluster 16 (features: 4 7: 1.624, 16 4: 1.167): Average Silhouette Score = 1.000 (the higher the better) ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 1.000) Cluster 3 (features: 18 10: 2.000): Average Silhouette Score = 1.000 (the higher the better) ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-18.10-server-amd64.iso (Silhouette Score: 1.000) Cluster 6 (features: 19 4: 2.000): Average Silhouette Score = 1.000 (the higher the better) ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-19.04-server-amd64.iso (Silhouette Score: 1.000) Cluster 7 (features: 19 10: 3.000): Average Silhouette Score = 1.000 (the higher the better) ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000) ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 24 (features: 4 4: 2.365, 20 4: 1.846): Average Silhouette Score = 1.000 (the higher the better) ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 1.000) ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 9 (features: 21 10: 3.000): Average Silhouette Score = 1.000 (the higher the better) ubuntu-21.10-beta-pack (Silhouette Score: 1.000) ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 20 (features: 22 4: 2.000): Average Silhouette Score = 1.000 (the higher the better) ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 1.000) ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 1.000) Cluster 21 (features: 22 4: 1.439, 4 3: 1.389): Average 
Silhouette Score = 1.000 (the higher the better) ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 1.000) ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 1.000) Cluster 29 (features: 14 4: 4.684, 4 5: 0.729): Average Silhouette Score = 0.639 (the higher the better) ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.755) ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.755) ubuntu-14.04-server-amd64.ova (Silhouette Score: 0.755) ubuntu-14.04-server-i386.iso (Silhouette Score: 0.755) ubuntu-14.04.5-server-amd64.iso (Silhouette Score: 0.173) Cluster 15 (features: 12 4: 3.922, 4 5: 2.039, 4 4: 0.693): Average Silhouette Score = 0.405 (the higher the better) ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 0.574) ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 0.574) ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 0.574) ubuntu-12.04-server-i386.iso (Silhouette Score: 0.266) ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.040) Cluster 0 (features: 15 4: 3.000, 17 4: 1.000): Average Silhouette Score = 0.323 (the higher the better) ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 0.529) ubuntu-15.04-desktop-i386.iso (Silhouette Score: 0.529) ubuntu-15.04-server-amd64.iso (Silhouette Score: 0.529) ubuntu-17.04-server-amd64.iso (Silhouette Score: -0.293) Cluster 30 (features: 14 4: 2.014, 4 6: 1.483, 4 4: 0.741): Average Silhouette Score = 0.198 (the higher the better) ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 0.388) ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 0.388) ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: -0.183) Cluster 8 (features: 21 4: 2.000, 20 10: 1.000): Average Silhouette Score = 0.098 (the higher the better) ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 0.293) ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 0.293) ubuntu-20.10-desktop-amd64.iso (Silhouette Score: -0.293) Cluster 4 (features: 22 10: 2.000, 23 4: 1.000): Average Silhouette Score = 0.098 (the higher the better) 
ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 0.293) ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: 0.293) ubuntu-23.04-live-server-amd64.iso (Silhouette Score: -0.293) Cluster 19 (features: 18 4: 1.365, 4 4: 0.731, 4 6: 0.731): Average Silhouette Score = -0.229 (the higher the better) ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: -0.229) ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.229) Cluster 28 (features: 18 4: 1.391, 4 3: 0.719, 4 5: 0.719): Average Silhouette Score = -0.232 (the higher the better) ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.232) ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.232) Cluster 5 (features: 0 1: 1.000, 10 10: 1.000): Average Silhouette Score = -0.293 (the higher the better) [Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.293) ubuntu-10.10-xenon-beta5 (Silhouette Score: -0.293) Cluster -1 (features: 4 1: 2.220, 22 4: 1.386, 1 2014: 1.000, 1 4: 1.000, 1 9: 1.000): Average Silhouette Score = -0.332 (the higher the better) Ubuntu 10.04 Netbook (Silhouette Score: -0.293) Ubuntu Facile 01 2014.pdf (Silhouette Score: -0.293) Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.293) Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.293) Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.293) ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.293) ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.293) ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: -0.293) ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.293) ubuntu-14.04.1-server-amd64.iso (Silhouette Score: -0.349) ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: -0.393) ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: -0.403) ubuntu-16.04.3-server-amd64.iso (Silhouette Score: -0.429) ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: -0.434) ``` The script: ```python # This script performs cluster analysis using the HDBSCAN algorithm, enhanced by N-gram TF-IDF vectorization, applied # to text data. 
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number,
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The script also explores the integration of word position into the clustering process through N-gram vectorization,
# aiming to capture more nuanced relationships between terms.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure
# of the text data and the effectiveness of the modified clustering strategy.
import re
import string
from collections import defaultdict
from enum import Enum, auto
from pathlib import Path

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_samples

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


class Vectorizer(Enum):
    TFIDF = auto()
    TFIDF_NGRAMM = auto()


vectorize_type = Vectorizer.TFIDF_NGRAMM

# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces (escaped so the character class stays valid)
    text = re.sub(r'[' + re.escape(string.punctuation) + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens
# (including single-digit numbers).
# The token_pattern r'(?u)\b\d+\b' matches any standalone run of one or more digits, so only the numbers
# in a title (such as version components) take part in the analysis.
if vectorize_type == Vectorizer.TFIDF:
    vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
    X = vectorizer.fit_transform(preprocessed_results)
elif vectorize_type == Vectorizer.TFIDF_NGRAMM:
    vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b', ngram_range=(1, 2))
    X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]

    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]

    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]

    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)

    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])

    # Print sorted TF-IDF values
    print(f'\tOriginal: {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")

# Initialize and fit the HDBSCAN model
hdbscan = HDBSCAN(min_cluster_size=2)
hdbscan.fit(X)

# Retrieve cluster labels
labels = hdbscan.labels_

# Initialize dictionaries for storing clustering results
clusters = defaultdict(list)
clusters_indices = defaultdict(list)

# Calculate silhouette scores for each data point in X based on their cluster membership
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')

# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)

# Group data points by their cluster label
for i, label in enumerate(labels):
    clusters[label].append(results[i])
    clusters_indices[label].append(i)
    cluster_silhouette_scores[label].append(silhouette_vals[i])

# Initialize a dictionary to store the sum of TF-IDF values for features by cluster
cluster_feature_sums = defaultdict(lambda: np.zeros(X.shape[1]))

# Sum up TF-IDF values for each feature within each cluster
for i, label in enumerate(labels):
    cluster_feature_sums[label] += X[i].toarray()[0]

# Number of top features to select for each cluster
top_n_features = 5
feature_names = vectorizer.get_feature_names_out()

# Dictionary to store the top N features for each cluster
top_features_per_cluster = {}

for cluster, sums in cluster_feature_sums.items():
    # Indices of features with sums greater than 0, sorted by their sum in descending order
    positive_indices = [index for index, value in enumerate(sums) if value > 0]
    top_indices = sorted(positive_indices, key=lambda index: sums[index], reverse=True)[:top_n_features]

    # Extract the feature names and their sums for the top features with values greater than 0
    top_features = [(feature_names[index], sums[index]) for index in top_indices if sums[index] > 0]
    top_features_per_cluster[cluster] = top_features

# First, calculate the average silhouette score for each cluster
average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()}

# Then, sort the clusters by their average silhouette score
sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True)

# Output clustering results, now sorted by the average silhouette score
print("\nClustering results by cluster:")
for cluster in sorted_clusters:
    features = top_features_per_cluster[cluster]
    average_score = average_scores[cluster]
    features_str = (f"{feature}: {value:.3f}" for feature, value in features)
    features_line = ', '.join(features_str)
    print(f"\nCluster {cluster} (features: {features_line}):")
    print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")

    # Prepare and sort titles within the cluster by their silhouette score
    titles_scores = []
    for title_index in clusters_indices[cluster]:
        title = results[title_index]
        silhouette_score = silhouette_vals[title_index]
        titles_scores.append((title, silhouette_score))
    sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True)

    # Print each title with its silhouette score
    for title, silhouette_score in sorted_titles_scores:
        print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})")
```
drew2a commented 6 months ago

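To achieve more specific clustering results, such as separate clusters for each "Ubuntu 20.04.X" point release instead of one general "Ubuntu 20.04" cluster, two HDBSCAN constructor parameters can be adjusted: min_samples and cluster_selection_epsilon. Increasing min_samples makes cluster formation stricter, and decreasing cluster_selection_epsilon keeps nearby clusters from being merged.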

Conversely, to configure HDBSCAN for creating more general groups, the same parameters can be adjusted in the opposite direction. Decreasing min_samples allows for more lenient cluster formation, potentially grouping various subversions of Ubuntu 20.04 into a single cluster. Similarly, increasing cluster_selection_epsilon encourages the merging of nearby clusters into larger, more general groups.

By fine-tuning these parameters, HDBSCAN can be tailored to identify clusters at the desired level of specificity, from highly detailed clusters differentiating between minor variations to broader groups encompassing more general categories.

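Below are two examples of the resulting clusters. The first is more specific (for instance, 20.04.1 and 20.04.2 form separate clusters); the second is more general (for instance, 22.04 and its point releases share one cluster):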

```
Cluster 9 (features: ):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu (Silhouette Score: 1.000)
    Ubuntu Linux ebook pack (Silhouette Score: 1.000)
    Ubuntu Linux основы администрирования (Silhouette Score: 1.000)
    Ubuntu Netbook Remix (Silhouette Score: 1.000)
    Ubuntu reducido (Silhouette Score: 1.000)
    Ubuntu-Book_RU.djvu (Silhouette Score: 1.000)
    ubuntu (Silhouette Score: 1.000)

Cluster 31 (features: 20 4: 2.019, 20: 1.977, 4: 1.007):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: 1.000)
    ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-20.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 0 (features: 11 10: 2.066, 11: 1.873, 10: 1.105):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: 1.000)
    ubuntu-11.10-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 1.000)

Cluster 1 (features: 12 10: 1.462, 12: 1.151, 10: 0.733):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 12.10 Desktop (i386) (Silhouette Score: 1.000)
    ubuntu-12.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 2 (features: 16 10: 2.937, 16: 2.152, 10: 1.656):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 16.10 (Silhouette Score: 1.000)
    ubuntu-16.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.10-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-16.10-server-arm64.iso (Silhouette Score: 1.000)

Cluster 21 (features: 4 1: 1.155, 1: 1.005, 20 4: 0.866, 20: 0.849, 4: 0.432):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 1.000)
    ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 23 (features: 2 0: 0.973, 0: 0.913, 2: 0.827, 4 2: 0.827, 20 4: 0.621):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 1.000)
    ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 28 (features: 3: 1.616, 4 3: 1.616, 20 4: 1.308, 20: 1.281, 4: 0.653):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: 1.000)
    ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-mate-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 8 (features: 9 10: 1.370, 9: 1.285, 10: 0.687):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 9.10 (Silhouette Score: 1.000)
    Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 1.000)

Cluster 3 (features: 2015: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 1.000)
    Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 1.000)

Cluster 22 (features: 2: 1.111, 4 2: 1.111, 20 4: 0.833, 20: 0.816, 4: 0.416):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu Server 20.04.2 LTS (Silhouette Score: 1.000)
    ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 25 (features: 4 5: 1.043, 5: 1.043, 16 4: 0.951, 16: 0.862, 4: 0.421):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu-16.04.5 (Silhouette Score: 1.000)
    ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 7 (features: 11 4: 1.481, 11: 1.259, 4: 0.471):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-11.04-alternate-i386.iso (Silhouette Score: 1.000)
    ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 11 (features: 12 4: 1.556, 12: 1.442, 4 5: 1.442, 5: 1.442, 4: 0.583):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 1.000)

Cluster 15 (features: 4 6: 1.044, 6: 1.044, 14 4: 0.945, 14: 0.873, 4: 0.407):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 1.000)
    ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 4 (features: 14 10: 2.226, 14: 1.621, 10: 1.191):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-14.10-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-14.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 13 (features: 15: 2.063, 15 4: 2.063, 4: 0.700):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-15.04-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-15.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 20 (features: 4 6: 1.589, 6: 1.589, 16 4: 1.399, 16: 1.268, 4: 0.619):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04.6-server-i386.iso (Silhouette Score: 1.000)

Cluster 10 (features: 4 7: 1.147, 7: 1.147, 16 4: 0.824, 16: 0.747, 4: 0.365):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 1.000)

Cluster 16 (features: 18 10: 1.504, 18: 1.081, 10: 0.754):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-18.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 19 (features: 19 10: 2.066, 19: 1.873, 10: 1.105):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 32 (features: 4 4: 1.792, 20 4: 1.399, 4: 1.397, 20: 1.371):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 29 (features: 21 10: 2.066, 21: 1.873, 10: 1.105):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-21.10-beta-pack (Silhouette Score: 1.000)
    ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 26 (features: 22 4: 1.406, 22: 1.312, 4: 0.548):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 27 (features: 22 4: 1.015, 3: 0.979, 4 3: 0.979, 22: 0.947, 4: 0.395):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 30 (features: 22 10: 1.477, 22: 1.126, 10: 0.741):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 17 (features: 18 4: 2.606, 18: 2.457, 4: 1.304, 4 4: 0.554):
Average Silhouette Score = 0.632 (the higher the better)
    Ubuntu-18.04 (Silhouette Score: 0.732)
    ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.732)
    ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.732)
    ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 0.332)

Cluster 24 (features: 16 4: 1.884, 16: 1.708, 4: 0.834, 3: 0.521, 4 3: 0.521):
Average Silhouette Score = 0.364 (the higher the better)
    ubuntu-16.04-desktop-i386.iso (Silhouette Score: 0.500)
    ubuntu-pack-16.04-unity (Silhouette Score: 0.500)
    ubuntu-16.04.3-server-amd64.iso (Silhouette Score: 0.091)

Cluster 14 (features: 14 4: 4.275, 14: 3.947, 4: 2.060, 4 4: 0.566, 4 1: 0.551):
Average Silhouette Score = 0.355 (the higher the better)
    ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.541)
    ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.541)
    ubuntu-14.04-server-amd64.ova (Silhouette Score: 0.541)
    ubuntu-14.04-server-i386.iso (Silhouette Score: 0.541)
    ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 0.224)
    ubuntu-14.04.5-server-amd64.iso (Silhouette Score: 0.067)
    ubuntu-14.04.1-server-amd64.iso (Silhouette Score: 0.034)

Cluster 12 (features: 12 4: 1.254, 12: 1.162, 4: 0.674, 4 4: 0.526):
Average Silhouette Score = 0.267 (the higher the better)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.338)
    ubuntu-12.04-server-i386.iso (Silhouette Score: 0.196)

Cluster 5 (features: 19 4: 1.481, 19: 1.259, 17 4: 0.720, 4: 0.682, 17: 0.662):
Average Silhouette Score = 0.116 (the higher the better)
    ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 0.311)
    ubuntu-19.04-server-amd64.iso (Silhouette Score: 0.311)
    ubuntu-17.04-server-amd64.iso (Silhouette Score: -0.275)

Cluster 6 (features: 21 4: 1.481, 21: 1.259, 23 4: 0.720, 4: 0.682, 23: 0.662):
Average Silhouette Score = 0.116 (the higher the better)
    ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 0.311)
    ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 0.311)
    ubuntu-23.04-live-server-amd64.iso (Silhouette Score: -0.275)

Cluster 18 (features: 18 4: 0.976, 18: 0.920, 3: 0.504, 4 3: 0.504, 4 5: 0.504):
Average Silhouette Score = -0.194 (the higher the better)
    ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.194)
    ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.194)

Cluster -1 (features: 1: 2.837, 10: 2.097, 4: 1.897, 2014: 1.267, 4 1: 1.066):
Average Silhouette Score = -0.305 (the higher the better)
    ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.254)
    Ubuntu Facile 01 2014.pdf (Silhouette Score: -0.258)
    Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.268)
    [Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.268)
    Ubuntu 10.04 Netbook (Silhouette Score: -0.268)
    ubuntu-10.10-xenon-beta5 (Silhouette Score: -0.271)
    Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.276)
    ubuntu-20.10-desktop-amd64.iso (Silhouette Score: -0.279)
    ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.280)
    ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: -0.280)
    Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.286)
    ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.286)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: -0.293)
    Ubuntu Unleashed 2019 Edition (Silhouette Score: -0.293)
    ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: -0.377)
    ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.399)
    ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: -0.428)
    ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: -0.433)
```
```
Cluster 5 (features: 9 10: 1.370, 9: 1.285, 10: 0.687):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 9.10 (Silhouette Score: 1.000)
    Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 1.000)

Cluster 14 (features: 4 6: 1.044, 6: 1.044, 14 4: 0.945, 14: 0.873, 4: 0.407):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 1.000)
    ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 0 (features: 15: 2.063, 15 4: 2.063, 4: 0.700):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-15.04-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-15.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 19 (features: 4 6: 1.589, 6: 1.589, 16 4: 1.399, 16: 1.268, 4: 0.619):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04.6-server-i386.iso (Silhouette Score: 1.000)

Cluster 9 (features: 4 7: 1.147, 7: 1.147, 16 4: 0.824, 16: 0.747, 4: 0.365):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 1.000)

Cluster 10 (features: 19 10: 2.066, 19: 1.873, 10: 1.105):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 22 (features: 4 4: 1.792, 20 4: 1.399, 4: 1.397, 20: 1.371):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 15 (features: 18 4: 2.606, 18: 2.457, 4: 1.304, 4 4: 0.554):
Average Silhouette Score = 0.632 (the higher the better)
    Ubuntu-18.04 (Silhouette Score: 0.732)
    ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.732)
    ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.732)
    ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 0.332)

Cluster 17 (features: 2: 1.938, 4 2: 1.938, 20 4: 1.454, 20: 1.424, 2 0: 0.973):
Average Silhouette Score = 0.489 (the higher the better)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 0.524)
    ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 0.524)
    Ubuntu Server 20.04.2 LTS (Silhouette Score: 0.454)
    ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 0.454)

Cluster 8 (features: 12 4: 2.810, 12: 2.604, 4 5: 1.442, 5: 1.442, 4: 1.257):
Average Silhouette Score = 0.455 (the higher the better)
    ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 0.596)
    ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 0.596)
    ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 0.596)
    ubuntu-12.04-server-i386.iso (Silhouette Score: 0.306)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.184)

Cluster 13 (features: 14 4: 4.275, 14: 3.947, 4: 2.060, 4 4: 0.566, 4 1: 0.551):
Average Silhouette Score = 0.381 (the higher the better)
    ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.541)
    ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.541)
    ubuntu-14.04-server-amd64.ova (Silhouette Score: 0.541)
    ubuntu-14.04-server-i386.iso (Silhouette Score: 0.541)
    ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 0.224)
    ubuntu-14.04.5-server-amd64.iso (Silhouette Score: 0.143)
    ubuntu-14.04.1-server-amd64.iso (Silhouette Score: 0.139)

Cluster 11 (features: 21 10: 2.066, 21: 1.873, 10: 1.477, 20 10: 0.805, 20: 0.462):
Average Silhouette Score = 0.362 (the higher the better)
    ubuntu-21.10-beta-pack (Silhouette Score: 0.562)
    ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 0.562)
    ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: 0.562)
    ubuntu-20.10-desktop-amd64.iso (Silhouette Score: -0.239)

Cluster 12 (features: 22 4: 3.414, 22: 3.187, 4: 1.330, 3: 0.979, 4 3: 0.979):
Average Silhouette Score = 0.243 (the higher the better)
    ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 0.395)
    ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 0.395)
    ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 0.251)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 0.251)
    ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: 0.107)
    ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: 0.059)

Cluster 18 (features: 16 4: 2.835, 16: 2.570, 4: 1.255, 4 5: 1.043, 5: 1.043):
Average Silhouette Score = 0.232 (the higher the better)
    Ubuntu-16.04.5 (Silhouette Score: 0.337)
    ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 0.337)
    ubuntu-16.04-desktop-i386.iso (Silhouette Score: 0.265)
    ubuntu-pack-16.04-unity (Silhouette Score: 0.265)
    ubuntu-16.04.3-server-amd64.iso (Silhouette Score: -0.041)

Cluster 4 (features: 21 4: 1.481, 21: 1.259, 19 4: 0.741, 4: 0.707, 19: 0.629):
Average Silhouette Score = 0.118 (the higher the better)
    ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 0.313)
    ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 0.313)
    ubuntu-19.04-server-amd64.iso (Silhouette Score: -0.272)

Cluster 3 (features: 14 10: 2.226, 14: 1.621, 11 4: 1.481, 11: 1.259, 10: 1.191):
Average Silhouette Score = -0.106 (the higher the better)
    ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 0.057)
    ubuntu-14.10-desktop-i386.iso (Silhouette Score: 0.057)
    ubuntu-14.10-server-amd64.iso (Silhouette Score: 0.057)
    ubuntu-11.04-alternate-i386.iso (Silhouette Score: -0.142)
    ubuntu-11.04-desktop-amd64.iso (Silhouette Score: -0.142)
    ubuntu-17.04-server-amd64.iso (Silhouette Score: -0.284)
    ubuntu-19.04-desktop-amd64.iso (Silhouette Score: -0.343)

Cluster 16 (features: 18 4: 0.976, 18: 0.920, 3: 0.504, 4 3: 0.504, 4 5: 0.504):
Average Silhouette Score = -0.194 (the higher the better)
    ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.194)
    ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.194)

Cluster -1 (features: 10: 2.910, 1: 2.022, 18: 1.990, 4: 1.748, 4 1: 1.696):
Average Silhouette Score = -0.271 (the higher the better)
    ubuntu-18.10-desktop-amd64.iso (Silhouette Score: -0.196)
    ubuntu-18.10-server-amd64.iso (Silhouette Score: -0.196)
    ubuntu-22.10-desktop-amd64.iso (Silhouette Score: -0.213)
    ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: -0.213)
    Ubuntu 10.04 Netbook (Silhouette Score: -0.255)
    ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: -0.257)
    ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.260)
    Ubuntu 12.10 Desktop (i386) (Silhouette Score: -0.268)
    ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.271)
    ubuntu-23.04-live-server-amd64.iso (Silhouette Score: -0.272)
    Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.285)
    Ubuntu 20.04.1 Desktop.iso (Silhouette Score: -0.324)
    ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: -0.324)
    ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: -0.352)
    ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.385)

Cluster 7 (features: 2015: 2.000, 1: 1.363, 10: 1.045, 2019: 1.000, 6685: 1.000):
Average Silhouette Score = -0.272 (the higher the better)
    Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: -0.214)
    Ubuntu Facile Marzo 2015.pdf (Silhouette Score: -0.214)
    Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.277)
    [Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.277)
    Ubuntu Facile 01 2014.pdf (Silhouette Score: -0.277)
    ubuntu-10.10-xenon-beta5 (Silhouette Score: -0.283)
    ubuntu-12.10-desktop-i386.iso (Silhouette Score: -0.283)
    Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.291)
    ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.291)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: -0.293)
    Ubuntu Unleashed 2019 Edition (Silhouette Score: -0.293)
```

The selected parameter values for the HDBSCAN constructor are not definitive but are intended to illustrate the potential for enhancing clustering quality through careful optimization. The optimal settings for these parameters can significantly vary, underscoring the importance of adjustment based on the specific clustering goals. Intuitively, the choice of these values should align with the user's objectives: broader topic identification might necessitate one set of parameters, while uncovering more detailed, dense information may require a different configuration.


drew2a commented 6 months ago

The next step was a closer look at vectorization algorithms, to determine whether anything more advanced than TF-IDF would suit our needs better. This led us to experiment with FastText, a word embedding technique that captures word semantics and relationships, which TF-IDF does not model. FastText trains a shallow neural network to produce word vectors that reflect both the context in which words appear and the morphology of the words themselves (through character n-grams).
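To make the morphology point concrete, here is a minimal pure-Python sketch of the character n-gram idea that FastText builds on. It is not the FastText implementation: the helper `char_ngrams` is hypothetical, and the default range of 3 to 6 characters is an assumption taken from FastText's usual defaults, not from the configuration used in this thread.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# Trigrams only, to keep the output short.
ubuntu = set(char_ngrams("ubuntu", 3, 3))
xubuntu = set(char_ngrams("xubuntu", 3, 3))

# Related words share subword units, so their vectors end up close together.
shared = sorted(ubuntu & xubuntu)
print(shared)  # ['bun', 'ntu', 'tu>', 'ubu', 'unt']
```

Because a word vector is composed from the vectors of such subword units, tokens like "ubuntu" and "xubuntu" are related even if one of them was rare or unseen during training.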

The results obtained with FastText were practically identical to those achieved with TF-IDF, with one key distinction: FastText analyzes all tokens present in a title, not just the numeric ones, as was the case in the previous TF-IDF version.
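The difference in token coverage can be seen with the regular expressions alone. The sketch below applies the numeric-only pattern from the TF-IDF script and a general word pattern to one preprocessed title; the sample title and the variable names are illustrative.

```python
import re

# Pattern used by the TF-IDF script: standalone runs of digits only.
NUMERIC_TOKENS = re.compile(r"(?u)\b\d+\b")
# A general word pattern, closer to what an embedding model sees.
ALL_TOKENS = re.compile(r"(?u)\b\w+\b")

# A title as it looks after preprocessing (lowercased, punctuation removed, leading zeros stripped).
title = "ubuntu 20 4 desktop amd64 iso"

numeric_only = NUMERIC_TOKENS.findall(title)
all_tokens = ALL_TOKENS.findall(title)

print(numeric_only)  # ['20', '4']
print(all_tokens)    # ['ubuntu', '20', '4', 'desktop', 'amd64', 'iso']
```

Note that "amd64" is dropped by the numeric pattern because its digits are not a standalone token, so flavor and architecture information ("desktop", "server", "amd64") only becomes available once all tokens are analyzed.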

```
Cluster 1:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu (Silhouette Score: 1.000)
    ubuntu (Silhouette Score: 1.000)

Cluster 6:
Average Silhouette Score = 0.668 (the higher the better)
    ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 0.669)
    ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 0.668)

Cluster 3:
Average Silhouette Score = 0.621 (the higher the better)
    Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 0.629)
    Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 0.613)

Cluster 15:
Average Silhouette Score = 0.571 (the higher the better)
    ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: 0.576)
    ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 0.565)

Cluster 7:
Average Silhouette Score = 0.565 (the higher the better)
    ubuntu-18.10-server-amd64.iso (Silhouette Score: 0.574)
    ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 0.556)

Cluster 18:
Average Silhouette Score = 0.524 (the higher the better)
    ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 0.530)
    ubuntu-19.04-server-amd64.iso (Silhouette Score: 0.517)

Cluster 23:
Average Silhouette Score = 0.520 (the higher the better)
    ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 0.521)
    ubuntu-mate-20.04.3-desktop-amd64.iso (Silhouette Score: 0.518)

Cluster 21:
Average Silhouette Score = 0.495 (the higher the better)
    ubuntu-16.04.6-server-i386.iso (Silhouette Score: 0.552)
    ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 0.483)
    ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 0.450)

Cluster 9:
Average Silhouette Score = 0.485 (the higher the better)
    ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 0.522)
    ubuntu-15.04-desktop-i386.iso (Silhouette Score: 0.468)
    ubuntu-15.04-server-amd64.iso (Silhouette Score: 0.463)

Cluster 8:
Average Silhouette Score = 0.467 (the higher the better)
    ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 0.488)
    ubuntu-14.10-server-amd64.iso (Silhouette Score: 0.477)
    ubuntu-14.10-desktop-i386.iso (Silhouette Score: 0.435)

Cluster 14:
Average Silhouette Score = 0.395 (the higher the better)
    ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 0.429)
    ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: 0.387)
    ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 0.368)

Cluster 22:
Average Silhouette Score = 0.388 (the higher the better)
    ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 0.418)
    ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 0.358)

Cluster 10:
Average Silhouette Score = 0.376 (the higher the better)
    ubuntu-16.10-desktop-i386.iso (Silhouette Score: 0.406)
    ubuntu-16.10-server-arm64.iso (Silhouette Score: 0.378)
    ubuntu-16.10-desktop-amd64.iso (Silhouette Score: 0.344)

Cluster 0:
Average Silhouette Score = 0.374 (the higher the better)
    Ubuntu 9.10 (Silhouette Score: 0.497)
    Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 0.251)

Cluster 11:
Average Silhouette Score = 0.344 (the higher the better)
    ubuntu-11.04-alternate-i386.iso (Silhouette Score: 0.379)
    ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 0.310)

Cluster 25:
Average Silhouette Score = 0.338 (the higher the better)
    ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 0.436)
    ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: 0.325)
    ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 0.253)

Cluster 20:
Average Silhouette Score = 0.321 (the higher the better)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 0.365)
    ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 0.278)

Cluster 28:
Average Silhouette Score = 0.313 (the higher the better)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 0.327)
    ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 0.300)

Cluster 27:
Average Silhouette Score = 0.261 (the higher the better)
    ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 0.284)
    ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 0.238)

Cluster 4:
Average Silhouette Score = 0.243 (the higher the better)
    ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 0.248)
    ubuntu-11.10-desktop-i386.iso (Silhouette Score: 0.238)

Cluster 24:
Average Silhouette Score = 0.242 (the higher the better)
    ubuntu-20.04-live-server-amd64.iso (Silhouette Score: 0.277)
    ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 0.207)

Cluster 17:
Average Silhouette Score = 0.239 (the higher the better)
    ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 0.414)
    ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 0.378)
    ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 0.361)
    ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 0.116)
    ubuntu-14.04.5-server-amd64.iso (Silhouette Score: -0.075)

Cluster 19:
Average Silhouette Score = 0.228 (the higher the better)
    ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.334)
    ubuntu-14.04-server-i386.iso (Silhouette Score: 0.302)
    ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.264)
    ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 0.014)

Cluster 16:
Average Silhouette Score = 0.211 (the higher the better)
    Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 0.296)
    ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 0.271)
    ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: 0.146)
    ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: 0.129)

Cluster 5:
Average Silhouette Score = 0.186 (the higher the better)
    Ubuntu 12.10 Desktop (i386) (Silhouette Score: 0.297)
    ubuntu-12.10-desktop-i386.iso (Silhouette Score: 0.074)

Cluster 12:
Average Silhouette Score = 0.157 (the higher the better)
    ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 0.273)
    ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: 0.263)
    ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: 0.076)
    ubuntu-20.10-desktop-amd64.iso (Silhouette Score: 0.016)

Cluster 26:
Average Silhouette Score = 0.126 (the higher the better)
    ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.314)
    ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.209)
    ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 0.082)
    ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.100)

Cluster 13:
Average Silhouette Score = 0.116 (the higher the better)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 0.311)
    ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 0.242)
    ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 0.043)
    ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: -0.132)

Cluster 2:
Average Silhouette Score = 0.113 (the higher the better)
    Ubuntu Linux основы администрирования (Silhouette Score: 0.195)
    Ubuntu Linux ebook pack (Silhouette Score: 0.175)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: 0.110)
    Ubuntu reducido (Silhouette Score: 0.063)
    Ubuntu Unleashed 2019 Edition (Silhouette Score: 0.020)

Cluster -1:
Average Silhouette Score = -0.382 (the higher the better)
    [Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.175)
    Ubuntu-Book_RU.djvu (Silhouette Score: -0.209)
    Ubuntu 10.04 Netbook (Silhouette Score: -0.231)
    Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.233)
    Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.249)
    Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.264)
    Ubuntu Facile 01 2014.pdf (Silhouette Score: -0.301)
    ubuntu-pack-16.04-unity (Silhouette Score: -0.311)
    Ubuntu Netbook Remix (Silhouette Score: -0.317)
    Ubuntu-16.04.5 (Silhouette Score: -0.330)
    ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.337)
    ubuntu-10.10-xenon-beta5 (Silhouette Score: -0.376)
    Ubuntu Server 20.04.2 LTS (Silhouette Score: -0.382)
    ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.386)
    Ubuntu-18.04 (Silhouette Score: -0.388)
    ubuntu-12.04-server-i386.iso (Silhouette Score: -0.406)
    ubuntu-14.04-server-amd64.ova (Silhouette Score: -0.411)
    ubuntu-17.04-server-amd64.iso (Silhouette Score: -0.418)
    Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: -0.422)
    Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: -0.425)
    ubuntu-14.04.1-server-amd64.iso (Silhouette Score: -0.433)
    ubuntu-16.04-desktop-i386.iso (Silhouette Score: -0.438)
    ubuntu-21.10-beta-pack (Silhouette Score: -0.448)
    ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.454)
    ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.458)
    Ubuntu 16.10 (Silhouette Score: -0.487)
    ubuntu-23.04-live-server-amd64.iso (Silhouette Score: -0.495)
    ubuntu-16.04.3-server-amd64.iso (Silhouette Score: -0.496)
    Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: -0.513)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: -0.514)
    ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.521)
```

The script:

```python
# This script performs cluster analysis using the HDBSCAN algorithm, enhanced by FastText vectorization, applied
# to text data.
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number,
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure
# of the text data and the effectiveness of the modified clustering strategy.
import re import string from collections import defaultdict from pathlib import Path import nltk import numpy as np from gensim.models import FastText from nltk.corpus import stopwords from nltk.stem import WordNetLemmatizer from sklearn.cluster import HDBSCAN from sklearn.metrics import silhouette_samples # Initialize NLTK resources nltk.download('stopwords') nltk.download('wordnet') # Load titles from a text file results = list(sorted( r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r )) def preprocess_text(text): # Remove text inside parentheses and brackets text = re.sub(r'\[.*?\]|\(.*?\)', '', text) # Convert to lowercase text = text.lower() # Replace punctuation and hyphens with spaces text = re.sub(r'[' + string.punctuation + ']', ' ', text) # Remove leading zeros text = re.sub(r'\b0+(\d+)\b', r'\1', text) # Remove stopwords stop_words = set(stopwords.words('english')) words = text.split() words = [word for word in words if word and word not in stop_words] # Lemmatize lemmatizer = WordNetLemmatizer() words = [lemmatizer.lemmatize(word) for word in words] return ' '.join(words) first_title = results[0] # Preprocess titles preprocessed_results = [preprocess_text(title) for title in results] def extract_digits(tokens): return [token for token in tokens if re.match(r'^\d+$', token)] digit_only_titles = [extract_digits(title.split()) for title in preprocessed_results] model = FastText(sentences=digit_only_titles, vector_size=100, window=5, min_count=1, workers=4) X = np.array([np.mean([model.wv[word] for word in title.split() if word in model.wv], axis=0) for title in preprocessed_results]) print("Clustering...") # Initialize and fit the HDBSCAN model hdbscan = HDBSCAN(min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0) hdbscan.fit(X) # Retrieve cluster labels labels = hdbscan.labels_ # Initialize dictionaries for storing clustering results clusters = defaultdict(list) clusters_indices = defaultdict(list) # Calculate silhouette scores 
for each data point in X based on their cluster membership silhouette_vals = silhouette_samples(X, labels, metric='euclidean') # Store silhouette scores for each cluster for later analysis cluster_silhouette_scores = defaultdict(list) # Group data points by their cluster label for i, label in enumerate(labels): clusters[label].append(results[i]) clusters_indices[label].append(i) cluster_silhouette_scores[label].append(silhouette_vals[i]) # Initialize a dictionary to store the sum of TF-IDF values for features by cluster cluster_feature_sums = defaultdict(lambda: np.zeros(X.shape[1])) # First, calculate the average silhouette score for each cluster average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()} # Then, sort the clusters by their average silhouette score sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True) # Output clustering results, now sorted by the average silhouette score print("\nClustering results by cluster:") for cluster in sorted_clusters: average_score = average_scores[cluster] print(f"\nCluster {cluster}:") print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)") # Prepare and sort titles within the cluster by their silhouette score titles_scores = [] for title_index in clusters_indices[cluster]: title = results[title_index] silhouette_score = silhouette_vals[title_index] titles_scores.append((title, silhouette_score)) sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True) # Print each title with its silhouette score for title, silhouette_score in sorted_titles_scores: print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})") ```
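
The title vectors in the script above are the mean of per-word vectors. A minimal sketch of that pooling step in plain Python, with a made-up lookup table (`toy_vectors`) standing in for the trained FastText model. The helper name `mean_pool`, the table, and its numbers are illustrative assumptions, not values from the experiment:

```python
def mean_pool(tokens, vectors, dim):
    """Average the vectors of the tokens that have an entry in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No usable token: return a zero vector instead of averaging an empty list.
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional token vectors (a real model would learn these).
toy_vectors = {
    "20": [1.0, 0.0],
    "4": [0.0, 1.0],
    "22": [3.0, 0.0],
}

title_a = mean_pool("ubuntu 20 4 desktop amd64 iso".split(), toy_vectors, dim=2)
title_b = mean_pool("ubuntu 22 4 desktop amd64 iso".split(), toy_vectors, dim=2)
no_digits = mean_pool("ubuntu netbook remix".split(), toy_vectors, dim=2)
print(title_a, title_b, no_digits)
```

One difference from this toy table is worth keeping in mind: gensim's FastText can usually build a vector for an unseen word from its character n-grams, so in the script non-digit words such as `desktop` probably contribute to the mean as well, even though the model was trained on digit tokens only.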

Ref:

drew2a commented 6 months ago

This experiment tried a transformer, specifically BERT (`bert-base-uncased`), to vectorize the titles for clustering: BERT's tokenizer plus the mean of the model's last hidden states. In this setup BERT gave worse clusters than analyzing either the entire titles or only the extracted digits. It was also noticeably slower and more resource-intensive.

I acknowledge that I haven't delved deeply into BERT's intricacies, as it's a complex model, and gaining a thorough understanding would require a substantial investment of time. My goal was to create a simple, almost out-of-the-box example to get a sense of how it operates.

```
Cluster 30:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu (Silhouette Score: 1.000)
    ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 12:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 10.04 Netbook (Silhouette Score: 1.000)
    Ubuntu-18.04 (Silhouette Score: 1.000)
    Ubuntu-Book_RU.djvu (Silhouette Score: 1.000)

Cluster 11:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: 1.000)
    ubuntu-11.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 10:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 12.10 Desktop (i386) (Silhouette Score: 1.000)
    ubuntu-16.04.3-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 19:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 16.10 (Silhouette Score: 1.000)
    ubuntu-18.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 3:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 1.000)
    ubuntu-19.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 22:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 1.000)
    ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 5:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: 1.000)
    Ubuntu 9.10 (Silhouette Score: 1.000)

Cluster 20:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu Linux ebook pack (Silhouette Score: 1.000)
    ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 26:
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu Server 20.04.2 LTS (Silhouette Score: 1.000)
    ubuntu-14.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 35:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-14.04-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 7:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.04-server-amd64.ova (Silhouette Score: 1.000)
    ubuntu-14.04-server-i386.iso (Silhouette Score: 1.000)
    ubuntu-14.04.1-server-amd64.iso (Silhouette Score: 1.000)

Cluster 33:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-14.04.5-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 1.000)

Cluster 6:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-23.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 34:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.10-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-15.04-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 9:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-15.04-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 13:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 23:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 21:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 31:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 28:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 17:
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 32:
Average Silhouette Score = 0.645 (the higher the better)
    ubuntu-12.04-server-i386.iso (Silhouette Score: 0.773)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.773)
    ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 0.773)
    ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 0.773)
    ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 0.131)

Cluster 8:
Average Silhouette Score = 0.526 (the higher the better)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: 0.681)
    ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 0.681)
    ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 0.681)
    ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 0.058)

Cluster 25:
Average Silhouette Score = 0.519 (the higher the better)
    ubuntu-10.10-xenon-beta5 (Silhouette Score: 0.689)
    ubuntu-11.04-alternate-i386.iso (Silhouette Score: 0.689)
    ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 0.689)
    ubuntu-13.04-desktop-i386.iso (Silhouette Score: 0.008)

Cluster 14:
Average Silhouette Score = 0.482 (the higher the better)
    Ubuntu reducido (Silhouette Score: 0.597)
    Ubuntu-16.04.5 (Silhouette Score: 0.597)
    Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: 0.251)

Cluster 2:
Average Silhouette Score = 0.435 (the higher the better)
    Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 0.517)
    Ubuntu Facile 04 2014.pdf (Silhouette Score: 0.517)
    Ubuntu Satanic Edition 666.4 (Silhouette Score: 0.270)

Cluster 24:
Average Silhouette Score = 0.428 (the higher the better)
    ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 0.545)
    ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 0.545)
    ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 0.194)

Cluster 29:
Average Silhouette Score = 0.400 (the higher the better)
    ubuntu-21.10-beta-pack (Silhouette Score: 0.530)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 0.530)
    ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 0.139)

Cluster 18:
Average Silhouette Score = 0.363 (the higher the better)
    ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.516)
    ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.516)
    ubuntu-16.04.6-server-i386.iso (Silhouette Score: 0.057)

Cluster 1:
Average Silhouette Score = 0.304 (the higher the better)
    Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 0.307)
    Ubuntu Facile 01 2014.pdf (Silhouette Score: 0.300)

Cluster 4:
Average Silhouette Score = 0.295 (the higher the better)
    Ubuntu Netbook Remix (Silhouette Score: 0.338)
    ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: 0.253)

Cluster 0:
Average Silhouette Score = 0.264 (the higher the better)
    Ubuntu Linux основы администрирования (Silhouette Score: 0.312)
    Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 0.216)

Cluster 15:
Average Silhouette Score = 0.171 (the higher the better)
    ubuntu-12.10-desktop-i386.iso (Silhouette Score: 0.191)
    ubuntu (Silhouette Score: 0.152)

Cluster 27:
Average Silhouette Score = 0.135 (the higher the better)
    ubuntu-16.10-server-arm64.iso (Silhouette Score: 0.209)
    ubuntu-16.10-desktop-i386.iso (Silhouette Score: 0.156)
    ubuntu-17.04-server-amd64.iso (Silhouette Score: 0.041)

Cluster 16:
Average Silhouette Score = 0.048 (the higher the better)
    ubuntu-20.10-desktop-amd64.iso (Silhouette Score: 0.130)
    ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 0.053)
    ubuntu-21.04-desktop-amd64.iso (Silhouette Score: -0.040)

Cluster -1:
Average Silhouette Score = -0.466 (the higher the better)
    Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.285)
    ubuntu-16.10-desktop-amd64.iso (Silhouette Score: -0.450)
    [Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.480)
    ubuntu-14.10-desktop-amd64.iso (Silhouette Score: -0.486)
    Ubuntu Unleashed 2019 Edition (Silhouette Score: -0.496)
    ubuntu-20.04-live-server-amd64.iso (Silhouette Score: -0.520)
    ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.547)
```

The script:

```python
# This script performs cluster analysis using the HDBSCAN algorithm, enhanced by BERT embeddings, applied
# to text data.
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number,
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure
# of the text data and the effectiveness of the modified clustering strategy.
import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
import numpy as np
import torch
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import HDBSCAN
from sklearn.metrics import silhouette_samples
from transformers import BertModel, BertTokenizer

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


first_title = results[0]

# Preprocess titles
print("Preprocessing titles...")
preprocessed_results = [preprocess_text(title) for title in results]

print("Loading BERT model and tokenizer...")
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)


def get_bert_embedding(text):
    # Mean-pool the last hidden states into a single vector per input text
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embeddings


def extract_digits(text):
    return ' '.join(re.findall(r'\b\d+\b', text))


print("Loading and preprocessing titles...")
digit_only_titles = [extract_digits(title) for title in preprocessed_results]

print("Transforming titles to BERT embeddings...")
# Caution: titles without digits are skipped here, so X can have fewer rows than `results`,
# while the loops below index `results` by row position.
X = np.array([get_bert_embedding(title) for title in digit_only_titles if title.strip() != ''])

print("Clustering...")

# Initialize and fit the HDBSCAN model
hdbscan = HDBSCAN(min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0)
hdbscan.fit(X)

# Retrieve cluster labels
labels = hdbscan.labels_

# Initialize dictionaries for storing clustering results
clusters = defaultdict(list)
clusters_indices = defaultdict(list)

# Calculate silhouette scores for each data point in X based on their cluster membership
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')

# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)

# Group data points by their cluster label
for i, label in enumerate(labels):
    clusters[label].append(results[i])
    clusters_indices[label].append(i)
    cluster_silhouette_scores[label].append(silhouette_vals[i])

# Initialize a dictionary to store the sum of TF-IDF values for features by cluster
cluster_feature_sums = defaultdict(lambda: np.zeros(X.shape[1]))

# First, calculate the average silhouette score for each cluster
average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()}

# Then, sort the clusters by their average silhouette score
sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True)

# Output clustering results, now sorted by the average silhouette score
print("\nClustering results by cluster:")
for cluster in sorted_clusters:
    average_score = average_scores[cluster]
    print(f"\nCluster {cluster}:")
    print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")

    # Prepare and sort titles within the cluster by their silhouette score
    titles_scores = []
    for title_index in clusters_indices[cluster]:
        title = results[title_index]
        silhouette_score = silhouette_vals[title_index]
        titles_scores.append((title, silhouette_score))
    sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True)

    # Print each title with its silhouette score
    for title, silhouette_score in sorted_titles_scores:
        print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})")
```
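
One detail of the BERT script is worth checking before reading too much into its output. `X` is built only from titles whose digit string is non-empty, while the grouping loop looks titles up with `results[i]`. If any title has no digits, the rows of `X` and the entries of `results` may no longer line up, and a label can end up printed next to the wrong title. A minimal sketch of one way to keep them aligned; `keep_titles_with_digits` is a hypothetical helper, not part of the original script, and the sample titles are made up:

```python
import re


def extract_digits(text):
    # Same idea as in the script: keep only standalone digit groups.
    return ' '.join(re.findall(r'\b\d+\b', text))


def keep_titles_with_digits(titles):
    """Return (kept_titles, digit_strings) filtered together so indices match."""
    kept_titles, digit_strings = [], []
    for title in titles:
        digits = extract_digits(title)
        if digits.strip():
            kept_titles.append(title)
            digit_strings.append(digits)
    return kept_titles, digit_strings


titles = [
    "ubuntu",
    "ubuntu 18 4 6 desktop amd64 iso",
    "ubuntu netbook remix",
    "ubuntu 18 10 desktop amd64 iso",
]
kept_titles, digit_strings = keep_titles_with_digits(titles)
for title, digits in zip(kept_titles, digit_strings):
    print(f"{digits!r} <- {title!r}")
```

With this shape, the embeddings would be computed from `digit_strings`, and every later lookup would use `kept_titles[i]` instead of `results[i]`.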

Ref:

drew2a commented 6 months ago

The final improvement in this iteration modified the standard TF-IDF algorithm to account for the position of tokens. It gave better results than the other experiments, on par with N-grams with TF-IDF. By encoding the position, we can tell identical tokens apart by where they occur in the title, which captures more of the title's structure. Our implementation is naive and probably not the most efficient, but it works as a quick, straightforward prototype.

The trick is:

    from sklearn.feature_extraction.text import TfidfVectorizer


    class PositionalDigitTfidfVectorizer(TfidfVectorizer):
        def build_analyzer(self):
            # Custom analyzer that extracts digits and their position within the document
            def positional_digit_analyzer(doc):
                # Splitting the document into words
                words = doc.split()
                # Initializing a list to store digits with their position
                positional_digits = []
                # Iterating over words and their indexes in the list
                for index, word in enumerate(words, start=1):  # Indexing starts from 1
                    if word.isdigit():  # Checking if the word consists only of digits
                        # Adding the positional token in the format "digit_position_in_text"
                        positional_digits.append(f"{word}_{index}")
                        # Adding the bare digit as well, so titles still share a position-free token
                        positional_digits.append(word)
                return positional_digits

            return positional_digit_analyzer

This modification leads to tokens that look like:

    Original:     ubuntu-mate-20.04.4-desktop-amd64.iso
    Preprocessed: ubuntu mate 20 4 4 desktop amd64 iso
    TF-IDF:
            4_5: 0.562
            20_3: 0.527
            4_4: 0.393
            4: 0.358
            20: 0.351

    Original:     ubuntu-mate-21.10-desktop-amd64.iso
    Preprocessed: ubuntu mate 21 10 desktop amd64 iso
    TF-IDF:
            21_3: 0.624
            10_4: 0.538
            21: 0.488
            10: 0.288
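
The token lists above can be reproduced without scikit-learn. A minimal sketch, assuming a simplified `preprocess` (lowercase, punctuation to spaces, leading zeros stripped; the stopword removal and lemmatization from the scripts are left out) and a standalone copy of the analyzer logic. Both function names are illustrative:

```python
import re
import string


def preprocess(title):
    """Simplified preprocessing: lowercase, punctuation to spaces, strip leading zeros."""
    text = title.lower()
    text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    return ' '.join(text.split())


def positional_digit_tokens(doc):
    """Emit "<digit>_<position>" and the bare digit for every all-digit word."""
    tokens = []
    for index, word in enumerate(doc.split(), start=1):
        if word.isdigit():
            tokens.append(f"{word}_{index}")
            tokens.append(word)
    return tokens


mate_tokens = positional_digit_tokens(preprocess("ubuntu-mate-20.04.4-desktop-amd64.iso"))
plain_tokens = positional_digit_tokens(preprocess("ubuntu-20.04.4-desktop-amd64.iso"))
print(mate_tokens)
print(plain_tokens)
```

The flavor word `mate` shifts every position by one, so these two titles share only the bare tokens `20` and `4`. That may be part of why flavor builds such as `ubuntu-mate-*` do not always land in the same cluster as the plain release.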

For further refinement of the algorithm, the paper "Optimized TF-IDF Algorithm with the Adaptive Weight of Position of Word" (https://www.atlantis-press.com/proceedings/aiie-16/25866330) could serve as a reference. It points to more advanced strategies for incorporating positional information into text vectorization, which could make our approach more effective.

```
Cluster 4 (features: ):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu (Silhouette Score: 1.000)
    Ubuntu Linux ebook pack (Silhouette Score: 1.000)
    Ubuntu Linux основы администрирования (Silhouette Score: 1.000)
    Ubuntu Netbook Remix (Silhouette Score: 1.000)
    Ubuntu reducido (Silhouette Score: 1.000)
    Ubuntu-Book_RU.djvu (Silhouette Score: 1.000)
    ubuntu (Silhouette Score: 1.000)

Cluster 31 (features: 20_2: 1.950, 20: 1.827, 4_3: 0.997, 4: 0.931):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: 1.000)
    ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-20.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 14 (features: 11: 1.812, 11_2: 1.812, 10_3: 1.135, 10: 1.069):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: 1.000)
    ubuntu-11.10-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 1.000)

Cluster 18 (features: 12: 1.182, 12_2: 1.182, 10_3: 0.799, 10: 0.753):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 12.10 Desktop (i386) (Silhouette Score: 1.000)
    ubuntu-12.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 26 (features: 1_4: 1.058, 1: 0.993, 20_2: 0.895, 20: 0.839, 4_3: 0.458):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 1.000)
    ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 22 (features: 3_4: 1.098, 3: 1.018, 20_2: 0.861, 20: 0.807, 4_3: 0.441):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: 1.000)
    ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 8 (features: 9_2: 1.287, 9: 1.207, 10_3: 0.685, 10: 0.646):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu 9.10 (Silhouette Score: 1.000)
    Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 1.000)

Cluster 5 (features: 2015: 1.414, 2015_4: 1.414):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 1.000)
    Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 1.000)

Cluster 30 (features: 5: 1.033, 5_4: 1.033, 16_2: 0.874, 16: 0.854, 4_3: 0.447):
Average Silhouette Score = 1.000 (the higher the better)
    Ubuntu-16.04.5 (Silhouette Score: 1.000)
    ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 13 (features: 11: 1.318, 11_2: 1.318, 4_3: 0.529, 4: 0.494):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-11.04-alternate-i386.iso (Silhouette Score: 1.000)
    ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 12 (features: 12: 1.438, 12_2: 1.438, 5: 1.438, 5_4: 1.438, 4_3: 0.623):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 1.000)

Cluster 21 (features: 6: 1.036, 6_4: 1.036, 14: 0.866, 14_2: 0.866, 4_3: 0.433):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 1.000)
    ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 28 (features: 14: 1.691, 14_2: 1.691, 10_3: 1.319, 10: 1.242):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-14.10-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-14.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 25 (features: 6: 1.575, 6_4: 1.575, 16_2: 1.285, 16: 1.257, 4_3: 0.658):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)
    ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04.6-server-i386.iso (Silhouette Score: 1.000)

Cluster 11 (features: 7: 1.138, 7_4: 1.138, 16_2: 0.759, 16: 0.742, 4_3: 0.388):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 1.000)

Cluster 24 (features: 18: 1.148, 18_2: 1.148, 10_3: 0.850, 10: 0.801):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-18.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 2 (features: 19_2: 1.352, 19: 1.292, 4_3: 0.518, 4: 0.484):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-19.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 3 (features: 19_2: 1.243, 19: 1.188, 10_3: 0.744, 10: 0.701):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 32 (features: 4_4: 1.030, 20_2: 0.981, 4: 0.937, 20: 0.919, 4_3: 0.502):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 6 (features: 21_2: 1.352, 21: 1.292, 4_3: 0.518, 4: 0.484):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 1.000)
    ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 7 (features: 21_2: 1.243, 21: 1.188, 10_3: 0.744, 10: 0.701):
Average Silhouette Score = 1.000 (the higher the better)
    ubuntu-21.10-beta-pack (Silhouette Score: 1.000)
    ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 27 (features: 14: 3.036, 14_2: 3.036, 4: 1.644, 4_3: 1.517, 4_4: 0.502):
Average Silhouette Score = 0.725 (the higher the better)
    ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.810)
    ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.810)
    ubuntu-14.04-server-amd64.ova (Silhouette Score: 0.810)
    ubuntu-14.04-server-i386.iso (Silhouette Score: 0.810)
    ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 0.385)

Cluster 23 (features: 18: 2.431, 18_2: 2.431, 4: 1.299, 4_3: 1.153, 4_4: 0.490):
Average Silhouette Score = 0.657 (the higher the better)
    Ubuntu-18.04 (Silhouette Score: 0.744)
    ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.744)
    ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.744)
    ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 0.396)

Cluster 29 (features: 16_2: 2.890, 16: 2.825, 10_3: 1.798, 10: 1.693, 4_3: 0.327):
Average Silhouette Score = 0.652 (the higher the better)
    Ubuntu 16.10 (Silhouette Score: 0.807)
    ubuntu-16.10-desktop-amd64.iso (Silhouette Score: 0.807)
    ubuntu-16.10-desktop-i386.iso (Silhouette Score: 0.807)
    ubuntu-16.10-server-arm64.iso (Silhouette Score: 0.807)
    ubuntu-16.04-desktop-i386.iso (Silhouette Score: 0.033)

Cluster 20 (features: 2_4: 1.399, 2: 1.337, 20_2: 1.048, 20: 0.982, 0_5: 0.949):
Average Silhouette Score = 0.496 (the higher the better)
    Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 0.653)
    ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 0.653)
    ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 0.183)

Cluster 15 (features: 17: 1.300, 17_2: 1.300, 10_3: 0.334, 10: 0.315, 4_3: 0.229):
Average Silhouette Score = 0.441 (the higher the better)
    ubuntu-17.04-server-amd64.iso (Silhouette Score: 0.441)
    ubuntu-17.10-desktop-amd64.iso (Silhouette Score: 0.441)

Cluster 16 (features: 23: 1.300, 23_2: 1.300, 10_3: 0.334, 10: 0.315, 4_3: 0.229):
Average Silhouette Score = 0.441 (the higher the better)
    ubuntu-23.04-live-server-amd64.iso (Silhouette Score: 0.441)
    ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: 0.441)

Cluster 0 (features: 10_2: 1.486, 10: 1.077, 10_3: 0.352, 4_3: 0.281, 4: 0.263):
Average Silhouette Score = 0.399 (the higher the better)
    Ubuntu 10.04 Netbook (Silhouette Score: 0.399)
    ubuntu-10.10-xenon-beta5 (Silhouette Score: 0.399)

Cluster 17 (features: 12: 1.177, 12_2: 1.177, 4: 0.688, 4_3: 0.510, 4_4: 0.466):
Average Silhouette Score = 0.302 (the higher the better)
    ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.384)
    ubuntu-12.04-server-i386.iso (Silhouette Score: 0.220)

Cluster 19 (features: 22_2: 3.424, 22: 3.196, 4_3: 1.175, 4: 1.096, 2_4: 0.515):
Average Silhouette Score = 0.195 (the higher the better)
    ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 0.419)
    ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 0.419)
    ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 0.166)
    ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: 0.079)
    ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: 0.073)
    ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 0.017)

Cluster 10 (features: 1_3: 1.215, 2014: 1.203, 2014_4: 1.203, 1: 0.898, 4_4: 0.479):
Average Silhouette Score = -0.016 (the higher the better)
    Ubuntu Facile 01 2014.pdf (Silhouette Score: 0.153)
    Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.065)
    ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.137)

Cluster 9 (features: 20_3: 1.513, 4_4: 1.489, 3_5: 1.049, 20: 1.008, 4: 0.856):
Average Silhouette Score = -0.046 (the higher the better)
    ubuntu-mate-20.04.3-desktop-amd64.iso (Silhouette Score: 0.075)
    ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: -0.063)
    Ubuntu Server 20.04.2 LTS (Silhouette Score: -0.067)
    ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: -0.130)

Cluster 1 (features: 15: 2.001, 15_2: 2.001, 4_3: 0.940, 4: 0.877, 1_4: 0.877):
Average Silhouette Score = -0.176 (the higher the better)
    ubuntu-15.04-desktop-amd64.iso (Silhouette Score: -0.000)
    ubuntu-15.04-desktop-i386.iso (Silhouette Score: -0.000)
    ubuntu-15.04-server-amd64.iso (Silhouette Score: -0.000)
    Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.272)
    [Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.272)
    ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.278)
    Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: -0.293)
    Ubuntu Unleashed 2019 Edition (Silhouette Score: -0.293)

Cluster -1 (features: 18: 1.811, 18_2: 1.811, 4: 1.806, 10_4: 1.646, 4_3: 1.523):
Average Silhouette Score = -0.338 (the higher the better)
    ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: -0.261)
    ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: -0.263)
    ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: -0.263)
    ubuntu-pack-16.04-unity (Silhouette Score: -0.277)
    Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.284)
    ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.350)
    ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.364)
    ubuntu-16.04.3-server-amd64.iso (Silhouette Score: -0.366)
    ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.372)
    ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: -0.375)
    ubuntu-14.04.5-server-amd64.iso (Silhouette Score: -0.383)
    ubuntu-14.04.1-server-amd64.iso (Silhouette Score: -0.393)
    ubuntu-20.10-desktop-amd64.iso (Silhouette Score: -0.439)
```
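
Many clusters in the output above score exactly 1.000. For a single point the silhouette is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. When every title in a cluster maps to the same vector, a is 0 and s is 1 no matter how close the next cluster is. A plain-Python sketch of that calculation on made-up 2-D points, meant as a stand-in for the `sklearn.metrics.silhouette_samples` call used by the scripts in this thread rather than a copy of the library's implementation:

```python
import math


def silhouette_by_hand(points, labels):
    """Per-point silhouette with Euclidean distance; assumes at least two clusters."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        own = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == label and j != i]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


labels = [0, 0, 1, 1]
identical = silhouette_by_hand([(0, 0), (0, 0), (3, 4), (3, 4)], labels)
spread = silhouette_by_hand([(0, 0), (1, 0), (5, 0), (6, 0)], labels)
print(identical)
print([round(s, 3) for s in spread])
```

A score of 1.000 therefore says that the members are indistinguishable under the chosen features, which is expected when only digit tokens are kept. It does not by itself say that the grouping is the one a user would want.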

Disclaimer: The code for the scripts above was generated by ChatGPT.