Description:

Create a Python script or Jupyter notebook to analyze and categorize user queries to IQ GPT, providing insights into the most requested topics and types of questions.

Requirements:

Implement clustering algorithms to group similar queries
Analyze query patterns to identify most common topics
Categorize queries based on their type (e.g., price inquiries, technical questions, general information)
Generate visualizations to represent the clustered data
Provide summary statistics on query distribution

Technical Details:

Use Python for implementation
Utilize appropriate clustering tools and libraries (e.g., scikit-learn, NLTK)
Ensure the script can handle large datasets efficiently
Include data preprocessing steps for cleaning and normalizing queries

Expected Outputs:

A Python script or Jupyter notebook with well-commented code
Visualizations of clustered data (e.g., dendrograms, scatter plots)
Summary report of findings

DATASET EDA 📊

1. Dataset Overview

Total number of records: 39,316
Columns: 'Query' and 'Language'

2. Data Types:

Both 'Query' and 'Language' columns are of type 'object' (likely strings)

3. Missing Values:

'Query' column: 301 missing values
'Language' column: No missing values

4. Language Distribution:

English (en): 36,521 (92.89%)
Chinese (zh): 1,926 (4.90%)
Korean (kr): 854 (2.17%)
Japanese (ja): 15 (0.04%)

5. Query Characteristics:

Number of unique queries: 25,981
Most frequent query appears 635 times

6. Query Length Distribution:

The histogram showed that most queries are relatively short
There's a long tail of longer queries, but they are less common

7. Sample Data: We saw a few sample queries, which were primarily questions about cryptocurrency topics, such as:

"What is the best RPC provider for ethereum ?"
"What is manifold finance ?"
"What is bitcoin ?"
"What is Coinbase ?"
"Where is ethereum conference held ?"

8. Data Quality:

The data appears to be clean and well-structured
No obvious encoding issues or unexpected data types were observed
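The overview numbers above could be reproduced with a few pandas calls. This is a minimal sketch, not the notebook's actual code: the helper name `summarize_queries` and the tiny sample frame are made up for illustration, and only the 'Query' and 'Language' column names come from the dataset description.

```python
import pandas as pd


def summarize_queries(df: pd.DataFrame) -> dict:
    """Compute EDA-style summary statistics (hypothetical helper)."""
    queries = df["Query"].dropna()
    counts = queries.value_counts()  # sorted by frequency, descending
    lengths = queries.str.len()
    return {
        "n_records": len(df),
        "missing_queries": int(df["Query"].isna().sum()),
        "missing_languages": int(df["Language"].isna().sum()),
        # Share of each language as a percentage of all records
        "language_share_pct": (
            df["Language"].value_counts(normalize=True) * 100
        ).round(2).to_dict(),
        "n_unique_queries": int(queries.nunique()),
        "top_query_count": int(counts.iloc[0]) if len(counts) else 0,
        "median_length": float(lengths.median()) if len(lengths) else 0.0,
    }


# Made-up sample rows, only to show the shape of the output
sample = pd.DataFrame(
    {
        "Query": ["What is bitcoin ?", "What is bitcoin ?", None, "比特币的价格"],
        "Language": ["en", "en", "en", "zh"],
    }
)
stats = summarize_queries(sample)
print(stats)
```

On the real CSV, the same function would be called on the frame returned by `pd.read_csv`.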

IMPLEMENTATION DETAILS 🛠️

We followed an artifact-creation approach: the analysis and visualization are implemented as custom Python functions wrapped in a modular script. This makes them easy to reuse, modify, and integrate into larger systems or notebooks. We could have used a pipeline architecture or an object-oriented design, but a functional approach is straightforward and flexible enough for this analysis.

Data Loading and Preprocessing ✨

Technique: Pandas for data loading, text preprocessing using Python string methods and regular expressions.

Reason: Pandas is efficient for handling structured data like CSV files. For text preprocessing, built-in Python methods are fast and flexible for operations like lowercasing and removing special characters.

Alternative: We could have used databases like SQLite for data loading, but Pandas is more suitable for in-memory processing of moderately sized datasets.
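A cleaning step along these lines might look as follows. It is a sketch under assumptions: the exact rules used in the notebook are not shown in this issue, so the regular expressions, the `clean_query` and `preprocess` helper names, and the `clean_query` output column are all illustrative. The 301 rows with a missing 'Query' are simply dropped here.

```python
import re

import pandas as pd

_URL_RE = re.compile(r"https?://\S+")
# \w is Unicode-aware for str patterns, so Chinese/Korean/Japanese text is kept
_NON_WORD_RE = re.compile(r"[^\w\s]")
_WS_RE = re.compile(r"\s+")


def clean_query(text: str) -> str:
    """Lowercase, drop URLs and punctuation, and collapse whitespace."""
    text = text.lower()
    text = _URL_RE.sub(" ", text)
    text = _NON_WORD_RE.sub(" ", text)
    return _WS_RE.sub(" ", text).strip()


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing queries and add a normalized text column."""
    out = df.dropna(subset=["Query"]).copy()
    out["clean_query"] = out["Query"].astype(str).map(clean_query)
    # Queries that were only punctuation or URLs end up empty; drop them too
    return out[out["clean_query"] != ""].reset_index(drop=True)


raw = pd.DataFrame(
    {
        "Query": ["What is  Coinbase ?", None, "比特币的价格?", "???"],
        "Language": ["en", "en", "zh", "en"],
    }
)
cleaned = preprocess(raw)
print(cleaned["clean_query"].tolist())
```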

Text Vectorization 🔤

Technique: TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer

Reason: TF-IDF is excellent for converting text data into numerical features. It captures the importance of words in documents relative to the entire corpus, which is crucial for understanding the significance of terms in queries.

Alternative: We could have used simpler methods like CountVectorizer or more complex ones like Word2Vec. However, TF-IDF provides a good balance between simplicity and effectiveness, especially for short texts like queries.
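As a rough illustration, TF-IDF vectorization with scikit-learn could look like this. The parameter choices (unigrams plus bigrams, English stop words, sublinear term frequency) are assumptions made for the sketch, not settings confirmed by the notebook, and the four queries are taken from the sample data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

queries = [
    "what is bitcoin",
    "what is the price of bitcoin",
    "what is coinbase",
    "where is ethereum conference held",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams (assumed setting)
    stop_words="english",  # only removes English function words
    sublinear_tf=True,     # dampen repeated terms within a query
)
X = vectorizer.fit_transform(queries)  # sparse matrix: one row per query

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Note that the default tokenizer splits on word boundaries, so a Chinese query written without spaces is treated as a single token. Separate tokenization (word segmentation) would be needed for finer-grained Chinese or Japanese features.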

Dimensionality Reduction 📉

Techniques: PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding)

Reason:

PCA: Used for initial dimensionality reduction. It's fast and preserves global structure, making it suitable for visualizing overall data distribution.
t-SNE: Used for more detailed visualization. It's excellent at preserving local structures and revealing clusters in high-dimensional data.

Alternative: UMAP (Uniform Manifold Approximation and Projection) could have been used instead of t-SNE. While potentially faster, we chose t-SNE for its established reputation in visualizing clusters in text data.
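A two-stage reduction can be sketched as below. One deliberate swap: the first stage uses `TruncatedSVD` rather than `PCA`, because it operates directly on sparse TF-IDF matrices without densifying them; it plays the same role here as the PCA step described above. The toy queries, component counts, and perplexity are illustrative assumptions only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

coins = ["bitcoin", "ethereum", "cardano", "frax", "xrp", "solana"]
queries = [f"what is the price of {c}" for c in coins] + [
    f"who created {c}" for c in coins
]

X = TfidfVectorizer().fit_transform(queries)  # sparse TF-IDF matrix

# Stage 1: linear reduction to a handful of dense components
svd = TruncatedSVD(n_components=5, random_state=42)
reduced = svd.fit_transform(X)

# Stage 2: non-linear 2-D embedding for plotting.
# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=4, init="pca", random_state=42)
embedding = tsne.fit_transform(reduced)

print(reduced.shape, embedding.shape)
```

On the full dataset, the first stage would typically keep more components (for example 50) and t-SNE would use a larger perplexity; both values would need tuning.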

Clustering Algorithm 🧩

Technique: K-means clustering

Reason: K-means is efficient, scalable, and works well with numerical data (our TF-IDF vectors). It's particularly good for finding spherical clusters and is interpretable.

Alternative: We could have used hierarchical clustering or DBSCAN. However, K-means is faster for large datasets and doesn't require distance threshold tuning like DBSCAN.
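A minimal K-means run on TF-IDF vectors might look like this. The toy queries and `n_clusters=2` are chosen only so the example is easy to inspect; they are not the settings used in the analysis.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

coins = ["bitcoin", "ethereum", "cardano", "frax", "xrp", "solana"]
queries = [f"what is the price of {c}" for c in coins] + [
    f"who created {c}" for c in coins
]

X = TfidfVectorizer().fit_transform(queries)

# n_init=10 restarts K-means from 10 initializations and keeps the best run
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)

for label, query in sorted(zip(labels, queries)):
    print(label, query)
```

For datasets much larger than this one, scikit-learn's `MiniBatchKMeans` exposes the same interface and trades some accuracy for speed.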

Number of Clusters 🔢

Approach: We used a fixed number of clusters (5)

Reason: This was likely based on initial experiments or domain knowledge about expected query categories.

Alternative: We could have used techniques like the elbow method or silhouette analysis to determine the optimal number of clusters algorithmically. But having a predetermined, manageable number of clusters can make the results easier to interpret and act upon, especially if the goal is to identify broad categories of user queries.
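For reference, silhouette analysis (with inertia recorded for an elbow plot) can be sketched as follows. Synthetic 2-D blobs stand in for reduced query vectors so the result is easy to verify; the helper name `pick_k` and the candidate range are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def pick_k(X, candidates, random_state=42):
    """Return the k with the highest silhouette score, plus all scores."""
    silhouettes, inertias = {}, {}
    for k in candidates:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        silhouettes[k] = silhouette_score(X, km.labels_)
        inertias[k] = km.inertia_  # plot against k for the elbow method
    best_k = max(silhouettes, key=silhouettes.get)
    return best_k, silhouettes, inertias


# Three well-separated synthetic groups standing in for query vectors
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
best_k, silhouettes, inertias = pick_k(X, range(2, 7))
print(best_k)
```

On tens of thousands of queries, the `sample_size` argument of `silhouette_score` can keep the pairwise-distance computation manageable.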

Visualization 📈

Techniques: Matplotlib and Seaborn for scatter plots, WordCloud for visualizing frequent terms

Reason: These libraries offer a good balance of customization and ease of use. Scatter plots effectively show cluster distributions, while word clouds provide an intuitive representation of frequent terms in each cluster.

Alternative: Plotly could have been used for interactive visualizations, but static plots are sufficient for our analysis and easier to embed in reports.
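A static cluster scatter plot could be produced along these lines. The random points stand in for a 2-D embedding, and the function name `plot_clusters` and output file name are hypothetical; word clouds are left out because they need the separate `wordcloud` package.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render to files; no display needed
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters(points, labels, path):
    """Save a 2-D scatter plot colored by cluster label."""
    fig, ax = plt.subplots(figsize=(6, 5))
    scatter = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=12)
    ax.legend(*scatter.legend_elements(), title="Cluster")
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    ax.set_title("Query clusters (2-D embedding)")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)


# Random stand-in data; real input would be the t-SNE or PCA coordinates
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2))
labels = rng.integers(0, 3, size=60)

out_path = os.path.join(tempfile.mkdtemp(), "clusters.png")
plot_clusters(points, labels, out_path)
print(out_path)
```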

Language Handling 🌐

Approach: We maintained language information alongside queries and analyzed language distribution within clusters.

Reason: This allows us to understand how queries differ across languages and identify language-specific trends.

Alternative: We could have created separate models for each language, but our approach allows for cross-language analysis of similar topics.
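Given a cluster label per query, the per-cluster language breakdown is a simple cross-tabulation. In this sketch the `cluster` column name and the six sample rows are assumptions; the language codes follow the dataset ('en', 'zh', 'kr', 'ja').

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Language": ["en", "en", "zh", "kr", "en", "zh"],
        "cluster": [0, 0, 0, 1, 1, 1],  # hypothetical K-means labels
    }
)

# Absolute counts: one row per cluster, one column per language
counts = pd.crosstab(df["cluster"], df["Language"])

# Row-normalized shares: each cluster's rows sum to 1
shares = pd.crosstab(df["cluster"], df["Language"], normalize="index")

# Dictionary form, dropping languages that do not occur in a cluster
as_dicts = {c: row[row > 0].to_dict() for c, row in counts.iterrows()}

print(counts)
print(as_dicts)
```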

(Outputs and cluster analysis are explained in the shared Colab notebook 👆)

🧠 Insights about what IQ GPT users are looking for

Price and Market Information

Users are heavily interested in cryptocurrency prices, especially for major coins like Bitcoin and Ethereum.
There's a strong demand for current prices, price predictions, and price comparisons across different exchanges.
Market cap information and price trends (e.g., greatest increases) are frequently queried.

Focus areas:

Ensure real-time, accurate price data across multiple cryptocurrencies.
Develop more sophisticated price analysis and prediction tools.
Implement features for easy price comparisons across exchanges.

Bitcoin-Specific Information

There's a particularly high interest in Bitcoin, warranting special attention.
Users ask about Bitcoin price, market cap, trading volumes, and future trends.

Focus areas:

Create a dedicated Bitcoin information section with comprehensive, up-to-date data.
Develop Bitcoin-specific analysis tools and insights. [We can make tools for IQ token as well]

Cryptocurrency Fundamentals

Many queries are about the basics of specific cryptocurrencies, their creators, and underlying technology.
Cardano, in particular, seems to generate a lot of interest regarding its creation and founder. (This may be because it is the first recommended query in IQ GPT; I also saw devs using this query for tests 😅)

Focus areas:

Expand and regularly update the knowledge base on various cryptocurrencies, especially emerging ones.
Provide clear, concise information about cryptocurrency founders and creation stories.

Diverse Crypto Topics

Users ask about a wide range of topics, including blockchain technology, tokens, smart contracts, and crypto applications.

Focus areas:

Maintain a broad knowledge base covering various aspects of cryptocurrency and blockchain technology.
Stay updated on new developments, projects, and trends in the crypto space.

Multilingual Support

While English dominates, there's significant interest from Korean and Chinese speakers, with some Japanese queries as well.

Focus areas:

Enhance multilingual support, especially for Korean and Chinese languages.
Consider creating language-specific resources for non-English users.

IQ GPT Tool Usage

Users are interested in learning how to effectively use IQ GPT's features and tools.

Focus areas:

Improve user documentation and provide interactive tutorials on using IQ GPT effectively.
Continuously refine the user interface to make it more intuitive.

Types of Questions to Expect

1. Price-related:

"What's the current price of [cryptocurrency]?"
"How much has [cryptocurrency] increased in the last year?"
"What's the price difference of Bitcoin between [exchange A] and [exchange B]?"

2. Market analysis:

"Which cryptocurrencies have seen the greatest increase this year?"
"What's the market cap of [cryptocurrency]?"
"What's the trading volume of Bitcoin in the last 24 hours?"

3. Cryptocurrency basics:

"Who created [cryptocurrency]?"
"What is [cryptocurrency] used for?"
"How does [blockchain technology] work?"

4. Investment and trading:

"What are the best cryptocurrencies to invest in right now?"
"How do I start trading cryptocurrencies?"
"What's the forecast for [cryptocurrency] price in the next month?"

5. Technical queries:

"How does a smart contract work?"
"What's the difference between PoW and PoS?"
"How does [specific cryptocurrency] solve scalability issues?"

6. Current events and news:

"What's the latest development in [cryptocurrency project]?"
"How will [recent event] affect the crypto market?"

7. Tool-specific questions:

"How do I use [specific feature] in IQ GPT?"
"Can IQ GPT help me with [specific task]?"
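The question types above lend themselves to a simple rule-based tagger that can complement the unsupervised clusters. The sketch below is illustrative only: the keyword patterns, category names, and first-match-wins ordering are assumptions, not rules taken from the notebook, and they only cover English queries.

```python
import re
from collections import Counter

# Ordered rules: the first matching pattern decides the category
CATEGORY_RULES = [
    ("tool_usage", re.compile(r"\biq\s*gpt\b")),
    ("price", re.compile(r"\b(price|cost|worth|how much)\b")),
    ("market_analysis", re.compile(r"\b(market cap|volume|tvl|greatest increase)\b")),
    ("investment", re.compile(r"\b(invest|trade|trading|forecast|buy|sell)\b")),
    ("technical", re.compile(r"\b(smart contract|pow|pos|scalability|consensus)\b")),
    ("basics", re.compile(r"\b(who created|founder|what is|used for)\b")),
]


def categorize(query: str) -> str:
    """Return the first matching category name, or 'other'."""
    q = query.lower()
    for name, pattern in CATEGORY_RULES:
        if pattern.search(q):
            return name
    return "other"


examples = [
    "What's the current price of Bitcoin?",
    "What's the trading volume of Bitcoin in the last 24 hours?",
    "Who created Cardano?",
    "How do I start trading cryptocurrencies?",
    "How does a smart contract work?",
    "How do I use the chart feature in IQ GPT?",
    "hello",
]
distribution = Counter(categorize(q) for q in examples)
print(distribution)
```

Comparing these rule-based tags against the cluster labels is one way to check whether a cluster really corresponds to a single question type.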

💻 colab file

Overview of Clustering Results

The clustering analysis of IQ GPT user queries resulted in 20 distinct clusters, with cluster sizes ranging from 439 to 15,563 queries. This distribution reveals both broad trends and niche interests among users.

Persistent Large Cluster

Cluster 0, containing 15,563 queries, remains significantly larger than the others even after increasing the number of clusters. This suggests:

a) A core set of general-purpose queries that are difficult to separate further.
b) Potential limitations in our clustering approach for certain types of queries.
c) A need for more sophisticated natural language processing techniques to differentiate these queries.

Language Distribution

English dominates across all clusters, indicating it's the primary language of IQ GPT users. Chinese (zh) and Korean (kr) appear consistently across clusters, suggesting a significant user base for these languages. Japanese (ja) appears in smaller numbers, primarily in the largest cluster.

Types of Queries to Expect

Based on the clustering results, we can expect the following types of queries:

a) Price and Market Information (Clusters 0, 2, 7, 11)

Current prices of cryptocurrencies (especially Bitcoin, Ethereum, IQ token)
Price comparisons between exchanges (e.g., Upbit, Binance)
Historical price data and averages

b) Cryptocurrency-Specific Queries (Clusters 3, 8, 13, 17)

Information about specific cryptocurrencies (e.g., Cardano, Frax, Ethereum)
Creation and founders of cryptocurrencies
Technical aspects of blockchain and specific crypto projects

c) Market Analysis and Trends (Clusters 1, 6, 9, 15)

Crypto market trends and analysis
Market capitalization information
Lists of top-performing or trending coins

d) IQ Token and Platform-Specific Queries (Clusters 2, 5, 12, 19)

Information about the IQ token and its price
Queries about IQ Wiki and IQ GPT functionalities
Questions about meme coins and stablecoins

e) Technical and Educational Queries (Clusters 16, 17)

Blockchain technology questions
DeFi-related queries
General cryptocurrency knowledge and use cases

f) Meta-Queries and Tool Usage (Cluster 14)

Questions about how to use IQ GPT and its tools
Queries related to the functioning of the AI system

g) Personality and Community Queries (Cluster 18)

Questions about crypto personalities (e.g., Sam Kazemian)
Community-related queries (e.g., bans, press)

h) General Conversation and Miscellaneous (Clusters 4, 10)

General conversation starters (e.g., "hello")
Queries in other languages (Spanish detected in Cluster 4)
Miscellaneous topics not directly related to crypto

Analysis of User Behavior

a) Price-Centric: A significant portion of users are primarily interested in price information, suggesting many use IQ GPT for quick price checks and market monitoring.

b) Educational Use: The presence of clusters focused on blockchain basics and specific cryptocurrencies indicates that users rely on IQ GPT for learning and understanding the crypto space.

c) Investment Focus: Clusters related to market trends, top-performing coins, and price increases suggest users are seeking investment-related information.

d) Platform Engagement: Numerous queries about the IQ token and platform features show active engagement with the IQ ecosystem.

e) Multilingual User Base: While English dominates, the consistent presence of Chinese and Korean queries across clusters indicates a significant international user base.

f) Real-Time Information Seeking: Many queries focus on current prices and recent market movements, indicating users value IQ GPT for up-to-date information.

g) Diverse Interests: The range of clusters shows that while price and market info dominate, users have diverse interests within the crypto space, from technical aspects to community news.

thanks for the analysis. I think we can close it 👍🏻

Total queries: 39,931

[Screenshot]

We separated out the 15,563 queries (about 39% of the total) in the large cluster that were difficult to separate further, and clustered them again on their own with 7 clusters.

[Screenshot]

Cluster 0:
Top terms: ['rdrop', 'soquest_chatgpt_bot', 'rdrop soquest_chatgpt_bot', 'buy', 'tell', 'cesar', 'rodriguez', 'cesar rodriguez', 'btc', '아이큐']
Language distribution: {'en': 1211, 'kr': 98, 'zh': 83, 'ja': 1}
Number of queries: 1393

Cluster 1:
Top terms: ['viver', 'twamm', 'start', 'ai', '알려줘', 'nft', 'yuga', 'finance', 'labs', '코인']
Language distribution: {'en': 1918, 'kr': 179, 'zh': 123, 'ja': 3}
Number of queries: 2223

Cluster 2:
Top terms: ['什么是frax', 'volume', 'iqwikibot', 'hi', 'binance', 'trading', 'tvl', 'protocol', 'trading volume', 'orbs']
Language distribution: {'en': 5248, 'zh': 905, 'kr': 150, 'ja': 1}
Number of queries: 6304

Cluster 3:
Top terms: ['wiki', 'title', 'wiki title', 'tell', 'information', 'information wiki', 'generate', 'additional information', 'additional', 'generate additional']
Language distribution: {'en': 1884, 'zh': 206, 'kr': 131, 'ja': 2}
Number of queries: 2223

Cluster 4:
Top terms: ['hello', 'tell', '比特币的价格', '创建一个今年价格上涨幅度最大且市值超过1亿美元的代币列表', 'cesar', 'cesar rodriguez', 'rodriguez', '비트코인', '아이큐', 'chain']
Language distribution: {'en': 930, 'zh': 82, 'kr': 46, 'ja': 3}
Number of queries: 1061

Cluster 5:
Top terms: ['创建一个今年价格上涨幅度最大且市值超过1亿美元的代币列表', 'cesar', 'cesar rodriguez', 'rodriguez', 'fxs最高多少美元', '얼마입니까', '알려주세요', '가격은', '가격은 얼마입니까', '是由谁创立的']
Language distribution: {'en': 48, 'zh': 45, 'kr': 43}
Number of queries: 136

Cluster 6:
Top terms: ['price', 'btc', 'price xrp', 'xrp', 'yesterday', 'xrp yesterday', 'price btc', 'year', 'btc price', 'price year']
Language distribution: {'en': 2025, 'zh': 149, 'kr': 44, 'ja': 5}
Number of queries: 2223

Key Findings:

Dominant Clusters and User Intent:

Cluster 2: The largest sub-cluster, with 6,304 of the 15,563 re-clustered queries (roughly 40%), is heavily focused on detailed information about cryptocurrency trading, particularly on platforms like Binance. Users frequently ask about trading volumes, protocols, and specific tokens (e.g., "Frax"). The dominance of English and Chinese in this cluster indicates a global interest in trading data and protocol specifics.
Cluster 1: This cluster is primarily concerned with emerging crypto finance topics, such as TWAMM (Time-Weighted Average Market Maker), NFTs, and specific projects like Yuga Labs. The presence of Korean queries suggests a significant interest from the Korean crypto community, alongside a strong English-speaking user base.

Language Distribution and Localization Needs:

Across all clusters, English remains the predominant language, reflecting a broad, global user base. However, there is a notable presence of non-English queries, particularly in Korean, Chinese, and to a lesser extent, Japanese.
Cluster 0 and Cluster 4 show a higher concentration of Korean and Chinese queries related to specific terms like "아이큐" (IQ in Korean) and "比特币" (Bitcoin in Chinese). This suggests a need for more localized content and possibly language-specific features to cater to these user groups.

Topic-Specific Clusters:

Cluster 3 reveals a focus on seeking detailed, wiki-style information. Users are asking IQ GPT to generate and provide additional information about specific crypto-related topics. This indicates a demand for deeper, encyclopedic content, possibly for educational or research purposes.
Cluster 6 is centered on price inquiries, particularly concerning major cryptocurrencies like Bitcoin (BTC) and XRP. Users are interested in historical prices and trends, which could indicate a need for enhanced predictive tools and real-time data feeds on the platform.

Smaller, Specialized Clusters:

Cluster 5 is notably smaller but significant, with users asking very specific questions, often in a mix of languages. For instance, queries about price lists for high-market-cap tokens and the founding information of certain tokens are frequent. This cluster’s multilingual nature suggests a user base that is seeking highly specialized information that might not be widely available elsewhere.

EveripediaNetwork / issues

Develop Query Analysis and Clustering Script for IQ GPT User Insights #3004
