EveripediaNetwork / issues


Develop Query Analysis and Clustering Script for IQ GPT User Insights #3004

Closed Softdev1 closed 2 months ago

Softdev1 commented 3 months ago

Description:

Create a Python script or Jupyter notebook to analyze and categorize user queries to IQ GPT, providing insights into the most requested topics and types of questions.

Requirements:

  1. Implement clustering algorithms to group similar queries
  2. Analyze query patterns to identify most common topics
  3. Categorize queries based on their type (e.g., price inquiries, technical questions, general information)
  4. Generate visualizations to represent the clustered data
  5. Provide summary statistics on query distribution

Technical Details:

  1. Use Python for implementation
  2. Utilize appropriate clustering tools and libraries (e.g., scikit-learn, NLTK)
  3. Ensure the script can handle large datasets efficiently
  4. Include data preprocessing steps for cleaning and normalizing queries

Expected Outputs:

  1. A Python script or Jupyter notebook with well-commented code
  2. Visualizations of clustered data (e.g., dendrograms, scatter plots)
  3. Summary report of findings

Yadheedhya06 commented 3 months ago

Colab - https://colab.research.google.com/drive/13eN_aJPfErCfFnNkTxdWaGii90Gkp1Nx?usp=sharing

Yadheedhya06 commented 3 months ago

DATASET EDA 📊

1. Dataset Overview

2. Data Types:

3. Missing Values:

4. Language Distribution:

5. Query Characteristics:

6. Query Length Distribution:

7. Sample Data: We saw a few sample queries, which were primarily questions about cryptocurrency topics, such as:

8. Data Quality:

Yadheedhya06 commented 3 months ago

IMPLEMENTATION DETAILS 🛠️

We followed an artifact-creation approach: the analysis and visualization are implemented as custom Python functions wrapped in a modular script. This allows for easy reuse, modification, and integration into larger systems or notebooks. We could have used a pipeline architecture or an object-oriented design, but the functional approach is straightforward and flexible enough for this analysis.

Data Loading and Preprocessing ✨

Technique: Pandas for data loading, text preprocessing using Python string methods and regular expressions.

Reason: Pandas is efficient for handling structured data like CSV files. For text preprocessing, built-in Python methods are fast and flexible for operations like lowercasing and removing special characters.

Alternative: We could have used databases like SQLite for data loading, but Pandas is more suitable for in-memory processing of moderately sized datasets.
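
A minimal sketch of the cleaning step described above, assuming queries arrive as plain strings. The function name `clean_query` and the exact rules (lowercase, replace punctuation with spaces, collapse whitespace) are illustrative assumptions, not taken from the notebook.

```python
import re

# Assumption: punctuation is replaced by a space rather than deleted,
# so "what's" becomes "what s" instead of "whats".
_NON_WORD = re.compile(r"[^\w\s]")  # \w is Unicode-aware in Python 3
_WHITESPACE = re.compile(r"\s+")


def clean_query(text: str) -> str:
    """Lowercase a query, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = _NON_WORD.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    for sample in ["  What's the PRICE of BTC?? ", "비트코인 가격은?"]:
        print(repr(clean_query(sample)))
```

Because `\w` matches Unicode letters, Korean and Chinese characters survive the cleaning. With Pandas, the function could be applied as `df["query"].map(clean_query)`, where `query` is a hypothetical column name.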

Text Vectorization 🔤

Technique: TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer

Reason: TF-IDF is excellent for converting text data into numerical features. It captures the importance of words in documents relative to the entire corpus, which is crucial for understanding the significance of terms in queries.

Alternative: We could have used simpler methods like CountVectorizer or more complex ones like Word2Vec. However, TF-IDF provides a good balance between simplicity and effectiveness, especially for short texts like queries.

Dimensionality Reduction 📉

Techniques: PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding)

Reason:

Alternative: UMAP (Uniform Manifold Approximation and Projection) could have been used instead of t-SNE. While potentially faster, we chose t-SNE for its established reputation in visualizing clusters in text data.
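
A common way to combine the two techniques, assumed here rather than confirmed by the notebook, is to run PCA first to shrink the feature space and then t-SNE to get 2-D coordinates for plotting. The random matrix below is a stand-in for a dense TF-IDF matrix, and the component counts and perplexity are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for a dense TF-IDF matrix: 60 queries x 50 features.
X = rng.random((60, 50))

# Step 1: PCA down to a moderate number of components.
X_pca = PCA(n_components=20, random_state=0).fit_transform(X)

# Step 2: t-SNE down to 2-D. perplexity must be smaller than the
# number of samples.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
coords = tsne.fit_transform(X_pca)
print(coords.shape)
```

For a large sparse TF-IDF matrix, `TruncatedSVD` is the usual substitute for PCA, since older scikit-learn releases require dense input for `PCA`.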

Clustering Algorithm 🧩

Technique: K-means clustering

Reason: K-means is efficient, scalable, and works well with numerical data (our TF-IDF vectors). It's particularly good for finding spherical clusters and is interpretable.

Alternative: We could have used hierarchical clustering or DBSCAN. However, K-means is faster for large datasets and doesn't require distance threshold tuning like DBSCAN.

Number of Clusters 🔢

Approach: We used a fixed number of clusters (5)

Reason: This was likely based on initial experiments or domain knowledge about expected query categories.

Alternative: We could have used techniques like the elbow method or silhouette analysis to determine the optimal number of clusters algorithmically. But having a predetermined, manageable number of clusters can make the results easier to interpret and act upon, especially if the goal is to identify broad categories of user queries.
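
The silhouette analysis mentioned above can be sketched as follows. The data is synthetic (three well-separated blobs from `make_blobs`), and the candidate range of k is an assumption; on real TF-IDF vectors the scores are typically much lower and less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

For the elbow method, the same loop would record `inertia_` from each fitted model instead of the silhouette score.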

Visualization 📈

Techniques: Matplotlib and Seaborn for scatter plots, WordCloud for visualizing frequent terms

Reason: These libraries offer a good balance of customization and ease of use. Scatter plots effectively show cluster distributions, while word clouds provide an intuitive representation of frequent terms in each cluster.

Alternative: Plotly could have been used for interactive visualizations, but static plots are sufficient for our analysis and easier to embed in reports.
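
A minimal static scatter plot of clustered points with Matplotlib. The coordinates and labels are random stand-ins for the 2-D t-SNE output and K-means labels, and the output file name is hypothetical.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for 2-D t-SNE coordinates and K-means labels.
coords = rng.normal(size=(200, 2))
labels = rng.integers(0, 5, size=200)

fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=12)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.set_title("Query clusters")
ax.legend(*scatter.legend_elements(), title="Cluster", loc="best")

out_path = Path("clusters.png")
fig.savefig(out_path, dpi=150)
```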

Language Handling 🌐

Approach: We maintained language information alongside queries and analyzed language distribution within clusters.

Reason: This allows us to understand how queries differ across languages and identify language-specific trends.

Alternative: We could have created separate models for each language, but our approach allows for cross-language analysis of similar topics.
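
Keeping the language code next to each query makes the per-cluster language breakdown a simple counting exercise. The sketch below uses hypothetical `(cluster_label, language_code)` pairs and only the standard library.

```python
from collections import Counter, defaultdict

# Hypothetical (cluster_label, language_code) pairs, one per query.
records = [
    (0, "en"), (0, "en"), (0, "kr"),
    (1, "en"), (1, "zh"), (1, "zh"),
]

lang_by_cluster = defaultdict(Counter)
for cluster_id, lang in records:
    lang_by_cluster[cluster_id][lang] += 1

for cluster_id in sorted(lang_by_cluster):
    counts = lang_by_cluster[cluster_id]
    print(f"Cluster {cluster_id}:")
    print(f"Language distribution: {dict(counts.most_common())}")
    print(f"Number of queries: {sum(counts.values())}")
```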

Yadheedhya06 commented 3 months ago

(The outputs and cluster analysis are explained in the Colab notebook shared above 👆)

🧠 Insights about what IQ GPT users are looking for

Price and Market Information

Focus areas:

  1. Ensure real-time, accurate price data across multiple cryptocurrencies.
  2. Develop more sophisticated price analysis and prediction tools.
  3. Implement features for easy price comparisons across exchanges.

Bitcoin-Specific Information

Focus areas:

  1. Create a dedicated Bitcoin information section with comprehensive, up-to-date data.
  2. Develop Bitcoin-specific analysis tools and insights. [We can make tools for IQ token as well]

Cryptocurrency Fundamentals

Focus areas:

  1. Expand and regularly update the knowledge base on various cryptocurrencies, especially emerging ones.
  2. Provide clear, concise information about cryptocurrency founders and creation stories.

Diverse Crypto Topics

Focus areas:

  1. Maintain a broad knowledge base covering various aspects of cryptocurrency and blockchain technology.
  2. Stay updated on new developments, projects, and trends in the crypto space.

Multilingual Support

Focus areas:

  1. Enhance multilingual support, especially for Korean and Chinese languages.
  2. Consider creating language-specific resources for non-English users.

IQ GPT Tool Usage

Focus areas:

  1. Improve user documentation and provide interactive tutorials on using IQ GPT effectively.
  2. Continuously refine the user interface to make it more intuitive.

Types of Questions to Expect

1. Price-related:

"What's the current price of [cryptocurrency]?" "How much has [cryptocurrency] increased in the last year?" "What's the price difference of Bitcoin between [exchange A] and [exchange B]?"

2. Market analysis:

"Which cryptocurrencies have seen the greatest increase this year?" "What's the market cap of [cryptocurrency]?" "What's the trading volume of Bitcoin in the last 24 hours?"

3. Cryptocurrency basics:

"Who created [cryptocurrency]?" "What is [cryptocurrency] used for?" "How does [blockchain technology] work?"

4. Investment and trading:

"What are the best cryptocurrencies to invest in right now?" "How do I start trading cryptocurrencies?" "What's the forecast for [cryptocurrency] price in the next month?"

5. Technical queries:

"How does a smart contract work?" "What's the difference between PoW and PoS?" "How does [specific cryptocurrency] solve scalability issues?"

6. Current events and news:

"What's the latest development in [cryptocurrency project]?" "How will [recent event] affect the crypto market?"

7. Tool-specific questions:

"How do I use [specific feature] in IQ GPT?" "Can IQ GPT help me with [specific task]?"

Yadheedhya06 commented 2 months ago

💻 colab file

Overview of Clustering Results

The clustering analysis of IQ GPT user queries resulted in 20 distinct clusters, with cluster sizes ranging from 439 to 15,563 queries. This distribution reveals both broad trends and niche interests among users.

Persistent Large Cluster

Cluster 0, containing 15,563 queries, remains significantly larger than the others despite increasing the number of clusters. This suggests:

a) A core set of general-purpose queries that are difficult to separate further.
b) Potential limitations in our clustering approach for certain types of queries.
c) A need for more sophisticated natural language processing techniques to differentiate these queries.

Language Distribution

English dominates across all clusters, indicating it's the primary language of IQ GPT users. Chinese (zh) and Korean (kr) appear consistently across clusters, suggesting a significant user base for these languages. Japanese (ja) appears in smaller numbers, primarily in the largest cluster.

Types of Queries to Expect

Based on the clustering results, we can expect the following types of queries:

a) Price and Market Information (Clusters 0, 2, 7, 11)

  1. Current prices of cryptocurrencies (especially Bitcoin, Ethereum, IQ token)
  2. Price comparisons between exchanges (e.g., Upbit, Binance)
  3. Historical price data and averages

b) Cryptocurrency-Specific Queries (Clusters 3, 8, 13, 17)

  1. Information about specific cryptocurrencies (e.g., Cardano, Frax, Ethereum)
  2. Creation and founders of cryptocurrencies
  3. Technical aspects of blockchain and specific crypto projects

c) Market Analysis and Trends (Clusters 1, 6, 9, 15)

  1. Crypto market trends and analysis
  2. Market capitalization information
  3. Lists of top-performing or trending coins

d) IQ Token and Platform-Specific Queries (Clusters 2, 5, 12, 19)

  1. Information about the IQ token and its price
  2. Queries about IQ Wiki and IQ GPT functionalities
  3. Questions about meme coins and stablecoins

e) Technical and Educational Queries (Clusters 16, 17)

  1. Blockchain technology questions
  2. DeFi-related queries
  3. General cryptocurrency knowledge and use cases

f) Meta-Queries and Tool Usage (Cluster 14)

  1. Questions about how to use IQ GPT and its tools
  2. Queries related to the functioning of the AI system

g) Personality and Community Queries (Cluster 18)

  1. Questions about crypto personalities (e.g., Sam Kazemian)
  2. Community-related queries (e.g., bans, press)

h) General Conversation and Miscellaneous (Clusters 4, 10)

  1. General conversation starters (e.g., "hello")
  2. Queries in other languages (Spanish detected in Cluster 4)
  3. Miscellaneous topics not directly related to crypto
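
As a complement to clustering, query types like those listed above could also be assigned with simple keyword rules. This is a hypothetical sketch: the category names, the keyword patterns, and the first-match-wins ordering are all assumptions, not part of the notebook.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("tool_usage", re.compile(r"\biq gpt\b")),
    ("price", re.compile(r"\b(price|cost|worth|how much)\b")),
    ("market_analysis", re.compile(r"\b(market cap|trading volume|volume)\b")),
    ("investment", re.compile(r"\b(invest|trading|forecast)\b")),
    ("technical", re.compile(r"\b(smart contract|pow|pos|scalability)\b")),
    ("basics", re.compile(r"\b(who created|used for|how does)\b")),
]


def categorize(query: str) -> str:
    """Return the first category whose pattern matches, else 'other'."""
    q = query.lower()
    for name, pattern in RULES:
        if pattern.search(q):
            return name
    return "other"


if __name__ == "__main__":
    for q in ["What's the current price of Bitcoin?", "Who created Cardano?", "hello"]:
        print(q, "->", categorize(q))
```

Rules like these are brittle across languages, so they work best as a sanity check on cluster labels rather than as a replacement for them.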

Analysis of User Behavior

a) Price-Centric: A significant portion of users are primarily interested in price information, suggesting many use IQ GPT for quick price checks and market monitoring.

b) Educational Use: The presence of clusters focused on blockchain basics and specific cryptocurrencies indicates that users rely on IQ GPT for learning and understanding the crypto space.

c) Investment Focus: Clusters related to market trends, top-performing coins, and price increases suggest users are seeking investment-related information.

d) Platform Engagement: Numerous queries about the IQ token and platform features show active engagement with the IQ ecosystem.

e) Multilingual User Base: While English dominates, the consistent presence of Chinese and Korean queries across clusters indicates a significant international user base.

f) Real-Time Information Seeking: Many queries focus on current prices and recent market movements, indicating users value IQ GPT for up-to-date information.

g) Diverse Interests: The range of clusters shows that while price and market info dominate, users have diverse interests within the crypto space, from technical aspects to community news.

kesar commented 2 months ago

thanks for the analysis. I think we can close it 👍🏻

Yadheedhya06 commented 2 months ago

Total queries: 39,931

[Screenshot, 2024-08-14 4:17 PM]

Yadheedhya06 commented 2 months ago

We took the 15,563 queries (about 39% of the 39,931 total) that were difficult to separate further and clustered them separately into 7 clusters.

[Screenshot, 2024-08-14 5:00 PM]

Cluster 0:
Top terms: ['rdrop', 'soquest_chatgpt_bot', 'rdrop soquest_chatgpt_bot', 'buy', 'tell', 'cesar', 'rodriguez', 'cesar rodriguez', 'btc', '아이큐']
Language distribution: {'en': 1211, 'kr': 98, 'zh': 83, 'ja': 1}
Number of queries: 1393

Cluster 1:
Top terms: ['viver', 'twamm', 'start', 'ai', '알려줘', 'nft', 'yuga', 'finance', 'labs', '코인']
Language distribution: {'en': 1918, 'kr': 179, 'zh': 123, 'ja': 3}
Number of queries: 2223

Cluster 2:
Top terms: ['什么是frax', 'volume', 'iqwikibot', 'hi', 'binance', 'trading', 'tvl', 'protocol', 'trading volume', 'orbs']
Language distribution: {'en': 5248, 'zh': 905, 'kr': 150, 'ja': 1}
Number of queries: 6304

Cluster 3:
Top terms: ['wiki', 'title', 'wiki title', 'tell', 'information', 'information wiki', 'generate', 'additional information', 'additional', 'generate additional']
Language distribution: {'en': 1884, 'zh': 206, 'kr': 131, 'ja': 2}
Number of queries: 2223

Cluster 4:
Top terms: ['hello', 'tell', '比特币的价格', '创建一个今年价格上涨幅度最大且市值超过1亿美元的代币列表', 'cesar', 'cesar rodriguez', 'rodriguez', '비트코인', '아이큐', 'chain']
Language distribution: {'en': 930, 'zh': 82, 'kr': 46, 'ja': 3}
Number of queries: 1061

Cluster 5:
Top terms: ['创建一个今年价格上涨幅度最大且市值超过1亿美元的代币列表', 'cesar', 'cesar rodriguez', 'rodriguez', 'fxs最高多少美元', '얼마입니까', '알려주세요', '가격은', '가격은 얼마입니까', '是由谁创立的']
Language distribution: {'en': 48, 'zh': 45, 'kr': 43}
Number of queries: 136

Cluster 6:
Top terms: ['price', 'btc', 'price xrp', 'xrp', 'yesterday', 'xrp yesterday', 'price btc', 'year', 'btc price', 'price year']
Language distribution: {'en': 2025, 'zh': 149, 'kr': 44, 'ja': 5}
Number of queries: 2223
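
The two-pass approach described above (cluster everything, then re-cluster only the largest cluster) can be sketched as follows. The random matrix stands in for the real feature matrix, and the first-pass cluster count is an assumption; only the 7 sub-clusters come from the thread.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((500, 20))  # stand-in for the TF-IDF feature matrix

# First pass: cluster all queries.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Find the largest cluster and pull out its rows.
largest = np.bincount(labels).argmax()
mask = labels == largest
X_sub = X[mask]

# Second pass: cluster only that subset into 7 sub-clusters.
sub_labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X_sub)

share = mask.sum() / len(labels)
print(f"largest cluster: {largest}, share of all queries: {share:.0%}")
print(np.bincount(sub_labels))
```

Refitting the vectorizer on the subset before the second pass is another option, since term weights computed on the full corpus may not separate these queries well.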

Key Findings:

Dominant Clusters and User Intent:

Language Distribution and Localization Needs:

Topic-Specific Clusters:

Smaller, Specialized Clusters: