Closed Softdev1 closed 2 months ago
1. Dataset Overview
2. Data Types:
3. Missing Values:
4. Language Distribution:
5. Query Characteristics:
6. Query Length Distribution:
7. Sample Data: We saw a few sample queries, which were primarily questions about cryptocurrency topics, such as:
8. Data Quality:
We followed an artifact-creation approach: the analysis and visualization are implemented as custom Python functions wrapped in a modular script. This makes the code easy to reuse, modify, and integrate into larger systems or notebooks. We could have used a pipeline architecture or an object-oriented design, but the functional approach is straightforward and flexible enough for this analysis.
Technique: Pandas for data loading, text preprocessing using Python string methods and regular expressions.
Reason: Pandas is efficient for handling structured data like CSV files. For text preprocessing, built-in Python methods are fast and flexible for operations like lowercasing and removing special characters.
Alternative: We could have used databases like SQLite for data loading, but Pandas is more suitable for in-memory processing of moderately sized datasets.
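As a rough illustration of the preprocessing step described above, the sketch below lowercases a query and strips special characters with a regular expression. The function name `preprocess` and the exact regex are assumptions, not the notebook's actual code. The pattern keeps Unicode word characters so that Chinese and Korean queries survive, and it keeps underscores, which matches terms such as `soquest_chatgpt_bot` in the cluster output.

```python
import re

# Anything that is neither a word character nor whitespace counts as "special".
# In Python 3, \w is Unicode-aware, so CJK and Hangul characters are preserved.
_SPECIAL = re.compile(r"[^\w\s]")
_WHITESPACE = re.compile(r"\s+")


def preprocess(query: str) -> str:
    """Lowercase a query, drop special characters, and collapse whitespace."""
    lowered = query.lower()
    without_special = _SPECIAL.sub(" ", lowered)
    return _WHITESPACE.sub(" ", without_special).strip()


# Hypothetical sample queries, for illustration only.
samples = ["What's the price of BTC?!", "比特币的价格?", "rdrop @SoQuest_ChatGPT_Bot"]
cleaned = [preprocess(q) for q in samples]
```

In a pandas workflow, a function like this could presumably be applied to the query column with `Series.map`.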
Technique: TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer
Reason: TF-IDF is excellent for converting text data into numerical features. It captures the importance of words in documents relative to the entire corpus, which is crucial for understanding the significance of terms in queries.
Alternative: We could have used simpler methods like CountVectorizer or more complex ones like Word2Vec. However, TF-IDF provides a good balance between simplicity and effectiveness, especially for short texts like queries.
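To make the TF-IDF weighting concrete, here is a dependency-free sketch of what such a vectorizer computes. It is not the notebook's code, which presumably relies on scikit-learn's `TfidfVectorizer`. The smoothed IDF formula and L2 normalization below are assumed to mirror that library's documented defaults. Whitespace tokenization is a simplification.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed formula, see lead-in).
    idf = {t: math.log((1 + n_docs) / (1 + count)) + 1.0 for t, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical queries: "btc" is rarer than "price", so it gets more weight.
vectors = tfidf(["price of btc", "price of xrp", "who created cardano"])
```

Terms that appear in many queries (such as "price") are down-weighted relative to terms that distinguish a query (such as "btc").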
Techniques: PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding)
Reason:
Alternative: UMAP (Uniform Manifold Approximation and Projection) could have been used instead of t-SNE. Although UMAP is potentially faster, we chose t-SNE for its established reputation in visualizing clusters in text data.
Technique: K-means clustering
Reason: K-means is efficient, scalable, and works well with numerical data (our TF-IDF vectors). It's particularly good for finding spherical clusters and is interpretable.
Alternative: We could have used hierarchical clustering or DBSCAN. However, K-means is faster on large datasets and, unlike DBSCAN, does not require tuning a distance threshold.
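For readers unfamiliar with K-means, the following is a minimal sketch of the standard Lloyd iteration on small dense vectors. It is an illustration only: the analysis presumably uses a library implementation on the TF-IDF matrix, and the names `kmeans` and `sq_dist`, the seeding strategy, and the toy points are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Because each point is assigned to its nearest centroid, the method favors compact, roughly spherical clusters, which is the property noted above.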
Approach: We used a fixed number of clusters (5)
Reason: This was likely based on initial experiments or domain knowledge about expected query categories.
Alternative: We could have used techniques like the elbow method or silhouette analysis to determine the optimal number of clusters algorithmically. But having a predetermined, manageable number of clusters can make the results easier to interpret and act upon, especially if the goal is to identify broad categories of user queries.
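The silhouette analysis mentioned as an alternative can be sketched as follows. This is a plain-Python illustration of the mean silhouette score, not code from the notebook. In practice one would presumably call a library routine and compare the score across several candidate values of k. The function name and toy data are assumptions.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette score over all points (higher means better separation)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: the first labeling matches the two groups, the second does not.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])
```

Running the clustering for a range of k values and keeping the k with the highest score would be one way to replace the fixed choice of 5.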
Techniques: Matplotlib and Seaborn for scatter plots, WordCloud for visualizing frequent terms
Reason: These libraries offer a good balance of customization and ease of use. Scatter plots effectively show cluster distributions, while word clouds provide an intuitive representation of frequent terms in each cluster.
Alternative: Plotly could have been used for interactive visualizations, but static plots are sufficient for our analysis and easier to embed in reports.
Approach: We maintained language information alongside queries and analyzed language distribution within clusters.
Reason: This allows us to understand how queries differ across languages and identify language-specific trends.
Alternative: We could have created separate models for each language, but our approach allows for cross-language analysis of similar topics.
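Keeping the language label next to each query makes the per-cluster language breakdown a simple counting exercise. The sketch below shows one way to do it with the standard library. The function name and the toy inputs are hypothetical, and the notebook may well use a pandas `groupby` instead.

```python
from collections import Counter, defaultdict


def language_distribution(cluster_labels, languages):
    """Count how many queries of each language fall into each cluster."""
    dist = defaultdict(Counter)
    for cluster, lang in zip(cluster_labels, languages):
        dist[cluster][lang] += 1
    return {cluster: dict(counts) for cluster, counts in sorted(dist.items())}


# Hypothetical toy data: one cluster label and one language code per query.
labels = [0, 0, 1, 1, 1, 0]
langs = ["en", "kr", "en", "zh", "en", "en"]
result = language_distribution(labels, langs)
```

The output has the same shape as the "Language distribution" dictionaries reported for each cluster.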
(Outputs and cluster analysis are explained in the Colab notebook shared above 👆)
Focus areas:
"What's the current price of [cryptocurrency]?" "How much has [cryptocurrency] increased in the last year?" "What's the price difference of Bitcoin between [exchange A] and [exchange B]?"
"Which cryptocurrencies have seen the greatest increase this year?" "What's the market cap of [cryptocurrency]?" "What's the trading volume of Bitcoin in the last 24 hours?"
"Who created [cryptocurrency]?" "What is [cryptocurrency] used for?" "How does [blockchain technology] work?"
"What are the best cryptocurrencies to invest in right now?" "How do I start trading cryptocurrencies?" "What's the forecast for [cryptocurrency] price in the next month?"
"How does a smart contract work?" "What's the difference between PoW and PoS?" "How does [specific cryptocurrency] solve scalability issues?"
"What's the latest development in [cryptocurrency project]?" "How will [recent event] affect the crypto market?"
"How do I use [specific feature] in IQ GPT?" "Can IQ GPT help me with [specific task]?"
The clustering analysis of IQ GPT user queries resulted in 20 distinct clusters, with cluster sizes ranging from 439 to 15,563 queries. This distribution reveals both broad trends and niche interests among users.
Cluster 0, containing 15,563 queries, remains significantly larger than others despite increasing the number of clusters. This suggests: a) A core set of general-purpose queries that are difficult to separate further. b) Potential limitations in our clustering approach for certain types of queries. c) A need for more sophisticated natural language processing techniques to differentiate these queries.
English dominates across all clusters, indicating it's the primary language of IQ GPT users. Chinese (zh) and Korean (kr) appear consistently across clusters, suggesting a significant user base for these languages. Japanese (ja) appears in smaller numbers, primarily in the largest cluster.
Based on the clustering results, we can expect the following types of queries:
Current prices of cryptocurrencies (especially Bitcoin, Ethereum, IQ token); price comparisons between exchanges (e.g., Upbit, Binance); historical price data and averages
Information about specific cryptocurrencies (e.g., Cardano, Frax, Ethereum); creation and founders of cryptocurrencies; technical aspects of blockchain and specific crypto projects
Crypto market trends and analysis; market capitalization information; lists of top-performing or trending coins
Information about the IQ token and its price; queries about IQ Wiki and IQ GPT functionalities; questions about meme coins and stable coins
Blockchain technology questions; DeFi-related queries; general cryptocurrency knowledge and use cases
Questions about how to use IQ GPT and its tools; queries related to the functioning of the AI system
Questions about crypto personalities (e.g., Sam Kazemian); community-related queries (e.g., bans, press)
General conversation starters (e.g., "hello"); queries in other languages (Spanish detected in Cluster 4); miscellaneous topics not directly related to crypto
a) Price-Centric: A significant portion of users are primarily interested in price information, suggesting many use IQ GPT for quick price checks and market monitoring.
b) Educational Use: The presence of clusters focused on blockchain basics and specific cryptocurrencies indicates that users rely on IQ GPT for learning and understanding the crypto space.
c) Investment Focus: Clusters related to market trends, top-performing coins, and price increases suggest users are seeking investment-related information.
d) Platform Engagement: Numerous queries about the IQ token and platform features show active engagement with the IQ ecosystem.
e) Multilingual User Base: While English dominates, the consistent presence of Chinese and Korean queries across clusters indicates a significant international user base.
f) Real-Time Information Seeking: Many queries focus on current prices and recent market movements, indicating users value IQ GPT for up-to-date information.
g) Diverse Interests: The range of clusters shows that while price and market info dominate, users have diverse interests within the crypto space, from technical aspects to community news.
thanks for the analysis. I think we can close it 👍🏻
Total queries: 39,931
We separated out the 15,563 queries (about 39% of the total) that were difficult to split further and clustered them separately into 7 clusters.
Cluster 0:
Top terms: ['rdrop', 'soquest_chatgpt_bot', 'rdrop soquest_chatgpt_bot', 'buy', 'tell', 'cesar', 'rodriguez', 'cesar rodriguez', 'btc', '아이큐']
Language distribution: {'en': 1211, 'kr': 98, 'zh': 83, 'ja': 1}
Number of queries: 1393
Cluster 1:
Top terms: ['viver', 'twamm', 'start', 'ai', '알려줘', 'nft', 'yuga', 'finance', 'labs', '코인']
Language distribution: {'en': 1918, 'kr': 179, 'zh': 123, 'ja': 3}
Number of queries: 2223
Cluster 2:
Top terms: ['什么是frax', 'volume', 'iqwikibot', 'hi', 'binance', 'trading', 'tvl', 'protocol', 'trading volume', 'orbs']
Language distribution: {'en': 5248, 'zh': 905, 'kr': 150, 'ja': 1}
Number of queries: 6304
Cluster 3:
Top terms: ['wiki', 'title', 'wiki title', 'tell', 'information', 'information wiki', 'generate', 'additional information', 'additional', 'generate additional']
Language distribution: {'en': 1884, 'zh': 206, 'kr': 131, 'ja': 2}
Number of queries: 2223
Cluster 4:
Top terms: ['hello', 'tell', '比特币的价格', '创建一个今年价格上涨幅度最大且市值超过1亿美元的代币列表', 'cesar', 'cesar rodriguez', 'rodriguez', '비트코인', '아이큐', 'chain']
Language distribution: {'en': 930, 'zh': 82, 'kr': 46, 'ja': 3}
Number of queries: 1061
Cluster 5:
Top terms: ['创建一个今年价格上涨幅度最大且市值超过1亿美元的代币列表', 'cesar', 'cesar rodriguez', 'rodriguez', 'fxs最高多少美元', '얼마입니까', '알려주세요', '가격은', '가격은 얼마입니까', '是由谁创立的']
Language distribution: {'en': 48, 'zh': 45, 'kr': 43}
Number of queries: 136
Cluster 6:
Top terms: ['price', 'btc', 'price xrp', 'xrp', 'yesterday', 'xrp yesterday', 'price btc', 'year', 'btc price', 'price year']
Language distribution: {'en': 2025, 'zh': 149, 'kr': 44, 'ja': 5}
Number of queries: 2223
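The shares implied by the counts above can be checked with a few lines of arithmetic. The sketch below only re-uses the numbers reported in this thread. The variable and function names are illustrative.

```python
TOTAL_QUERIES = 39_931

# Language distributions exactly as reported for the 7 sub-clusters.
clusters = {
    0: {"en": 1211, "kr": 98, "zh": 83, "ja": 1},
    1: {"en": 1918, "kr": 179, "zh": 123, "ja": 3},
    2: {"en": 5248, "zh": 905, "kr": 150, "ja": 1},
    3: {"en": 1884, "zh": 206, "kr": 131, "ja": 2},
    4: {"en": 930, "zh": 82, "kr": 46, "ja": 3},
    5: {"en": 48, "zh": 45, "kr": 43},
    6: {"en": 2025, "zh": 149, "kr": 44, "ja": 5},
}


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


subset_size = sum(sum(dist.values()) for dist in clusters.values())

summary = {}
for cid, dist in clusters.items():
    size = sum(dist.values())
    summary[cid] = {
        "size": size,
        "share_of_subset": pct(size, subset_size),
        "non_english": pct(size - dist.get("en", 0), size),
    }

subset_share_of_total = pct(subset_size, TOTAL_QUERIES)
```

The per-cluster sizes sum to the 15,563 re-clustered queries, which is about 39% of all 39,931 queries. Cluster 2 holds roughly 40.5% of that subset, and Cluster 5 is the only sub-cluster where non-English queries form the majority.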
Cluster 2: The largest cluster, with 6,304 queries (about 40% of the 15,563 re-clustered queries), is heavily focused on detailed information about cryptocurrency trading, particularly on platforms like Binance. Users frequently ask about trading volumes, protocols, and specific tokens (e.g., "Frax"). The prominence of English and Chinese in this cluster indicates global interest in trading data and protocol specifics.
Cluster 1: This cluster is primarily concerned with emerging crypto finance topics, such as TWAMM (Time-Weighted Average Market Maker), NFTs, and specific projects like Yuga Labs. The presence of Korean queries suggests a significant interest from the Korean crypto community, alongside a strong English-speaking user base.
Across all clusters, English remains the predominant language, reflecting a broad, global user base. However, there is a notable presence of non-English queries, particularly in Korean, Chinese, and to a lesser extent, Japanese.
Cluster 0 and Cluster 4 feature Korean and Chinese terms among their top terms, such as "아이큐" (IQ in Korean) and "比特币" (Bitcoin in Chinese), even though English still accounts for most of their queries. This suggests a need for more localized content and possibly language-specific features to cater to these user groups.
Cluster 3 reveals a focus on seeking detailed, wiki-style information. Users are asking IQ GPT to generate and provide additional information about specific crypto-related topics. This indicates a demand for deeper, encyclopedic content, possibly for educational or research purposes.
Cluster 6 is centered on price inquiries, particularly concerning major cryptocurrencies like Bitcoin (BTC) and XRP. Users are interested in historical prices and trends, which could indicate a need for enhanced predictive tools and real-time data feeds on the platform.
Description:
Create a Python script or Jupyter notebook to analyze and categorize user queries to IQ GPT, providing insights into the most requested topics and types of questions.
Requirements:
Technical Details:
Expected Outputs: