Sicheng2000 / lab-08

Lab 08: Pattern discovery
Creative Commons Attribution 4.0 International

Lab 08 feedback #2

Open Sicheng2000 opened 3 months ago

Sicheng2000 commented 3 months ago

@francojc

  1. a. Exploratory analysis involves identifying variables potentially linked to the research questions, inspecting the data ahead of analysis, interrogating it with descriptive or unsupervised learning methods to obtain quantitative measures, and interpreting the results to evaluate whether they address the research questions effectively.
     b. count() automatically groups and ungroups variables as needed.
     c. geom_smooth() adds a smoothed trend line; by default it fits a loess or GAM curve, and a linear trend line requires method = "lm".
     d. The stopwords list is used to exclude common words that could otherwise distort the results.
     e. The lemmatize_words() function is effective at treating variant forms of a word as the same word, but it requires a lookup table.
     f. Dimensionality reduction simplifies the features within a dataset, with Principal Component Analysis (PCA) being the most prevalent method.
  2. I find the theory quite understandable; however, the methods section is challenging to grasp because it involves many statistical and mathematical concepts, such as dimensionality reduction and k-means clustering. I needed to look up additional information on these concepts, and even then some of the explanations I found online include formulas that I haven't encountered in my math coursework.
  3. Because I find the material challenging to understand, I mainly rely on information in Chinese. This video, https://www.xiaohongshu.com/explore/651b1ade000000001e0309fc?app_platform=ios&app_version=8.29&author_share=1&share_from_user_hidden=true&type=video&xhsshare=CopyLink&shareRedId=ODc1NUk-Njo2NzUyOTgwNjY1OTk1OjY9&apptime=1712440474, discusses fitting a single line to the data points, starting by calculating the mean of their x and y values. Shifting the data so that the point (mean x, mean y) sits at the origin centers the data. The line that is closest to all the data points best describes the distribution of the data, and the Pythagorean theorem can be used to determine this line. A second line perpendicular to the first can then be found, and the two lines replace the original x and y axes, forming a new coordinate system for the data.
  4. I don't feel there's anything new I want to learn yet because I haven't fully understood the current content. Instead, I found a video aimed at new learners of exploratory data analysis: https://www.bilibili.com/video/BV1xY411x77p/?spm_id_from=333.337.search-card.all.click&vd_source=f94fa1d4f65b4001146f043fbc7e4b2a. It discusses the purpose of exploratory data analysis and the process involved. It also mentions handling missing data in variables related to the research questions by deleting it. Extreme data points are mentioned as well; the video suggests that they may signal abnormalities, especially in small datasets where their deletion has minimal impact. The video also covers reading scatter plots and suggests tools for use in exploratory data analysis.
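The steps described in item 3 (center the data, find the closest line, add a perpendicular line, switch to the new axes) appear to match a two-dimensional principal component analysis. A minimal sketch in plain Python follows, assuming that reading is right. The function name `pca_2d` and the sample points are made up for illustration and do not come from the lab or the video. The closest line is found here with a closed-form angle formula rather than whatever procedure the video uses.

```python
import math


def pca_2d(points):
    """Sketch of 2-D PCA: returns (center, first_axis, second_axis, rotated)."""
    n = len(points)

    # Step 1: the mean of x and the mean of y give the central point.
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Step 2: shift the data so the central point sits at the origin.
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Step 3: find the line through the origin that is closest to all points.
    # By the Pythagorean theorem, minimizing the squared perpendicular
    # distances is the same as maximizing the spread along the line.
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    first_axis = (math.cos(theta), math.sin(theta))

    # Step 4: the second line is perpendicular to the first.
    second_axis = (-math.sin(theta), math.cos(theta))

    # Step 5: express every centered point in the new coordinate system.
    rotated = [
        (x * first_axis[0] + y * first_axis[1],
         x * second_axis[0] + y * second_axis[1])
        for x, y in centered
    ]
    return (mean_x, mean_y), first_axis, second_axis, rotated


# Hypothetical sample data lying exactly on the line y = 2x + 1.
sample = [(0, 1), (1, 3), (2, 5), (3, 7)]
center, first_axis, second_axis, rotated = pca_2d(sample)
```

Because the sample points lie on one line, the first axis follows that line and every point's coordinate on the second axis is zero up to rounding, which is the sense in which the first line "best describes" the data.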
francojc commented 3 months ago

Great job. Yes, the math behind these methods and algorithms can be daunting, especially in formula notation. The most important thing at this point, however, is to have a basic idea of their uses. If you continue to work with these methods, then a deeper dive into the inner workings will likely prove useful.