Sicheng2000 / lab-06

Lab 06: Taming data
Creative Commons Attribution 4.0 International

Lab 06 feedback #1

Open Sicheng2000 opened 7 months ago

Sicheng2000 commented 7 months ago
  1. The methods of data curation vary depending on whether the data is structured, semi-structured, or unstructured. Checking the data first is crucial because it determines our subsequent actions, such as how to tidy the data and where to store the derived data. Documentation plays a vital role because it lets us, and others, reconstruct the curation process later.
  2. Dealing with unstructured data presents several challenges, particularly when it comes to extracting relevant variables from plain text. Initially, I attempted to use the readLines() function and the stringr package with regular expressions to extract the desired words from my Chinese translations and the English source text. However, I later discovered that my files were Word documents, which required the readtext() function instead. I also realized that adjustments were necessary so that the computer could read my Chinese translations correctly. Extracting variables that are only implicit in the plain text was a further challenge. With semi-structured data, separating annotations from the text can be particularly difficult. If the pattern I set does not match everything, it is hard to verify that the separation is accurate, and checking line by line is time-consuming when there are many observations. With structured data, it is crucial to identify columns that are redundant or carry no useful information, and to reorganize the data more clearly so that it is easier to use.
  3. For lab-06, I primarily relied on the resources available at https://qtalr.github.io/book/part_3/6_curate.html and https://qtalr.github.io/qtalrkit/articles/recipe-6.html. These materials provided enough information for me to understand 2_curate_key.qmd. However, after completing my assignment, I was unable to push my revised version to the GitHub repository. After consulting resources such as https://stackoverflow.com/questions/20370294/git-could-not-resolve-host-github-com-error-while-cloning-remote-repository-in, I tried several fixes, including unsetting https.proxy, resetting the URL of my origin remote, and changing my access token. Unfortunately, none of these worked. As a result, I am seeking further guidance from Professor @francojc.
  4. I'm interested in learning more about handling unstructured data. In the current class, we have worked with two text files in which the source-language content aligns with the target-language content line by line. However, what if the source and target texts are not parallel, and the sentences are not separated? I'm aware that the readLines() function can be used to read line-aligned text, but in my own experience, additional steps are needed to obtain "real" parallel text.
francojc commented 7 months ago

@Sicheng2000 Great self-assessment. It is very detailed and demonstrates that you have really put some thought into this lab.

I hope that you can resolve the issue that you are having pushing this repo to GitHub as I would really like to take a look at your code and make specific comments and suggestions there.

Let me know if you need more help on this. If need be, set up an appointment with me.