What did you learn?
Data selection should align with research questions and accessibility considerations.
Data can be acquired through direct download, programmatic download, API access (for public resources), or web scraping.
For manual downloads, provide details such as data origin, filter criteria, data location, data format, and download link.
To streamline repetitive tasks in long-term projects, consider creating functions and storing them in separate documents.
Utilize the create_data_origin() function in the qtalrkit package to generate a CSV file summarizing data attributes.
Programmatically download and extract files using the download.file(url, destfile), tempfile(), and untar(tarfile, exdir) functions, eliminating the need to keep archive files in the project.
The fs package facilitates assigning directories for file storage.
Incorporate control statements to prevent duplicate downloads.
Explore resources such as Wordbank (wordbankr) for child language corpora and TalkBank (TBDBr) for spoken language corpora.
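The download-extract workflow above can be sketched in R roughly as follows; the archive URL and the target directory are placeholders for illustration, not from the lesson itself.

```r
# Sketch of a guarded programmatic download, assuming a hypothetical
# archive URL and a local target directory.
library(fs)

acquire_archive <- function(url, target_dir) {
  # Control statement: skip the download if the data already exist
  if (dir_exists(target_dir) && length(dir_ls(target_dir)) > 0) {
    message("Data already acquired; skipping download.")
    return(invisible(NULL))
  }
  dir_create(target_dir)
  # Download to a temporary file so no archive is kept around
  temp <- tempfile()
  download.file(url = url, destfile = temp)
  untar(tarfile = temp, exdir = target_dir)
  unlink(temp)
}

# Usage (URL is illustrative):
# acquire_archive("https://example.com/corpus.tar.gz", "data/original")
```

Running it a second time hits the control statement and returns without re-downloading.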
What did you find most/least challenging?
I think it is applying the untar() and download.file() functions, because I sometimes forget the order of the arguments and what they should contain. In such cases, I find it useful to prefix the function name with "?" to access its documentation and understand its usage. Another aspect that confuses me is what to do when I lack a CSV file, for instance when I only have a PDF or HTML file. Currently, I'm working with the Switchboard Dialog Act Corpus, which only provides UTT and HTML files.
What resources did you consult?
I referred to Recipe 5 (https://qtalr.github.io/qtalrkit/articles/recipe-5.html) for guidance, but it primarily uses the Gutenberg package, which already provides data frames. Thus, I'm curious about converting a text file into a CSV table. Through online research and assistance from ChatGPT, it appears the text needs to be structured with categories such as tokens and IDs before it can be converted into a table format. It seems establishing a corpus structure is necessary before proceeding with the conversion.
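A rough sketch of that idea in base R: read the text line by line, attach an ID column, and write the result as a CSV. The file names and the line_id/text column scheme here are my own illustrative assumptions, not a scheme from the recipe.

```r
# Minimal sketch: give raw text lines an ID structure, then write a CSV.
# The column names (line_id, text) are illustrative assumptions.
text_to_csv <- function(txt_path, csv_path) {
  lines <- readLines(txt_path)
  corpus_df <- data.frame(
    line_id = seq_along(lines),  # an ID for each line
    text    = lines              # the raw text of that line
  )
  write.csv(corpus_df, csv_path, row.names = FALSE)
  invisible(corpus_df)
}

# Usage (file names are placeholders):
# text_to_csv("corpus.txt", "corpus.csv")
```

A richer corpus structure (tokens, speaker IDs, utterance numbers) would mean splitting each line further before building the data frame, but the read-structure-write pattern stays the same.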
(from https://www.youtube.com/watch?v=BPjgwdqHM8g)
What more would you like to know about acquiring data?
I want to know more about how to handle the UTT file. When I attempted to read it, the output appeared like this:
Since I'm only displaying the first 5 lines, I'm interested in how to display a specific line, such as one containing a word like "yeah". I've found methods that don't require converting the text into CSV format, but all of them seem a little complicated. https://www.reddit.com/r/gamemaker/comments/r8qrz8/searching_for_a_specific_line_in_text_file/
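One lighter-weight option that needs no CSV conversion is base R's grep() over the character vector that readLines() returns; the UTT file name below is a placeholder.

```r
# Find lines containing a pattern without converting the file to CSV.
find_lines <- function(path, pattern) {
  lines <- readLines(path)
  grep(pattern, lines, value = TRUE)  # the matching lines themselves
}

# Usage (file name is a placeholder for an actual .utt file):
# find_lines("sw0001.utt", "yeah")
# grep(pattern, lines) without value = TRUE gives the line numbers instead.
```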