In this lab, you will use Git and GitHub to fork, clone, commit, and push changes to a repository. You may work with either a new template repository or an existing reproducible research project:
- If you are starting with a new repository, fork and clone the repository you selected to your local machine. Then orient yourself to the repository by opening the README file and reviewing the template configuration.
- If you are using an existing reproducible research project repository, open that project on your local machine and pull the latest changes from the remote repository to ensure that your local and remote repositories are in sync.
Open a Quarto document in the `process` directory and name it accordingly (e.g., `4_analysis.qmd`, `analysis.qmd`, etc.).
The data you select to explore should be in a format conducive to a text classification task. The options include the following:
In your analysis process file:

- Add a section that provides a brief description of the dataset you will be using and the task setup and aims. Include:
- Add a section that describes the potential features you will use and engineer in this task. Include:
- Add a section that describes the modeling process you will use to perform the task. Include:
Implement and organize your analysis process in a way that is reproducible. This means that you should be able to run the code in your process file and reproduce the process (for example, use `set.seed()` for any sampling step). Use the `data/analysis` (or similar) directory to store any derived datasets used in the analysis.
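As a minimal sketch (assuming a curated data frame `df`; the seed, sample size, and file name are all illustrative):

```r
# Set a seed so any sampling in the analysis is reproducible
set.seed(123)

# Hypothetical example: derive a sample and store it in data/analysis/
df_sample <- dplyr::slice_sample(df, n = 10000)
readr::write_csv(df_sample, "data/analysis/df_sample.csv")
```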
Make sure that your code is well documented with code comments and that you have included prose to describe the process of analyzing the dataset.
Include a section to describe the results of your analysis. This will contain the results of the model evaluation process, and may include an exploration of feature importance.
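As a rough sketch of what the evaluation step might look like with tidymodels (`fit_wf` and `df_test` are hypothetical names for your fitted workflow and held-out test set):

```r
library(tidymodels)
library(vip)

# Attach class predictions to the test set and compute standard metrics
predict(fit_wf, df_test) |>
  bind_cols(df_test) |>
  metrics(truth = type, estimate = .pred_class)

# Explore feature importance (supported for some model types)
fit_wf |>
  extract_fit_parsnip() |>
  vip()
```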
Confirm that your code runs without errors and that the code, visualizations, and/or tables are displayed as expected.
Finally, commit and push your changes to your GitHub repository. Make sure to add any files or directories that you do not have permission to share to your `.gitignore` file.
Some questions to consider:
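For example, if you choose the ENNTT dataset, you can acquire and curate it with functions from the `qtalrkit` package: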
```r
library(qtalrkit)

# Acquire ---------------
get_compressed_data(
  url = "https://github.com/nlp-unibuc/nlp-unibuc-website/releases/download/v1.0/ENNTT.tar.gz",
  target_dir = "data/original/"
)

# Curate ---------------
df <- curate_enntt_data("data/original/ENNTT") # This will take a while!
```
This will provide you with a curated dataset with the following variables:
| Variable | Name | Description |
|---|---|---|
| `session_id` | Session ID | The session ID of the speech |
| `speaker_id` | Speaker ID | The speaker ID of the speech |
| `state` | State | The political state of the speaker |
| `session_seq` | Session sequence | The sequence of the speech in the session |
| `text` | Text | The text of the speech |
| `type` | Type | The type of language (i.e., Native, Non-Native, or Translation) |
You will subsequently transform this dataset in accordance with the task you select to perform.
```r
# Transform --------------
df |>
  ...
```
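As a purely illustrative sketch of what such a pipeline might contain (the steps shown are placeholders, not required transformations):

```r
library(dplyr)

# Hypothetical transformation steps -- adapt these to your chosen task
df_transformed <- df |>
  mutate(type = factor(type)) |> # encode the outcome as a factor
  select(type, text)             # keep the outcome and the text to model
```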
Note on a few standard transformations:
One other, non-standard transformation you may want to consider concerns the number of observations in each class. There is a significant imbalance in the number of observations in each `type`:
| type | n | percent |
|---|---|---|
| natives | 116,341 | 13.15% |
| nonnatives | 29,734 | 3.36% |
| translations | 738,597 | 83.49% |
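This distribution can be checked directly (a quick sketch, assuming the curated `df` from above):

```r
library(dplyr)

# Tabulate the number and percentage of observations per type
df |>
  count(type) |>
  mutate(percent = n / sum(n) * 100)
```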
In the case of classification, this imbalance may lead to a model that is biased toward the majority class. You may want to consider balancing the number of observations in each class when building your recipe, using the `step_downsample()` function from the `themis` package.
```r
library(recipes) # provides recipe()
library(themis)  # provides step_downsample()

recipe(
  type ~ .,
  data = df
) |>
  step_downsample(type)
```
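Note that `step_downsample()` is applied only when the recipe is trained (its `skip` argument defaults to `TRUE`), so held-out data retain their original class distribution. The step also accepts a `seed` argument, which helps keep the downsampling itself reproducible.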