In this lab, you will use Git and GitHub to fork, clone, commit, and push changes to a repository. The repository you fork to your own GitHub account can be one of the following:
If you are starting with a new repository, fork and clone the repository you selected to your local machine. Then orient yourself to the repository by opening the README file and reviewing the template configuration.
If you are using an existing reproducible research project repository, open that project on your local machine and pull the latest changes from the remote repository to ensure that your local and remote repositories are in sync.
Open a Quarto document in the process directory and name it accordingly (e.g., `4_analysis.qmd`, `analysis.qmd`, etc.).
The data you select to explore should be in a format conducive to exploratory analysis. The options include the following:
In your analysis process file:
- Add a section that briefly describes the dataset you will be exploring and states your primary research question(s). Include:
- Add a section that describes the analytical process you will use to explore your question(s). Include:
- Add a section for each analytical process you will use to explore the question(s). In each of these sections, document the exploration with code, code comments, and prose. This is where you will craft the code to explore the data. Feel free to use existing R packages and functions as you see fit.
Make sure to organize your analysis process in a way that is reproducible. This means that anyone should be able to run the code in your process file and obtain the same results (for example, call `set.seed()` before any sampling step). Use the `data/analysis` (or similar) directory to store any derived datasets used in the analysis.
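The reproducibility advice above can be sketched as follows. This is a minimal illustration, not a required workflow: the toy data frame, the seed value, the sample size, and the file name `works_sample.csv` are all hypothetical, and the example writes to a temporary directory so that it runs anywhere. In your project you would point it at `data/analysis` (or similar) instead.

```r
# Toy data standing in for your dataset (hypothetical values)
works_toy <- data.frame(
  id = 1:10,
  lcc = rep(c("PR", "PS"), each = 5)
)

# Fix the random seed so the same rows are drawn on every run
set.seed(123)
sample_rows <- sample(nrow(works_toy), size = 4)
works_sample <- works_toy[sample_rows, ]

# Store the derived dataset. A temporary directory is used here for
# illustration only; in a project this might be "../data/analysis".
analysis_dir <- file.path(tempdir(), "data", "analysis")
dir.create(analysis_dir, recursive = TRUE, showWarnings = FALSE)
sample_path <- file.path(analysis_dir, "works_sample.csv")
write.csv(works_sample, sample_path, row.names = FALSE)

# Resetting the seed and sampling again draws the same rows
set.seed(123)
identical(sample_rows, sample(nrow(works_toy), size = 4))
```

Setting the seed immediately before each sampling step, rather than once at the top of the file, keeps each step reproducible even if earlier code that uses random numbers is later added or removed.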
Make sure that your code is well documented with code comments and that you have included prose to describe the process of analyzing the dataset.
Include a section to describe the results of your analysis.
Confirm that your code runs without errors and that the code, visualizations, and/or tables are displayed as expected.
Finally, commit and push your changes to your GitHub repository. Make sure to list any files or directories that you do not have permission to share in your `.gitignore` file so that they are not committed.
Some questions to consider:
To acquire the dataset, you may use the `get_gutenberg_works()` function from the `qtalrkit` package (see the package documentation).
The Library of Congress classification codes for British and American literature are "PR" and "PS", respectively. Set the authors' birth year and death year to 1800 and 1880, respectively.
Run the following code to acquire the dataset[^1]:
[^1]: Note that you will need an internet connection to run this code, and it may take a few minutes to run.
```r
library(qtalrkit)

# Acquire ---------------

# Get the American works
get_gutenberg_works(
  target_dir = "../data/original",
  lcc_subject = "PS",
  birth_year = 1800,
  death_year = 1880
)

# Read `works_ps.csv`
works_ps <- readr::read_csv("../data/original/works_ps.csv")

# Get the British works
get_gutenberg_works(
  target_dir = "../data/original",
  lcc_subject = "PR",
  birth_year = 1800,
  death_year = 1880
)

# Read `works_pr.csv`
works_pr <- readr::read_csv("../data/original/works_pr.csv")

# Transform ---------------

# Combine the two datasets
works <- dplyr::bind_rows(works_ps, works_pr)

# Collapse `text` by `gutenberg_id`:
# drop lines with missing text, then join each work's lines into one string
works <-
  works |>
  dplyr::filter(!is.na(text)) |>
  dplyr::group_by(gutenberg_id, lcc, author, title) |>
  dplyr::summarize(text = paste(text, collapse = " ")) |>
  dplyr::ungroup()
```
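To see what the collapse step does before running it on the full dataset, it can be tried on a small toy data frame. This is only a sketch: the object names `lines_toy` and `works_toy` and all of the values are hypothetical, and only two grouping columns are used to keep the example short.

```r
library(dplyr)

# Toy data: one row per line of text (hypothetical values)
lines_toy <- tibble(
  gutenberg_id = c(1, 1, 2, 2),
  title = c("A", "A", "B", "B"),
  text = c("first line", "second line", NA, "only line")
)

works_toy <-
  lines_toy |>
  filter(!is.na(text)) |>                           # drop lines with no text
  group_by(gutenberg_id, title) |>                  # one group per work
  summarize(text = paste(text, collapse = " ")) |>  # join lines with spaces
  ungroup()

# One row per work, with all of its lines in a single `text` value
works_toy
```

Grouping by every column you want to keep (here `gutenberg_id` and `title`) matters because `summarize()` returns only the grouping columns plus the newly computed ones.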
This work is licensed under a Creative Commons Attribution 4.0 International License.