This is a course project for ZOOL710 Data Science in R for Biologists

Overview

A template file and folder structure for a data analysis project/paper done with R/Quarto/Github. The structure promotes a data science pipeline for reproducibility in research.

Figure: The Data Science Pipeline from RD Peng

Projects

For the first project, we are focusing on data cleaning, so we only have Data and Code folders.
For the second project, we are analyzing the cleaned data, so have Data, Code, Results, and Products folders
For the third project, please analyze your own data with data cleaning, analysis, and report preparation.

History

2023-02-16 First release with Data and Code folders with materials for data cleaning and initial exploration.

2023-03-03 Changes made to the processingcode.R, processingfile_v1.qmd, and processingfile_v2.qmd.

2023-03-31 Updates were made to Project 1 which included the processingcode.R, processingfile_v1.qmd, and processingfile_v2.qmd. Project 2 added the Anaylsis_Penguins.R, Analysis_code.qmd, and Analysis_code.html files.

Acknowledgement

This template was cloned and modified from https://github.com/andreashandel/dataanalysis-template

Software requirements

This template lays out a data analysis project and report writing using R, Quarto, Github and a reference manager for bibtex. A plain text editor is also necessary, and wordprocessing software to open .docx if you wish to use that format (e.g. MS Word, MacOS Pages, or LibreOffice).

For more R packages supporting reproducible research check out the taskview https://cran.r-project.org/web/views/ReproducibleResearch.html

Template structure

The template supports the notion that there should only be one copy of code and any outputs. Any time the intellectual content needs to be reused, it should be referred to or linked from the one copy. This way we can prevent different copies of the same content from accidentally diverging, and it is easier to maintain the project.

This template also uses the convention that Folder names begin with a capital letter.

All data goes into the Data folder and any subfolders.
All code goes into the Code folder or subfolders.
All results (figures, tables, computed values) go into Results folder or subfolders.
All products (manuscripts, supplement, presentation slides, web apps, etc.) go into Products subfolders.
See the various README.md files in those folders for some more information.

Template content

All of the folders contain template files that should filled with some content required for your analysis. Look inside each folder for template files that provide examples of the types of content that go there.

The template files also demonstrate how information is linked together across folders.

Please see the README.md files in each folder for more details.

This template has been populated with Palmer Penguins morphology and life history data in the Raw_data folder.
The Processing_code folder contains several files that load the raw data, perform a bit of cleaning, and save the result in the Processed_data folder.
The Analysis_code folder contains several files that load the processed data, do an exploratory analysis, and fit a simple model. These files produce figures and some numeric output (tables), which are saved to the results folder.
The Products folder contains an example bibtex and CSL style files for references. Those files are used by the example manuscript and slides.
The Manuscript folder contains a template for a report written as Quarto file. There is also a sub-folder containing an example template for a supplementary material file as is common in scientific articles these days.
The slides folder contains a basic example of slides made with Quarto.

Getting started

This is a Github template repository. The best way to get it and start using it is by following these steps.

Once you got the repository, you can check out the examples by executing them in order.

Starting with Project 1, look in Code>Processing_Code:

First run the processingcode.R, which will produce the processed data.
Then you can run the processingfile_v1.qmd and/or processingfile_v2.qmd. Processingfile_v2 requires edits to be made inthe processingcode.R script

Starting Project 2, look in the Code>Analysis_code:

Run the analysis scripts, which will take the processed data and produce some results.
Then you can run the manuscript, poster and slides example files in any order. Those files pull in the generated results and display them. These files also pull in references from the bibtex file and format them according to the CSL style.

Yokohime / Chiyoko_Project

readme