BIOL548O / Discussion

A repository for course discussion in BIOL548O

Homework 3 clarification #36

Open sandraemry opened 7 years ago

sandraemry commented 7 years ago

Hi @aammd

Do we add assertions to our script that cleans the raw data? Or should I read in my tidy data set and write assertions for that one?

Thanks!

Sandra

aammd commented 7 years ago

Hi @sandraemry, that is a good question! I think both are just fine. Just make sure your reviewer knows where to find the assertions, perhaps by labelling that section in your R script with a large comment.
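As a minimal sketch of what a clearly labelled assertion section could look like: the example below uses base R's `stopifnot()` rather than any particular assertion package, and the data frame `tidy_data` and its columns are hypothetical stand-ins for your own tidy dataset.

```r
# =====================================================================
# ASSERTIONS: checks on the tidy data (reviewers, start here)
# =====================================================================

# Hypothetical tidy data; in practice this would come from your cleaning
# script or be read in from your tidy CSV file.
tidy_data <- data.frame(
  temp    = c(20L, 25L),
  litter  = factor(c("H", "L"), levels = c("H", "L")),
  biomass = c(1.2, 3.4)
)

# stopifnot() halts the script with an error if any condition is not TRUE.
stopifnot(
  is.integer(tidy_data$temp),     # temperature stored as integer
  is.factor(tidy_data$litter),    # treatment stored as a factor
  all(tidy_data$biomass >= 0),    # biomass cannot be negative
  !any(is.na(tidy_data))          # no missing values anywhere
)
```

The same block works whether it sits at the bottom of the cleaning script or at the top of a fresh script that reads the tidy data; only the way `tidy_data` is created changes.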

katcheung commented 7 years ago

Hi @aammd, I tried to read in my tidied data and verify that certain columns were set up as factors, but the check fails. I confirmed with my original tidying script that the columns were set up correctly. I seem to lose that information (e.g. tank.no as a factor) in my saved csv file. Is this normal? If so, should we assume that we're continuing to work with the final product of our tidying script (assignment 2) rather than reading in our tidied data csv file? Sorry if this is confusing. Thanks, Katherine

sandraemry commented 7 years ago

Hi @katcheung, you can read in your csv file and specify the data type of each column. So for me it would look like this:

library(readr)

mydata <- read_csv(
  "./data/flowcam_sum_tidy.csv",
  col_types = cols(
    temp = col_integer(),
    litter = col_factor(c("H", "L")),
    rep = col_integer(),
    cell_density = col_integer(),
    cell_volume = col_double(),
    biomass = col_double()
  )
)

Is that what you were asking about? Or maybe @aammd has a better solution?

aammd commented 7 years ago

Hi @sandraemry & @katcheung,

I think Sandra has a good answer here! You're right: a CSV file stores only text, so factors (like every other column type) are created when the csv or other file is read into R. So if you change the way you are reading the file, you change the way the result is represented in R. Sandra's example code shows one way to control exactly how each column is read.
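To illustrate the point about factors, here is a small self-contained sketch in base R. It uses `read.csv()` with `colClasses` as an alternative to the `readr` approach; the data frame `tanks` and its values are made up for the example, with `tank.no` borrowed from the question only as a column name.

```r
# Hypothetical data with tank.no stored as a factor.
tanks <- data.frame(
  tank.no = factor(c(1, 2, 3)),
  mass    = c(0.5, 0.7, 0.9)
)

# Write to a temporary CSV file; the file holds only text, not R types.
path <- tempfile(fileext = ".csv")
write.csv(tanks, path, row.names = FALSE)

# Reading back with the defaults lets R guess the types again,
# so the factor information is not restored.
reread <- read.csv(path)
is.factor(reread$tank.no)

# Declaring the column classes at read time brings the factor back.
reread_typed <- read.csv(
  path,
  colClasses = c(tank.no = "factor", mass = "numeric")
)
is.factor(reread_typed$tank.no)
```

Either way, the type information lives in the reading code rather than in the CSV file, which is why an assertion on factor columns can pass in the tidying script and fail after a plain re-read.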

Another answer to your question @katcheung is that you can choose to work in a clean script (reading in your tidy CSV) or on the bottom of your old one. Just make sure it is clear for your peer reviewer.

LinneaSandell commented 7 years ago

@aammd Regarding the metadata, should we make it a routine to work only with files that have metadata? As an example, should I save all my data files as csvy? It doesn't seem very useful to have metadata for only one part of your workflow (you add metadata in 01_rscript, but read in the data as csv in 02_analyse_data). Let me know for which files metadata should be attached, and when it is optional. Thank you.

aammd commented 7 years ago

@LinneaSandell this is an interesting question, and one we should return to in class! Briefly, I think that we are drawing a distinction here between "in progress" data and the "final version" of the dataset. So we add metadata only when we are "happy" with the way the dataset is organized. However, there are many other workflows that could be imagined, where metadata is created at the beginning, or in the middle, of a project.