WFU-TLC / flc_discussion_board

A repository for discussing questions and issues in the Data Analysis with R (FLC)
https://wfu-tlc.github.io/

TIP: Keep variables, data table headers, AND code values in one_word format #11

Open adanieljohnson opened 5 years ago

adanieljohnson commented 5 years ago

Best practice for naming columns in data tables is to give each column a one-word or snake_case title. This makes it easier to refer to the columns as variables in code. I learned that this applies to the coded values entered in the columns too.
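As a small illustration of why one-word names are easier to work with, here is a sketch using a made-up data frame (the column names `word count` and `word_count` are hypothetical, not taken from my dataset). A name containing a space has to be wrapped in backticks every time it is used, while a snake_case name does not.

```r
# Hypothetical data; check.names = FALSE keeps the space in the column name.
messy <- data.frame(`word count` = c(120, 95, 143), check.names = FALSE)
tidy  <- data.frame(word_count = c(120, 95, 143))

# A name with a space must be quoted with backticks each time it is used.
mean_messy <- mean(messy$`word count`)

# A one-word name can be typed directly.
mean_tidy <- mean(tidy$word_count)

# Both columns hold the same values, so the means agree.
mean_messy == mean_tidy
```

Note that read.csv() converts names like this automatically by default (check.names = TRUE), so the space may already have turned into a period by the time the data is in R.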

At our last meeting I said I was having trouble using cor.test to get Pearson correlations on word frequencies. I could calculate it for one part of my dataset but other subsets failed to run properly. Jerid pointed out I used text with spaces and punctuation to code values in my CSV source file and suggested re-coding to simpler one-word terms. I used Search/Replace in Excel to switch my coding terms from/to:

Either the extra spaces and punctuation were the problem, or I had a hidden typo. Either way, simplifying the code terms fixed it.
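The same re-coding can also be done in R instead of Excel, which leaves the source CSV untouched and makes hidden typos easier to spot. This is only a rough sketch: the data frame, the column name `feedback_type`, and the labels are all invented for illustration and are not my actual coding terms.

```r
# Hypothetical data: column name and labels are made up for illustration.
essays <- data.frame(
  feedback_type = c("Needs work (basic)", " Needs work (basic)",
                    "Good, no changes", "Good, no changes "),
  word_freq = c(12, 15, 30, 28),
  stringsAsFactors = FALSE
)

# Lookup table: old label -> one-word code.
recode_map <- c(
  "Needs work (basic)" = "basic",
  "Good, no changes"   = "good"
)

# trimws() strips stray leading/trailing spaces before the lookup.
cleaned <- trimws(essays$feedback_type)
essays$feedback_code <- unname(recode_map[cleaned])

# Any label that is missing from the map (e.g. a typo) comes back as NA.
stopifnot(!anyNA(essays$feedback_code))
table(essays$feedback_code)
```

Because unmatched labels become NA, the check at the end would flag a misspelled code value rather than letting it silently form its own group.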

medewitt commented 5 years ago

@adanieljohnson great points and a good topic to discuss!

There is a function that can convert "improper" R names (e.g. spaces and invalid characters) to proper R names. It looks like the following:

names(your_data) <- make.names(names(your_data))

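This function replaces spaces and other invalid characters with periods (it does not remove them) and prepends "X" to names that would otherwise start with an invalid character, such as a digit. It is a quick trick to make syntactically valid column names.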
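To see what make.names() does in practice, here is a short sketch on a few made-up column names (the names themselves are hypothetical):

```r
# Hypothetical column names with spaces, punctuation, and a leading digit.
raw_names <- c("word count", "2nd draft", "score (%)", "ok_name")

fixed <- make.names(raw_names)
fixed
# "word.count" "X2nd.draft" "score...."  "ok_name"

# Different inputs can collapse to the same name; unique = TRUE
# appends a numeric suffix to keep them distinct.
deduped <- make.names(c("a b", "a.b"), unique = TRUE)
deduped
# "a.b"   "a.b.1"
```

Each invalid character becomes one period, so heavily punctuated names can end up with runs of dots.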

Additionally, Karl Broman and Kara Woo wrote a neat journal article on organizing data in spreadsheets ("Data Organization in Spreadsheets", The American Statistician, 2018), which is a great reference. Both are avid R users AND biostats folks.

francojc commented 5 years ago

The janitor package also has a slew of functions for examining and cleaning data, including clean_names() to deal with non-conventional column names in data.frames.
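A minimal sketch of clean_names(), assuming the janitor package is installed (the data frame and its column names are made up for illustration). By default it converts names to snake_case:

```r
# Hypothetical data; check.names = FALSE keeps the messy names as typed.
survey <- data.frame(
  `Word Count` = c(120, 95),
  `2nd Draft`  = c("yes", "no"),
  check.names = FALSE
)

# Only run the cleaning step if janitor is available.
if (requireNamespace("janitor", quietly = TRUE)) {
  survey <- janitor::clean_names(survey)
}

names(survey)
# with janitor installed: "word_count" "x2nd_draft"
```

Unlike make.names(), this gives lower-case names separated by underscores rather than periods.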