cherrypi / Science-Fair_2019

Vernal Pond graphing and data, as well as data analysis.

Some of your data is "stuck" as characters #8

Closed VCF closed 5 years ago

VCF commented 5 years ago

So you're noticing that some of your numeric columns are misbehaving, and showing up as characters rather than numbers.

When read.table loads a file, it makes a "best guess" as to what each column is. It will try a series of data types, in this order: logical, integer, numeric, complex. If none of those can be consistently applied to all values, it gives up and uses character as a "junk drawer", "I have no idea" category (you can read more about it with ?type.convert).
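If you want to watch that guessing happen, here's a small sketch that calls type.convert directly. The values are made up for illustration, and as.is = TRUE just tells R to keep strings as plain characters rather than turning them into factors:

```r
# Made-up values, just to watch R's guessing at work
all_whole   <- type.convert(c("1", "2", "3"), as.is = TRUE)
has_decimal <- type.convert(c("1.5", "2", "3"), as.is = TRUE)
has_note    <- type.convert(c("1.5", "2", "3 (approx)"), as.is = TRUE)

class(all_whole)    # every value fits integer
class(has_decimal)  # one decimal moves the whole vector to numeric
class(has_note)     # one stray comment moves the whole vector to character
```

Note that it only takes a single odd value to change the type of the entire vector.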

So if you see a number column showing up as character, it means R has found something in that column that ain't a number. You need to scan through those columns and make sure you don't have "extra" stuff in them. You have an $OtherObservations "free-form" notes column; move any comments there.

Don't worry about the dates for the moment, that's a bit more complex, we'll deal with that later.

VCF commented 5 years ago

You made some progress in 69178bc, but you still have numeric columns that are showing up as characters. This little chunk of code looks at the data type ("Storage Mode") for each of your columns - you can see that some numeric columns are still showing as characters:

vapply(colnames(Pond_Data), function(n) storage.mode(Pond_Data[[ n ]]), "")
             Date              Rain             Depth             South 
      "character"       "character"          "double"          "double" 
            North              West              East    TemperatureMax 
         "double"          "double"          "double"       "character" 
   TemperatureMin OtherObservations 
      "character"       "character" 

You can explore this problem in R. Now that you have your data as a data.frame, you can use handy "accessors" to pull out parts of your data. R is vigorously centered around the concept of a vector: a 1D set of zero or more values. Your data.frame is just a collection of vectors of the same length, with each column being a vector. You can pull out columns in one of three ways:

Pond_Data[[ 2 ]] # Pull out the second column
Pond_Data[[ "Rain" ]] # Pull out the second column, which is named "Rain"
Pond_Data$Rain # As above, but easier/faster when on the command line
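You can convince yourself that all three really return the same thing. This is a sketch on a tiny invented data.frame (toy is a stand-in for Pond_Data, and its values are made up):

```r
# Toy stand-in for Pond_Data; these values are invented
toy <- data.frame(Date = c("4/1", "4/2", "4/3"),
                  Rain = c(0.2, 0.0, 1.1),
                  stringsAsFactors = FALSE)

by_position <- toy[[ 2 ]]
by_name     <- toy[[ "Rain" ]]
by_dollar   <- toy$Rain

identical(by_position, by_name)   # same vector either way
identical(by_name, by_dollar)     # and again
length(by_dollar) == nrow(toy)    # one value per row
```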

Use the above to inspect each of your should-be-numeric-but-is-instead-character columns to find out what's wrong.
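If eyeballing a long column gets tedious, one possible approach is to let R point at the troublemakers. This is only a sketch: temp_max is an invented vector standing in for one of your stuck columns, and it leans on the fact that as.numeric() returns NA for anything it can't parse:

```r
# Toy stand-in for one "stuck" column; values are invented
temp_max <- c("61", "58.5", "63 (cloudy)", NA, "")

# as.numeric() gives NA (plus a warning) for text it can't parse
parsed <- suppressWarnings(as.numeric(temp_max))

# Troublemakers: cells that had real text in them but did not parse as a number
is_bad <- !is.na(temp_max) & temp_max != "" & is.na(parsed)

which(is_bad)      # row numbers to go look at
temp_max[is_bad]   # the offending values themselves
```

Swap in something like Pond_Data$TemperatureMax for temp_max to try it on the real thing.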

Also note - R counts from 1, unlike most every other non-laughing-stock language which counts from 0.
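A quick illustration on a made-up vector. The sneaky part is that asking for position 0 is not an error; you just get nothing back:

```r
x <- c(10, 20, 30)   # invented values

first  <- x[1]   # the first element
zeroth <- x[0]   # no error, just an empty vector

first
length(zeroth)
```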

VCF commented 5 years ago

Ooo, I forgot one:

Pond_Data[ ,2] # Pull out all rows from the second column

The single square bracket is the general way to access a data.frame or an array/matrix. You can use it to "cherry pick" values:

Pond_Data[21,2] # Pull out a single value from row 21, column 2
Pond_Data[4:10,2] # Pull out a vector of rows 4 through 10 from column 2
Pond_Data[4:10,1:2] # Gimme a mini-data frame, rows 4-10, columns 1-2
Pond_Data[4:10,c("Date","Rain")] # Same as above, using names
Pond_Data[c(3,41,2),c("Rain","TemperatureMax","Date")] # Weee, grab random things in random order

The c() function is used A LOT. It's a way to build an arbitrary vector, and is often used as I have above, to specify a particular selection from a larger data structure.
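A few more things c() will do, sketched with made-up values. The last line ties back to the original problem: a vector can only hold one type, so a single piece of text drags everything along with it:

```r
rows  <- c(3, 41, 2)        # a plain numeric vector
more  <- c(rows, 7:9)       # c() also glues existing vectors together
mixed <- c(1, 2, "three")   # mix in some text and everything becomes character

class(rows)
length(more)
class(mixed)
```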

VCF commented 5 years ago

... aaannnddd, finally, another "gotcha" when R is trying to "help" you:

Pond_Data[1:3,1:3] # We asked for 2 dimensions, R gives us 2 dimensions (another data.frame)
Pond_Data[[ 3 ]] # We asked for 1D (a single column), R gives us 1D
Pond_Data[1:3, 2] # We asked for ... what exactly? R has given us 1D!

In the last example above, we're using the [ ] single bracket accessor, which is normally used to recover multi-dimensional subsets of the data. But R notices that one dimension (the column request) has only a single value, so it decides to be "helpful" and "drop" that dimension, so instead of getting a 2D subset, you get a 1D vector. Sometimes this is what you want. But sometimes this causes major headaches. To prevent this from happening, there's a parameter you can add to [ ]:

Pond_Data[1:3, 2, drop=FALSE] # We're telling R not to 'drop' any of the dimensions, so we keep 2D
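You can check the difference for yourself. A sketch on an invented data.frame (toy is a stand-in for Pond_Data, and the values are made up):

```r
# Toy stand-in for Pond_Data; these values are invented
toy <- data.frame(Date  = c("4/1", "4/2", "4/3", "4/4"),
                  Rain  = c(0.2, 0.0, 1.1, 0.4),
                  Depth = c(10, 11, 12, 13),
                  stringsAsFactors = FALSE)

dropped <- toy[1:3, 2]                 # R "helps" and hands back a bare vector
kept    <- toy[1:3, 2, drop = FALSE]   # still a data.frame, with one column

is.data.frame(dropped)
is.data.frame(kept)
dim(kept)
```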

VCF commented 5 years ago

Per your trouble finding what's wrong with $TemperatureMin:

# In a CSV file, R will not treat these two rows the same - they have different content!
1,,3
1,NA,3

VCF commented 5 years ago

Ok, efaa344 has fixed the issue where you had a missing column in one row.
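For next time, one way to hunt for a row with a missing column is count.fields(), which reports how many fields it finds on each line of a file. This is a sketch on invented CSV text (csv_lines is made up; with a real file you would pass the file name instead of a textConnection):

```r
# Invented CSV text; one of the data rows is short a column
csv_lines <- c("Rain,Depth,South",
               "0.2,10,4",
               "0.0,11,5",
               "1.1,12",
               "0.4,,6")

con <- textConnection(csv_lines)
n_fields <- count.fields(con, sep = ",")
close(con)

n_fields                         # fields found on each line, header included
which(n_fields != n_fields[1])   # lines that disagree with the header
```

Note that the last row, with an empty value between two commas, still has the full number of fields. It's the row with a missing comma that comes up short.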