datacarpentry / r-raster-vector-geospatial

Introduction to Geospatial Raster and Vector Data with R
https://datacarpentry.org/r-raster-vector-geospatial

change levels to unique in vector attributes lesson #342

Open lisamr opened 3 years ago

lisamr commented 3 years ago

In Explore and Plot by Vector Layer Attributes, the lesson is about seeing unique values and uses levels(lines_HARV$TYPE), which produces NULL because the column is not defined as a factor. I would suggest unique(lines_HARV$TYPE) instead.
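The behavior described above can be reproduced without the lesson data. This is a minimal sketch that uses a small character vector as a stand-in for lines_HARV$TYPE; the name `type` is hypothetical, and the values are the ones quoted later in this thread.

```r
# Stand-in for lines_HARV$TYPE when it is read as character
# (hypothetical vector; the real column comes from HARV_roads.shp)
type <- c("woods road", "footpath", "footpath", "stone wall", "boardwalk")

# levels() reads the "levels" attribute, which a character vector does not have
char_levels <- levels(type)
print(char_levels)   # NULL

# unique() works on any vector and keeps the order of first appearance
type_values <- unique(type)
print(type_values)
```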

jsta commented 3 years ago

I wonder if this is due to stringsAsFactors being FALSE by default in R >= 4.0? (#328) I think more than that one command would need to be changed, because the surrounding text is all about factors and now lines_HARV$TYPE is no longer a factor :(

djhunter commented 3 years ago

Yes, I believe that @jsta is correct that this behavior is due to the change in the default value of stringsAsFactors in R version 4.0. I was going to submit a quick pull request, but then I realized that there are some pedagogical choices that need to be made.

Just changing levels() to unique() will fix the NULL output issue, but the larger problem is that there are several places in Episode 7 where lines_HARV$TYPE is referred to as a factor, which leads to a brief discussion of factors. This problem also comes up in Episodes 8 and 10. It seems to me that there are at least two ways to fix this:

  1. Change levels() to unique() in Episodes 7, 8, and 10, and update the exposition in Episodes 7 and 10 to remove any discussion of factors.
  2. Convert the strings to factors, and leave the exposition (mostly) the same.

I'd be happy to take care of this, but I need some advice about which of these options to choose. My inclination would be to go with Option (1), as it will simplify the lesson a little, and there doesn't seem to be any reason to convert the strings to factors for the purposes of visualizing the data. However, if there was a specific pedagogical reason to include a review of factors in this lesson, then Option (2) would be preferable.

jsta commented 3 years ago

I like Option 1 as well. I don't think we have any ggplot code that relies on factors; that would be my only hesitation.

djhunter commented 3 years ago

There is code that relies on the ordering of the factors. It still works if lines_HARV$TYPE is a character variable, because (I believe that) ggplot converts character variables to factors when they are used in aes(). So changing levels() to unique() might be slightly confusing in places like the following:

First we will check how many unique values the TYPE field has:

unique(lines_HARV$TYPE)

[1] "woods road" "footpath"   "stone wall" "boardwalk" 

Then we can create a palette of four colors, one for each feature in our vector object.

road_colors <- c("blue", "green", "navy", "purple")

We can tell ggplot to use these colors when we plot the data.

ggplot() +
  geom_sf(data = lines_HARV, aes(color = TYPE)) + 
  scale_color_manual(values = road_colors) +
  labs(color = 'Road Type') +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Roads & Trails") + 
  coord_sf()

The alert reader will notice that woods road is not colored blue, as might be expected, because the road_colors get assigned to the path types in factor (i.e., alphabetical) order, not in the order given by unique(). The same problem happens later when customizing line widths.
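The mismatch can be seen without plotting anything. The sketch below uses a hypothetical stand-in vector `type` and pairs the colors with the categories positionally, in factor order, which is the pairing described above. It is an illustration of the ordering issue, not the lesson's code.

```r
# Hypothetical stand-in for lines_HARV$TYPE
type <- c("woods road", "footpath", "footpath", "stone wall", "boardwalk")
road_colors <- c("blue", "green", "navy", "purple")

# Order the learner sees from unique(): order of first appearance
appearance_order <- unique(type)

# Order used once the values are treated as a factor: sorted
factor_order <- levels(factor(type))

# Positional pairing of colors with the factor-ordered categories
positional <- setNames(road_colors, factor_order)
print(positional)

# One way to avoid relying on position: name each color explicitly
named_colors <- c("woods road" = "blue", "footpath" = "green",
                  "stone wall" = "navy", "boardwalk" = "purple")
```

With the positional pairing, "woods road" ends up with "purple" rather than "blue". A named vector such as `named_colors` could be passed as `values` to scale_color_manual(), which matches named values to categories by name, so the result would no longer depend on the ordering.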

So now I'm starting to lean toward Option 2. It is natural to want to customize the order of things in plots, and you can't do that without grappling with factors.

We can recover the pre-version 4.0 behavior by adding stringsAsFactors = TRUE to all of the st_read commands. This is probably the simplest fix, as it doesn't involve changing as much of the exposition, and it will eliminate the confusion of some learners using pre-4.0 versions.
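As a rough illustration of what the argument changes, the sketch below uses read.csv() on an inline, made-up table so that it runs without the lesson's files. The table contents and the names `csv_text`, `as_character` and `as_factor` are assumptions for the example; the thread proposes passing the same argument to st_read.

```r
# Made-up table standing in for one of the lesson's input files
csv_text <- "id,TYPE
1,woods road
2,footpath
3,stone wall
4,boardwalk
"

# Strings kept as character (FALSE is the default in R >= 4.0)
as_character <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Pre-4.0 behavior: strings converted to factors on read
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

print(class(as_character$TYPE))
print(class(as_factor$TYPE))
print(levels(as_factor$TYPE))   # levels are sorted, not in file order
```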

lisamr commented 3 years ago

Thanks all for picking up this issue. It seems like unique() would be a quick and dirty fix, but would lead to issues later on. It would also be a good thing for learners to know about using stringsAsFactors = TRUE, since factors accidentally being treated as characters comes up in my own personal code all the time. I like @djhunter's explanation and solution.

jsta commented 3 years ago

After consideration, PR #353 seems like the "nuclear option" to me. It requires so much more typing on the learners' part. What about using unique to list line types and aes(color = factor(column_name, levels = road_colors)) in the plotting commands?

Then still discuss factors but move it to a better spot somewhere just before factor-plotting.
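A sketch of how this suggestion might look in base R, under one assumption worth stating loudly: the `levels` passed to factor() are taken to be the line-type names in the desired order, with the colors supplied separately (for example through scale_color_manual, as in the plotting code quoted earlier in this thread). The names `type`, `type_order`, `type_factor` and `color_for_level` are hypothetical.

```r
# Hypothetical stand-in for lines_HARV$TYPE
type <- c("woods road", "footpath", "footpath", "stone wall", "boardwalk")
road_colors <- c("blue", "green", "navy", "purple")

# Use the order reported by unique() as the explicit level order
type_order <- unique(type)
type_factor <- factor(type, levels = type_order)
print(levels(type_factor))

# Colors now line up with the levels position by position
color_for_level <- setNames(road_colors, levels(type_factor))
print(color_for_level)
```

Any value that is not listed in `levels` becomes NA, so the levels need to be the type names rather than the color names. With the levels set this way, "woods road" is paired with "blue", matching the order learners see from unique().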

drakeasberry commented 3 years ago

What if we use options(stringsAsFactors = TRUE) to replicate the pre-4.0 default? This would allow users running R 4.0 to experience the lesson the same way as users running pre-4.0 versions. Then we would not need to add stringsAsFactors = TRUE to each individual read command, which would reduce the amount of typing for the learner.

djhunter commented 3 years ago

According to this post, the stringsAsFactors global option will eventually be phased out, so setting it via the options command could lead to errors later when the phaseout happens.

There are only three read commands in which learners would have to type stringsAsFactors = TRUE: when reading HARV_roads.shp, HARV_PlotLocations.csv and hf001-06-daily-m.csv. All of the other changes in pull request #353 are just repetitions of these, which presumably learners won't have to repeat if they maintain their environments between episodes.

jonjab commented 2 years ago

We taught this lesson last month. stringsAsFactors = TRUE was not a big deal.

What was a bigger deal was running out of memory in our RStudio hub environment.