datacamp / testwhat

Write Submission Correctness Tests for R exercises
https://datacamp.github.io/testwhat
GNU Affero General Public License v3.0

The question keeps timing out when I run the full solution #176

Closed benjamin-feder closed 6 years ago

benjamin-feder commented 6 years ago

For some reason, when I include my full SCT, the question times out, but when I only include parts, it's good to go. I don't know why it's doing that, and it's not giving any feedback that I can work with.

Make a summary plot of the number of daily rides with workweek / weekend days colored differently.

@instructions

@hint

@pre_exercise_code

.data_dir <- "/usr/local/share/datasets"
load(file.path(.data_dir, "bike.Rdata"))

@sample_code

library(dplyr)
library(ggplot2)

daily <- bike %>%
  ___ %>%
  summarise(n = ___)

ggplot(daily, ___) +
  ___

@solution

library(dplyr)
library(ggplot2)

daily <- bike %>%
  group_by(start_day, weekday) %>%
  summarise(n = n())

ggplot(daily, aes(start_day, n, color = weekday)) +
  geom_point()

@sct

#TODO: Why isn't this running?
ex() %>% check_library("dplyr")
ex() %>% check_library("ggplot2")
test_correct(ex() %>% check_object("daily") %>% check_equal(),
ex() %>% {
    check_function(., "group_by") %>% {
        check_arg(., ".data") %>% check_equal()
        check_result(.) %>% check_equal(incorrect_msg = "Did you use `group_by()` on the correct variables?", append=FALSE)
        }
    check_function(., "n")
    check_function(., "summarise") %>% {
        check_arg(., ".data") %>% check_equal()
        check_result(.) %>% check_equal(incorrect_msg = "Make sure you `summarise()` the correct statistic.", append=FALSE)
        }
    }
)

ex() %>% {
    check_function(., "ggplot") %>% check_arg("data") %>% check_equal()
    check_function(., "aes") %>% {
        check_arg(., "x") %>% check_equal(eval = FALSE)
        check_arg(., "y") %>% check_equal(eval = FALSE)
        check_arg(., "color") %>% check_equal(eval = FALSE)
        }
    check_function(., "geom_point")
    }
ex() %>% check_error()
# Interesting observations: bikes only available April - November (we'll include data up to November 2017 when available); weekends are not systematically more or less busy than weekdays.

filipsch commented 6 years ago

@benjamin-feder

I had a look at the course, and the fourth chapter in particular. You are dealing with a huge data set in the last chapter. I checked: bike is a data frame with 4018722 rows and 12 columns. That is over 48 million values.

Handling such a large amount of data in the cloud is not straightforward for DataCamp's servers. Students get about 800 MB of RAM to do their computations, and analyzing data like this is stretching it.

I did a commit to your review-ben branch of the course. It does two things:

  1. It sets the runtime_config: spark in the course.yml. This gives students more RAM when they are taking exercises in this course.
  2. It disables the heaviest SCT functions in the exercise that you referenced. Doing the following:

    ex() %>% check_function('summarize') %>% check_result() %>% check_equal()

    sure is robust, but it's rerunning the summarize call in both the student environment and the solution environment. In this case, that means 2 extra extremely computationally heavy summarize(group_by(bike, ...)) calls. Disabling that last step makes sure the SCT can run within the timeout time again.

While it's fixed for now (you can submit the solution and it passes), the experience for students is not good. Code simply takes too long to execute. Just giving more resources to students is a very ad-hoc way of solving this, and shouldn't be the answer. Rather, I suggest you work with a subset of the data (random sample of 10%, for example).
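As a rough illustration of the subsetting suggestion, here is a minimal sketch in base R. The toy `bike` data frame, its `ride_id` column, and the name `bike_small` are made up for the example and do not come from the course; the real data set has different columns and far more rows.

```r
# Toy stand-in for the real `bike` data frame (hypothetical columns and values).
set.seed(42)  # fixed seed so the same rows are drawn on every run

n_rows <- 1000
bike <- data.frame(
  ride_id   = seq_len(n_rows),
  start_day = as.Date("2017-04-01") + sample(0:199, n_rows, replace = TRUE)
)

# Draw 10% of the row indices at random, without replacement.
keep <- sample(nrow(bike), size = round(0.1 * nrow(bike)))

# Keep only the sampled rows; all columns are preserved.
bike_small <- bike[keep, , drop = FALSE]

nrow(bike_small)
```

For this to help with the timeout, the sample would presumably have to be drawn once, saved, and shipped with the course in place of the full file; sampling inside the pre-exercise code would still load all rows into memory first.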

I'm going to close the issue here, but I believe you should take this up with either the instructor or @yashasroy, who seems to be responsible for this course.

Finally, some comments about your issue: it was a great first try, but I couldn't reproduce it, as the pre-exercise code refers to a data set (bike.Rdata) that is baked into the course image through requirements.r. I also didn't get a reference to the course on GitHub, on Teach or on campus. I managed to find it okay, but try to provide as many links as you can in the future. Thanks!