The question keeps timing out when I run the full solution

datacamp / testwhat

Write Submission Correctness Tests for R exercises

GNU Affero General Public License v3.0

For some reason, when I include my full SCT, the exercise times out, but when I only include parts of it, it runs fine. I don't know why it does that, and it doesn't give any feedback that I can work with.

Make a summary plot of the number of daily rides with workweek / weekend days colored differently.

@instructions

Create a dataset of daily counts using group_by() and summarise(). Group by start_day and be sure to include the variable weekday as well.
Plot the result, using points to plot the number of rides on the y-axis and the start day on the x-axis. Color the points by weekday.

@hint

@pre_exercise_code

.data_dir <- "/usr/local/share/datasets"
load(file.path(.data_dir, "bike.Rdata"))

@sample_code

library(dplyr)
library(ggplot2)

daily <- bike %>%
  ___ %>%
  summarise(n = ___)

ggplot(daily, ___) +
  ___

@solution

library(dplyr)
library(ggplot2)

daily <- bike %>%
  group_by(start_day, weekday) %>%
  summarise(n = n())

ggplot(daily, aes(start_day, n, color = weekday)) +
  geom_point()

@sct

#TODO: Why isn't this running?
ex() %>% check_library("dplyr")
ex() %>% check_library("ggplot2")
test_correct(ex() %>% check_object("daily") %>% check_equal(),
ex() %>% {
    check_function(., "group_by") %>% {
        check_arg(., ".data") %>% check_equal()
        check_result(.) %>% check_equal(incorrect_msg = "Did you use `group_by()` on the correct variables?", append=FALSE)
        }
    check_function(., "n")
    check_function(., "summarise") %>% {
        check_arg(., ".data") %>% check_equal()
        check_result(.) %>% check_equal(incorrect_msg = "Make sure you `summarise()` the correct statistic.", append=FALSE)
        }
    }
)

ex() %>% {
    check_function(., "ggplot") %>% check_arg("data") %>% check_equal()
    check_function(., "aes") %>% {
        check_arg(., "x") %>% check_equal(eval = FALSE)
        check_arg(., "y") %>% check_equal(eval = FALSE)
        check_arg(., "color") %>% check_equal(eval = FALSE)
        }
    check_function(., "geom_point")
    }
ex() %>% check_error()
# Interesting observations: bikes only available April - November (we'll include data up to November 2017 when available); weekends are not systematically more or less busy than weekdays.

@benjamin-feder

I had a look at the course, and the fourth chapter in particular. You are dealing with a huge data set in the last chapter. I checked: `bike` is a data frame with 4018722 rows and 12 columns. That is over 48 million values.

Handling such a large amount of data in the cloud is not straightforward for DataCamp's servers. Students get about 800 MB of RAM to do their computations, and analyzing data like this is stretching it.
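The numbers above can be checked with a quick back-of-the-envelope calculation. The 8 bytes per value used below is an assumption (it only holds if every column were a plain numeric vector); the actual column types of `bike` are not stated here, so treat the result as an order-of-magnitude estimate.

```r
# Dimensions of `bike` as reported above
n_rows <- 4018722
n_cols <- 12

n_values <- n_rows * n_cols
n_values  # 48224664 values in total

# ASSUMPTION: 8 bytes per value (double-precision numeric columns).
# Integer, factor or character columns would take a different amount.
bytes_per_value <- 8
approx_mib <- n_values * bytes_per_value / 1024^2
round(approx_mib)  # roughly 368 MiB for a single in-memory copy
```

Under that assumption, one copy of the data already takes close to half of the available RAM before any intermediate result from grouping and summarising is allocated.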

I pushed a commit to your review-ben branch of the course. It does two things:

- It sets `runtime_config: spark` in the `course.yml`. This gives students more RAM when they are taking exercises in this course.
- It disables the heaviest SCT functions in the exercise that you referenced. Doing the following:
```
ex() %>% check_function('summarize') %>% check_result() %>% check_equal()
```
sure is robust, but it reruns the `summarize` call in both the student environment and the solution environment. In this case, that means 2 extra, extremely computationally heavy `summarize(group_by(bike, ...))` calls. Disabling that last step lets the SCT finish within the time limit again.

While it's fixed for now (you can submit the solution and it passes), the experience for students is not good: the code simply takes too long to execute. Just giving students more resources is a very ad hoc way of solving this, and shouldn't be the answer. Rather, I suggest you work with a subset of the data (a random sample of 10%, for example).
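A minimal sketch of the subsetting idea, in base R. The real `bike` data is baked into the course image, so a small synthetic stand-in is built here. The column names `start_day` and `weekday` come from the exercise; the synthetic values, the logical encoding of `weekday`, the seed, and the name `bike_small` are all hypothetical.

```r
set.seed(42)  # hypothetical seed, only so the sample is reproducible

# Synthetic stand-in for `bike`; the real data has about 4 million rows
n <- 10000
days <- seq(as.Date("2017-04-01"), as.Date("2017-11-30"), by = "day")
start_day <- sample(days, n, replace = TRUE)
bike <- data.frame(
  start_day = start_day,
  # ASSUMPTION: weekday is TRUE for Monday-Friday, FALSE for weekends
  weekday   = !(format(start_day, "%u") %in% c("6", "7"))
)

# Keep a random 10% of the rows, sampled without replacement
keep <- sample(nrow(bike), size = round(0.1 * nrow(bike)))
bike_small <- bike[keep, ]

nrow(bike_small)  # 1000
```

If the sampled data frame were saved in place of the full one, the exercise solution could run unchanged on a tenth of the rows. With dplyr loaded, `sample_frac(bike, 0.1)` should give an equivalent sample.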

I'm going to close the issue here, but I believe you should take this up with either the instructor or @yashasroy, who seems to be responsible for this course.

Finally, some comments about your issue: it was a great first try, but I couldn't reproduce it, as the pre-exercise code refers to a data set (bike.RData) that is baked into the course image through requirements.r. I also didn't get a link to the course on GitHub, on Teach, or on campus. I managed to find it okay, but try to provide as many links as you can in the future. Thanks!

The question keeps timing out when I run the full solution #176