This is interesting. I think anything we can do to simplify the initial scoring of exercises would be a big help, even if we end up having to manually tweak things.
I've come up with a somewhat straightforward solution (more like a proof of concept):
Issues not addressed by this approach:
The complexity caused by the interaction of topics can be tricky to calculate, since some combinations of topics may indicate 'AND' and others 'OR'.
My idea was to calculate difficulty based on the measured complexity of the solutions. My issue about this from the Ruby track: https://github.com/exercism/ruby/issues/663
radon appears to be what you'd use in Python. How do those scores compare to your topic-based scores?
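For reference, here's a rough sketch of what scoring each example solution with radon could look like (the exercises/*/example.py layout and the exact API usage are assumptions on my part, so double-check against radon's docs):

```python
# Rough sketch (untested): score every example solution by its average
# cyclomatic complexity as reported by radon.
from pathlib import Path
from radon.complexity import cc_visit

for example in sorted(Path("exercises").glob("*/example.py")):
    blocks = cc_visit(example.read_text())  # one entry per function/class
    score = sum(b.complexity for b in blocks) / len(blocks) if blocks else 1
    print(f"{example.parent.name}: {score:.1f}")
```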
What to do if the overall score is above 10?
Normalize (all) the values so this isn't an issue.
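For example, a minimal min-max normalization into the 1-10 range could be enough (the raw scores below are made up):

```python
# Sketch: squeeze arbitrary raw topic sums into the 1-10 range.
def normalize(raw_scores):
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    span = (hi - lo) or 1  # all-equal scores would otherwise divide by zero
    return {name: 1 + round(9 * (raw - lo) / span)
            for name, raw in raw_scores.items()}

print(normalize({"leap": 3, "bob": 7, "forth": 23}))
# {'leap': 1, 'bob': 3, 'forth': 10}
```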
Thanks for the link.
My idea was to calculate difficulty based on the measured complexity of the solutions.
Good idea. The example solution should be optimized as much as possible, which is not always the case. But still, this idea is better than scoring by topics alone.
@Insti, how did you measure complexity?
Optimized for what? Reading? Speed? Idioms? Something else?
Optimized for what? Reading? Speed? Idioms? Something else?
It can be a controversial topic, but in this context it should be optimized in terms of speed. Anyway, we can measure the performance of existing example solutions to get a sense of what the difficulty score should be.
But I once read that the example solution should be idiomatic.
Most language idioms I'm aware of prefer readability over speed unless the code in question is identified as a bottleneck.
@Insti, how did you measure complexity?
I used a Ruby tool called flog, for which the main component of the score is cyclomatic complexity. Higher scores indicate more complicated solutions.
The example solution should be optimized as much as possible.
Why?
We're trying to put problems into 10 big (difficulty) buckets.
Whether an optimal solution scores 10 and an average solution scores 20 doesn't really matter as they'll end up in buckets close enough to where they should be.
Once they're there and it turns out that people think that a problem is easier or harder it can easily be moved.
It can be a controversial topic, but in this context it should be optimized in terms of speed.
That is usually the go-to definition of "optimized" for any solution: use less time and/or fewer resources than another solution. I would venture, given the audience and goals of Exercism, that our most optimal solutions should favor idiomaticity and clarity above all else.
When I try to figure out an exercise's difficulty for a track, I do it in a most-algorithmically-unsuitable manner. I look at:
Unsure how this would be done in an automated way: it is a lot of gut feeling. Perhaps the latter can be a diff thing. But I do think somehow mining user-submitted solutions could be helpful. I'd wager that if the last X submitted solutions to a problem average 100 LOC, it would be a more difficult exercise than one where the average was 5 LOC. Or even if there is a high degree of variance in LOC between submissions for the same exercise on the same track. This could be very language dependent.
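To illustrate the LOC idea, something like this could be a starting point (solutions_by_exercise is a placeholder for however we'd fetch the last X submissions):

```python
# Sketch: compare exercises by mean LOC and spread of user submissions.
from statistics import mean, pstdev

def loc(source):
    # Count non-blank lines; comments still count, which is debatable.
    return sum(1 for line in source.splitlines() if line.strip())

def loc_stats(solutions_by_exercise):
    stats = {}
    for exercise, sources in solutions_by_exercise.items():
        counts = [loc(src) for src in sources]
        stats[exercise] = {"mean_loc": mean(counts), "spread": pstdev(counts)}
    return stats
```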
Also @m-a-ge the TOPICS.txt and the actual topics used in exercises can be quite divergent. I've been trying to normalize that information across tracks but still have a ways to go.
Overall, this is a good avenue of approach you're taking; it has a lot of promise, especially for less-maintained tracks. Automate all the things! :+1:
Another option is to use user input the way ratings work on products on Amazon.
Getting user input in order to measure this is a useful idea, but it doesn't necessarily get around the problem of how you measure complexity - it just shifts that problem onto users. Given the variance in ability, knowledge and interpretation that may or may not be helpful. But I certainly like the idea of trying to be more data driven with these kinds of issues if it's feasible.
@jonmcalder As devs, we love to figure out an algorithm to solve a problem. But crowd sourcing information sometimes works better than algorithms.
Yes, variance is a problem. But then what we are trying to measure isn't necessarily objectively definable. So subjectivity might actually help, rather than harm.
I'm not saying that this is the way it should be done. Just that it's worth considering.
Ok cool - I think we're in agreement then, since I think it's worth considering.
FWIW, I'm a data scientist, not a dev, so I'm usually all for data-driven approaches where feasible. I'm just aware that the quality of the data determines what kind of results you can get out of it. And it may be hard to get meaningful feedback from users on exercise difficulty if we as maintainers can't even agree on what type of "difficulty measure" we want it to reflect.
Another factor is that even if feedback was collected, we wouldn't have it upfront so this wouldn't be useful for initial setup but only for refining difficulty scores later.
I would also comment that what is very difficult for one user might be an immediately recognized pattern for another. If we went with a crowd-sourced approach (which is an idea I think is worth exploring), maybe we should make sure people see the distribution of responses somehow, e.g. so they have an idea of how many people thought it was difficult, moderate, easy, etc.
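Showing that spread could be as simple as this (the votes are invented sample data):

```python
# Sketch: display the distribution of difficulty votes, not just a mean.
from collections import Counter

votes = ["easy", "easy", "moderate", "hard", "easy", "moderate"]
counts = Counter(votes)
for label in ("easy", "moderate", "hard"):
    print(f"{label:>8}: {counts[label] / len(votes):.0%}")
```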
OK. Here's how I'm thinking of exercise difficulty.
wrt Language features needed: Easy: loops, strings, conditionals. Hard: recursion, editing system classes, language-specific idioms or methods.
wrt Problem concept: Easy: school-level math, string manipulation. Hard: college-level math, needs domain understanding.
wrt Algorithm: Easy: find and transform. Hard: multiple levels of abstraction, functional programming (easy for languages that are already functional).
Since the objective of Exercism is to teach a language rather than algorithms and complicated problem solving, the problems here are easy with respect to both Problem Concept and Algorithm. This means we can think of exercise difficulty primarily from the point of view of the language features needed.
How can we measure it? By running something like Riiki over the sample solution to find out how many easy and how many hard language concepts are used.
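As a crude illustration, counting node types in the sample solution's AST might be all it takes (the easy/hard split and the node choices below are just placeholders for whatever we'd agree on):

```python
# Sketch: count "easy" vs "hard" constructs in a sample solution.
import ast

EASY_NODES = (ast.For, ast.While, ast.If)               # loops, conditionals
HARD_NODES = (ast.Lambda, ast.ClassDef, ast.ListComp)   # rough proxies only

def concept_counts(source):
    nodes = list(ast.walk(ast.parse(source)))
    easy = sum(isinstance(n, EASY_NODES) for n in nodes)
    hard = sum(isinstance(n, HARD_NODES) for n in nodes)
    return easy, hard
```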
Thoughts?
The example solution should be optimized as much as possible.
The example solution has no bearing on how a user may approach a problem. Sometimes as humans we make a problem harder without realizing it. That's how we learn :)
It's been a while since this discussion was active, but my gut sense is that our data is not good enough and the languages are different enough that an automated, data-driven solution is not going to be a particularly good approach for us.
That said, if anyone does manage to do this in their track, please share!
This issue is a follow-up for discussion in https://github.com/exercism/python/pull/523. This is closely related to #92, but I think @exercism/track-maintainers can compute the difficulty themselves instead for now.
I like the idea proposed by @lilislilit of summing up a difficulty score based on exercise topics. Topics themselves can be organized into tiers with respective scores. It's almost done in TOPICS.txt, so Base Concepts topics can have 1 as the difficulty score. This will help us choose topics carefully, since the difficulty will be a good indicator when too few or too many topics were chosen. What are your thoughts on this?
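To make it concrete, the whole scheme could be as small as this (the tier assignments here are invented rather than read from TOPICS.txt):

```python
# Sketch: sum per-topic tier scores, capped at 10.
TIER_SCORES = {
    "strings": 1,        # Base Concepts tier
    "conditionals": 1,
    "recursion": 3,
    "parsing": 4,
}

def difficulty(topics):
    return min(10, sum(TIER_SCORES.get(topic, 1) for topic in topics))

print(difficulty(["strings", "conditionals"]))           # 2
print(difficulty(["recursion", "parsing", "strings"]))   # 8
```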
Other refs: