This is interesting. I think anything we can do to simplify the initial scoring of exercises would be a big help, even if we end up having to manually tweak things.
I've come up with a somewhat straightforward solution (more like a proof of concept):
Issues not addressed by this approach:
The complexity caused by the interaction of topics can be tricky to calculate, since some combinations of topics may indicate 'AND' and others 'OR'.
My idea was to calculate difficulty based on the measured complexity of the solutions. My issue about this from the Ruby track: https://github.com/exercism/ruby/issues/663
radon appears to be what you'd use in Python. How do those scores compare to your topic-based scores?
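For reference, here's a rough sketch of what scoring each example solution with radon could look like (the exercises/*/example.py layout and the exact API usage are assumptions on my part, so double-check against radon's docs):

```python
# Rough sketch (untested): score every example solution by its average
# cyclomatic complexity as reported by radon.
from pathlib import Path
from radon.complexity import cc_visit

for example in sorted(Path("exercises").glob("*/example.py")):
    blocks = cc_visit(example.read_text())  # one entry per function/class
    score = sum(b.complexity for b in blocks) / len(blocks) if blocks else 1
    print(f"{example.parent.name}: {score:.1f}")
```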
What to do if the overall score is above 10?
Normalize (all) the values so this isn't an issue.
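For example, a minimal min-max normalization into the 1-10 range could be enough (the raw scores below are made up):

```python
# Sketch: squeeze arbitrary raw topic sums into the 1-10 range.
def normalize(raw_scores):
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    span = (hi - lo) or 1  # all-equal scores would otherwise divide by zero
    return {name: 1 + round(9 * (raw - lo) / span)
            for name, raw in raw_scores.items()}

print(normalize({"leap": 3, "bob": 7, "forth": 23}))
# {'leap': 1, 'bob': 3, 'forth': 10}
```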
Thanks for the link.
My idea was to calculate difficulty based on the measured complexity of the solutions.
Good idea. The example solution should be optimized as much as possible, which is not always the case. But still, this idea is better than scoring by topics alone.
@Insti, how did you measure complexity?
Optimized for what? Reading? Speed? Idioms? Something else?
Optimized for what? Reading? Speed? Idioms? Something else?
It can be a controversial topic, but in this context it should be optimized in terms of speed. Anyway, we can measure the performance of existing example solutions to get a sense of what the difficulty score should be.
But I once read that the example solution should be idiomatic.
Most language idioms I'm aware of prefer readability over speed unless the code in question is identified as a bottleneck.
@Insti, how did you measure complexity?
I used a Ruby tool called flog, for which the main component of the score is cyclomatic complexity. Higher scores indicate more complicated solutions.
The example solution should be optimized as much as possible.
Why?
We're trying to put problems into 10 big (difficulty) buckets.
Whether an optimal solution scores 10 and an average solution scores 20 doesn't really matter as they'll end up in buckets close enough to where they should be.
Once they're there and it turns out that people think that a problem is easier or harder it can easily be moved.
It can be a controversial topic, but in this context it should be optimized in terms of speed.
That is usually the go-to definition of "optimized" for any solution: use less time and/or fewer resources than another solution. I would venture, given the audience and goals of Exercism, that our most optimal solutions should favor idiomaticity and clarity above all else.
When I try to figure out an exercise's difficulty for a track, I do it in a most-algorithmically-unsuitable manner. I look at:
Unsure how this would be done in an automated way: it is a lot of gut feeling. Perhaps the latter can be a diff thing. But I do think somehow mining user-submitted solutions could be helpful. I'd wager that if the last X submitted solutions to a problem average 100 LOC, it would be a more difficult exercise than one where the average was 5 LOC. Or even if there is a high degree of variance in LOC between submissions for the same exercise on the same track. This could be very language dependent.
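To illustrate the LOC idea, something like this could be a starting point (solutions_by_exercise is a placeholder for however we'd fetch the last X submissions):

```python
# Sketch: compare exercises by mean LOC and spread of user submissions.
from statistics import mean, pstdev

def loc(source):
    # Count non-blank lines; comments still count, which is debatable.
    return sum(1 for line in source.splitlines() if line.strip())

def loc_stats(solutions_by_exercise):
    stats = {}
    for exercise, sources in solutions_by_exercise.items():
        counts = [loc(src) for src in sources]
        stats[exercise] = {"mean_loc": mean(counts), "spread": pstdev(counts)}
    return stats
```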
Also @m-a-ge the TOPICS.txt and the actual topics used in exercises can be quite divergent. I've been trying to normalize that information across tracks but still have a ways to go.
Overall, this is a good avenue of approach you're taking; it has a lot of promise, especially for less-maintained tracks. Automate all the things! :+1:
Another option is to use user input the way ratings work on products on Amazon.
Getting user input in order to measure this is a useful idea, but it doesn't necessarily get around the problem of how you measure complexity - it just shifts that problem onto users. Given the variance in ability, knowledge and interpretation that may or may not be helpful. But I certainly like the idea of trying to be more data driven with these kinds of issues if it's feasible.
@jonmcalder As devs, we love to figure out an algorithm to solve a problem. But crowd sourcing information sometimes works better than algorithms.
Yes, variance is a problem. But then what we are trying to measure isn't necessarily objectively definable. So subjectivity might actually help, rather than harm.
I'm not saying that this is the way it should be done. Just that it's worth considering.
Ok cool - I think we're in agreement then, since I think it's worth considering.
FWIW, I'm a data scientist, not a dev, so I'm usually all for data-driven approaches where feasible. I'm just aware that the quality of the data determines what kind of results you can get out of it. And it may be hard to get meaningful feedback from users on exercise difficulty if we as maintainers can't even agree on what type of "difficulty measure" we want it to reflect.
Another factor is that even if feedback was collected, we wouldn't have it upfront so this wouldn't be useful for initial setup but only for refining difficulty scores later.
I would also comment that what is very difficult for one user might be an immediately recognized pattern for another. If we went with a crowd-sourced approach (which is an idea I think is worth exploring), maybe we should make sure people see the distribution of responses somehow, e.g. so they have an idea of how many people thought it was difficult, moderate, easy, etc.
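Showing that spread could be as simple as this (the votes are invented sample data):

```python
# Sketch: display the distribution of difficulty votes, not just a mean.
from collections import Counter

votes = ["easy", "easy", "moderate", "hard", "easy", "moderate"]
counts = Counter(votes)
for label in ("easy", "moderate", "hard"):
    print(f"{label:>8}: {counts[label] / len(votes):.0%}")
```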
OK. Here's how I'm thinking of exercise difficulty.
wrt Language features needed: Easy: loops, strings, conditionals. Hard: recursion, editing system classes, language-specific idioms or methods.
wrt Problem concept: Easy: school-level math, string manipulation. Hard: college-level math, needs domain understanding.
wrt Algorithm: Easy: find and transform. Hard: multiple levels of abstraction, functional programming (easy for languages that are already functional).
Since the objective of Exercism is to teach a language rather than algorithms and complicated problem solving, the problems here are easy with respect to both Problem Concept and Algorithm. This means we can think of exercise difficulty primarily from the point of view of the language features needed.
How can we measure it? By running something like Riiki over the sample solution to find out how many easy and how many hard language concepts are used.
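As a crude illustration, counting node types in the sample solution's AST might be all it takes (the easy/hard split and the node choices below are just placeholders for whatever we'd agree on):

```python
# Sketch: count "easy" vs "hard" constructs in a sample solution.
import ast

EASY_NODES = (ast.For, ast.While, ast.If)               # loops, conditionals
HARD_NODES = (ast.Lambda, ast.ClassDef, ast.ListComp)   # rough proxies only

def concept_counts(source):
    nodes = list(ast.walk(ast.parse(source)))
    easy = sum(isinstance(n, EASY_NODES) for n in nodes)
    hard = sum(isinstance(n, HARD_NODES) for n in nodes)
    return easy, hard
```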
Thoughts?
The example solution should be optimized as much as possible.
The example solution has no bearing on how a user may approach a problem. Sometimes as humans we make a problem harder without realizing it. That's how we learn :)
It's been a while since this discussion was active, but my gut sense is that our data is not good enough and the languages are different enough that an automated, data-driven solution is not going to be a particularly good approach for us.
That said, if anyone does manage to do this in their track, please share!
This issue is a follow-up for discussion in https://github.com/exercism/python/pull/523. This is closely related to #92, but I think @exercism/track-maintainers can compute the difficulty themselves instead for now.
I like the idea proposed by @lilislilit of summing up a difficulty score based on exercise topics. Topics themselves can be organized into tiers with respective scores. It's almost done in TOPICS.txt, so Base Concepts topics can have 1 as the difficulty score. This will help us choose topics carefully, since the difficulty will be a good indicator when too few or too many topics were chosen. What are your thoughts on this?
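To make it concrete, the whole scheme could be as small as this (the tier assignments here are invented rather than read from TOPICS.txt):

```python
# Sketch: sum per-topic tier scores, capped at 10.
TIER_SCORES = {
    "strings": 1,        # Base Concepts tier
    "conditionals": 1,
    "recursion": 3,
    "parsing": 4,
}

def difficulty(topics):
    return min(10, sum(TIER_SCORES.get(topic, 1) for topic in topics))

print(difficulty(["strings", "conditionals"]))           # 2
print(difficulty(["recursion", "parsing", "strings"]))   # 8
```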
Other refs: