Matchmaking system - GitHub Issues

SteveDesmond-ca commented 3 years ago

As a runner, I want to know which other runners are approximately the same speed as me on a given course, so that I can see if they want to run together.

Acceptance criteria:

athlete page contains a "Similar Athletes" link for each course
"Similar Athletes" page for a given athlete and course lists any runner (regardless of gender) who has a fastest and average time within a certain threshold (try 10% to start?)
no contact details are provided in the app, users should be directed (via external announcement) to use the forum for communicating with other runners
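A minimal sketch of the threshold check described in the acceptance criteria, assuming times are stored in seconds per athlete for one course. The names `within_threshold` and `similar_athletes`, the data layout, and the sample times are all hypothetical and only illustrate the "fastest and average within 10%" rule.

```python
def within_threshold(mine: float, theirs: float, threshold: float = 0.10) -> bool:
    """True if `theirs` differs from `mine` by at most `threshold` (a fraction)."""
    return abs(theirs - mine) / mine <= threshold


def similar_athletes(me: dict, others: dict, threshold: float = 0.10) -> list:
    """Names of athletes whose fastest AND average times are both within the threshold."""
    return [
        name
        for name, times in others.items()
        if within_threshold(me["fastest"], times["fastest"], threshold)
        and within_threshold(me["average"], times["average"], threshold)
    ]


# Made-up times in seconds for a single course.
me = {"fastest": 1200.0, "average": 1300.0}
others = {
    "runner_a": {"fastest": 1250.0, "average": 1400.0},
    "runner_b": {"fastest": 1400.0, "average": 1450.0},
    "runner_c": {"fastest": 1100.0, "average": 1180.0},
}

print(similar_athletes(me, others))
```

Gender is deliberately not part of the filter, matching the "regardless of gender" criterion.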

adamengst commented 3 years ago

Is it too granular to break it up by course?

SteveDesmond-ca commented 3 years ago

I guess it depends what we're going for: best matches for each course ("I'm going to run [course X] this week, who should I run with?"), vs overall matches based on averaging the differences per course ("who might be a good partner for all my runs?")...those lists might end up pretty different.

adamengst commented 3 years ago

I guess that's the question—would it be that different? Perhaps it's just my data-free blinders, but it feels to me like someone who would be similar to me on the East Hill Rec Way would also probably be similar on Thom B.

From a user experience point of view, it might also be too detailed. @scottpdawson, do you have any thoughts about this idea (which Steve and I discussed on a Thom B. run last weekend)?

scottpdawson commented 3 years ago

I love that you’re thinking about something like this! I would want it on my own athlete page, not a special page. For each course, my “performances” might vary quite a bit, but it’d be cool to know whose average is close to my average pace on that course. Too prohibitive to calculate that? Overall, at the bottom, it would be very cool to have a section that shows similar age-graded times, irrespective of courses. Would that cross age groups? I think it should, ideally, since I might have fun training with someone in a different age group simply because our times are similar. I know that’s how some FLRC people roll: running w/ people in similar pace groups, not necessarily similar age groups.

adamengst commented 3 years ago

Oh yeah, speaking as a 53-year-old, I couldn't care less how old someone is if they're my pace. Though it would be useful to know that information since it will inform the conversation. :-)

I'm not sure age-graded times would be helpful here, since the point of age grading is to level out age and gender. So a 60-year-old woman might have a very similar age grade to me, but she'd be running much more slowly overall.

My main goal here is to use our data to make it clear who are compatible running partners so those who don't know a lot of people can see where they might fit in.

adamengst commented 3 years ago

I've been longing for this a little more lately as I run across people who I should be inviting for weekend runs, for instance, but who I don't know well and tend to forget each time.

SteveDesmond-ca commented 3 years ago

I just dark launched this, your list is available at https://challenge.fingerlakesrunners.org/Athlete/Similar/6728 and you can view anyone else's by going to their athlete info page and changing Index to Similar in the URL (e.g. https://challenge.fingerlakesrunners.org/Athlete/Index/115406 becomes https://challenge.fingerlakesrunners.org/Athlete/Similar/115406 for mine)

scottpdawson commented 3 years ago

Wow! This is really fascinating. Great job, Steve!

adamengst commented 3 years ago

Looks cool, indeed! What does it mean when it says "no data" for average pace?

And does the confidence go up more with the number of courses run than anything else?

I think this should default to sorting by similarity, not confidence, since it's immediately weird to get a confidently wrong result at the top. :-)

But when I sort that way, the data looks pretty good. More when I have time to look in detail.

SteveDesmond-ca commented 3 years ago

(no data) means there are no courses in common that you both have run enough times to get a "best average" result
"confidence" is essentially the percentage of courses (where "fastest" and "best average" are tracked separately, for a potential maximum of 20) that you have in common...would "overlap" be a better term for this?
the rank (and default sort) is a combination of unweighted (raw "similarity") and weighted ("similarity" times "confidence") so that if someone has run one course once and finishes with a very similar time as you, they don't show up at the very top of the list, as is the case on mine...right now the ranking ratio is 2:1 unweighted:weighted, but my guess is 3:1 or 4:1 is probably better -- I'll take some screenshots of the various combinations for each of us and we can figure out where the sweet spot is
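One way to read the ranking description above is as a weighted mean of the raw similarity and the confidence-weighted similarity. The sketch below assumes exactly that; the function name `rank_score`, the 0-to-1 scaling, and the sample values are hypothetical, and the real combination in the app may differ.

```python
def rank_score(similarity: float, confidence: float, ratio: float = 2.0) -> float:
    """Blend raw and confidence-weighted similarity (both on a 0..1 scale).

    `ratio` is the unweighted:weighted ratio, e.g. 2.0 for 2:1.
    This particular blending formula is an assumption, not the app's code.
    """
    weighted = similarity * confidence
    return (ratio * similarity + weighted) / (ratio + 1.0)


# A one-off runner with a near-identical time but very little overlap...
one_off = rank_score(similarity=0.99, confidence=0.05)
# ...versus a regular with slightly lower similarity and lots of overlap.
regular = rank_score(similarity=0.95, confidence=0.80)

print(round(one_off, 3), round(regular, 3))
```

Raising `ratio` toward 3:1 or 4:1 gives raw similarity more say, which narrows the gap between the two cases without removing it.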

adamengst commented 3 years ago

@scottpdawson, there have been some background updates so take a look now. Overall, this is looking really good, and when I spin through different people, I get the results that I more or less expect. A few responses:

Let's change Confidence to Overlap. I think that clarifies that it's running the same courses better.
Let's try some light shading based on how much faster or slower someone is, as you suggested, Steve.
What do you think about using the term "Pace Partners" for links to it? There's also "Pace Pals" but that implies more, I think.
How should it be linked in? Probably from the Athlete Page, perhaps just with a button?

I'll write something up to explain it before we take it fully live.

scottpdawson commented 3 years ago

Sweet. This takes all courses into consideration, @stevedesmond-ca? Can we offer a way to narrow by course? Some courses I've tried my damnedest on, but some I have not. I imagine that might be the case for others, so a truer metric would come from looking at my Black Diamond only, for example. Could be interesting. Regardless, this is fantastic. And yes @adamengst a button from Athlete page would do nicely.

SteveDesmond-ca commented 3 years ago

It only considers courses you (or the runner you're looking at) have run, and does take "preferred" courses into account inasmuch as if you've run enough on a course to get a "best average" time, that course essentially counts for double compared to courses you haven't run much on yet.

The original consensus seemed like per-course matches weren't as valuable initially, partly since they're easier to discern from just looking around you on the course results lists, but I can see this subsequently expanding to per-course, for a more athlete-centric view of things.

adamengst commented 3 years ago

We don't seem to have a particularly technically demanding audience, but if we launch the basic Pace Partners concept and people go "Whoa, this is great—now can it tell me who I should run with for my half marathon specifically?", it would seem pretty easily extended.

scottpdawson commented 3 years ago

This all sounds great!

SteveDesmond-ca commented 3 years ago

I've added some gradient/strength-based shading to the table, based on how much faster/slower someone is. It's a little more subtle than I'd like, but given the wide range of numbers possible and my lack of desire to find a logarithmic function (it's currently linear), I wanted to make sure the contrast ratio was always accessible, particularly the blue link on a blue background.
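A rough sketch of what linear shading like this could look like, assuming the percentage difference is clamped and mapped to a low background opacity so the link text stays readable. The cap of 10%, the maximum alpha of 0.3, the colors, and the function names are all hypothetical.

```python
def shade_alpha(percent_diff: float, max_diff: float = 10.0, max_alpha: float = 0.3) -> float:
    """Linearly map |percent_diff| onto 0..max_alpha, clamping at max_diff."""
    strength = min(abs(percent_diff), max_diff) / max_diff
    return round(strength * max_alpha, 3)


def shade_css(percent_diff: float) -> str:
    """CSS background color: blue when faster (negative diff), red when slower (assumed colors)."""
    red, green, blue = (0, 0, 255) if percent_diff < 0 else (255, 0, 0)
    return f"rgba({red}, {green}, {blue}, {shade_alpha(percent_diff)})"


print(shade_css(-5.0))   # someone 5% faster
print(shade_css(25.0))   # someone 25% slower, clamped to the maximum shade
```

Keeping `max_alpha` low is what keeps the shading subtle; a logarithmic mapping would replace only the `strength` line.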

If we think this is good enough for an initial public launch, I can add the button to the athlete info page and we can announce it in this week's recap. On that note, I know it's a small and fairly tight-knit community, but it might be good to include a reminder of the code of conduct (pretty sure I saw that one exists somewhere) given that an expected outcome of this feature is athletes contacting each other.

adamengst commented 3 years ago

Let's do a softer launch next week—say on Tuesday—so I can write up an explanation of it beforehand and then we can announce more widely in the following recap. I do need to think a little about the code of conduct stuff—I think what you may have seen was the Discourse boilerplate stuff. We do have an anti-harassment policy that's part of our diversity policy.

https://fingerlakesrunners.org/diversity-statement/

Part of my hesitation with today is just that I'm still fighting this babesiosis nonsense, and running a few miles today has knocked the stuffing out of me. As the fatigue sets in, things suddenly seem too hard. :-)

adamengst commented 3 years ago

Let's launch this Tuesday of next week—I lost all day yesterday to this damn babesiosis (spent all day in bed) and that will give me time to write up the description.

SteveDesmond-ca commented 3 years ago

:+1: I've got the button on the athlete info page ready to deploy on-demand!

adamengst commented 3 years ago

As I'm writing this up and looking at the data, I'm wondering if Average Pace should be a true raw average, rather than be restricted to the number of runs necessary to compete in the Best Average for a course. Restricting it makes sense when we don't want to let someone win an award with just one run making up their average, but when we're calculating similarity, it would seem like taking all the runs into account would be better than just punting and displaying "no data." Someone's average might be dragged down a bit by some slow runs, but that seems like a lesser evil.

That might help explain one seeming anomaly. Caitlin Loehr and Pete Kresock would seem to be very similar. They've both run all the courses, so should have a high degree of overlap. But Pete is only 16th for Caitlin, and Caitlin is 6th for Pete.

https://challenge.fingerlakesrunners.org/Athlete/Similar/115318 https://challenge.fingerlakesrunners.org/Athlete/Similar/24017

More confusing, on Caitlin's page, Pete's Fastest Pace is faster, but his Average Pace is slower, whereas on Pete's page, both Caitlin's Fastest Pace and Average Pace are slower. And they have different Similarity and Overlap numbers too, which doesn't quite make sense.

Or maybe there's more going on here...

adamengst commented 3 years ago

I think switching to a raw Average Pace would also significantly increase Overlap, which might result in some more accurate rankings too. For instance, Casey Carlstrom and Keith Eggleston are both high on my list of Similarity (in part because we did several of the runs together), but even though they've both run a subset of the courses I have, they have low Overlap numbers. (And for what it's worth, historically, both should be quite similar to me.)

https://challenge.fingerlakesrunners.org/Athlete/Similar/6728

SteveDesmond-ca commented 3 years ago

Part of the issue is that there's a lot of data and math going into this, and only so much we can display before it becomes information overload for the user (it's already on the verge of that, I think).

The main difference between the "pace" fields and the "similarity" is that the latter is an absolute difference, whereas the former is just an average. Here's an example of why that's an important distinction, using just the 3 shortest courses for simplicity:

You run a 6:00 mile together with A and B, so you all have the exact same time; you then run 15:00 at the arboretum and 20:00 at Stewart Park, A runs 16:30 and 18:00 respectively (they hate hills) and B runs 15:45 and 21:00.

Course	You	A	B
Mile	6:00	6:00	6:00
Arboretum	15:00	16:30	15:45
Waterfront	20:00	18:00	21:00

A is going to end up with a "Fastest Pace" exactly the same as yours (0.0% faster/slower) because their arboretum (10% slower) and Stewart Park (10% faster) results cancel/average each other out, but the absolute differences average out to a similarity of about 93% (100 - (0 + 10 + 10)/3).

B will have a "Fastest Pace" of 3.3% slower (average of 0 + 5% slower + 5% slower) but a similarity of 97% because the absolute difference of their pace is much closer to yours.
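The arithmetic in this example can be checked with a short script. It uses the three times from the table (converted to seconds) and assumes the "pace" figure is the mean signed percentage difference while "similarity" is 100 minus the mean absolute percentage difference; the function names are hypothetical.

```python
def percent_diffs(mine: dict, theirs: dict) -> list:
    """Signed % difference per shared course (positive means `theirs` is slower)."""
    return [100 * (theirs[c] - mine[c]) / mine[c] for c in mine if c in theirs]


def pace_diff(mine: dict, theirs: dict) -> float:
    """Mean signed difference: faster and slower courses can cancel out."""
    diffs = percent_diffs(mine, theirs)
    return sum(diffs) / len(diffs)


def similarity(mine: dict, theirs: dict) -> float:
    """100 minus the mean absolute difference: nothing cancels out."""
    diffs = percent_diffs(mine, theirs)
    return 100 - sum(abs(d) for d in diffs) / len(diffs)


# Times from the table above, in seconds.
you = {"Mile": 360, "Arboretum": 900, "Waterfront": 1200}
a = {"Mile": 360, "Arboretum": 990, "Waterfront": 1080}
b = {"Mile": 360, "Arboretum": 945, "Waterfront": 1260}

for name, runner in (("A", a), ("B", b)):
    print(name, round(pace_diff(you, runner), 1), round(similarity(you, runner), 1))
```

A comes out at 0.0% pace difference with 93.3% similarity, and B at 3.3% slower with 96.7% similarity, which matches the rounded figures in the explanation.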

Similarly, for someone like me who dries out like a slug in the sun on courses like Black Diamond or South Hill when it's anywhere hotter than 70 degrees, but thrives on the rolling downhills of Danby and Frolic, that's going to come through in the absolute difference in "Similarity" that a straight average across all courses isn't going to show as clearly. In a way, this is kind of like an "extra light" version of what @jeanlucj is working on.

Looking at Caitlin and Pete, as you mentioned, is a good example because that's where a lot of the details come out. The discrepancy in their relative rankings on each other's lists does initially raise some eyebrows, but seeing the differences in their data can explain the reasoning behind it.

Since Caitlin is slightly slower overall, her list is going to pick up people within the 5% threshold that Pete's does not (and vice versa, depending on the specific course). If those people have much closer times to Caitlin than they do to Pete (e.g. they're very slightly slower than Caitlin overall), they're going to show up higher in her list.

This is where our determination of how important "Overlap" is comes into play. Caitlin's list has several people that have only run a couple courses in common, but have run closer to her pace than Pete, even though Pete has more evidence of similarity (that's kind of what "overlap" is, right? "evidence of similarity"?). Eric Sambolec is a great example of that: he's run CBG once in 17:10, to Caitlin's 17:04...is that enough information to rank him as her 6th place currently? We've backed the "weighted similarity" (the one that considers "overlap") down to 10% of the overall ranking, but bumping it back up would increase Pete's ranking on her list while moving all those 10-20% overlap ones down.

Here's a more human-readable version of the "overlap" algorithm, with you and Casey as an example:

you've run 7 courses, 4 of which enough times to get a "best average" time, for a total of 11 possible "overlap points" (or "potential points of overlap")
of those 11 points, Casey has 4 of the same ones
4 / 11 = 36%

This also should help explain why you're his 1st match with 100% overlap: of his 4 available "overlap points", you have all 4: 4/4 = 100%. Based on the data available, you're his best match but he's not yours. You both have the same similarity value for each other, but different rankings, because there's more information to determine other matches for you, but not (yet) for him.
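The asymmetry described here falls out naturally if each athlete's "overlap points" are treated as a set. This is only a sketch: the course labels `c1` to `c7`, the choice of which four points Casey holds, and the `overlap` function are placeholders, not the actual data or code.

```python
def overlap(mine: set, theirs: set) -> float:
    """Fraction of MY overlap points that the other athlete also has."""
    return len(mine & theirs) / len(mine) if mine else 0.0


courses = [f"c{i}" for i in range(1, 8)]  # 7 placeholder courses

# 7 "fastest" points, plus "best average" points on 4 of those courses = 11 points.
adam = {(c, "fastest") for c in courses} | {(c, "average") for c in courses[:4]}

# Casey has 4 points in total, all of which Adam also has (made-up selection).
casey = {("c1", "fastest"), ("c2", "fastest"), ("c3", "fastest"), ("c1", "average")}

print(f"Casey on Adam's list: {overlap(adam, casey):.0%}")
print(f"Adam on Casey's list: {overlap(casey, adam):.0%}")
```

Because the denominator is always the viewer's own point count, the same pair of athletes can show 36% in one direction and 100% in the other.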

SteveDesmond-ca commented 3 years ago

As we continue to evolve this, it may also be worth talking to @dougturnbull, who specializes in recommendation systems.

adamengst commented 3 years ago

OK, I understand what's going on better now, but to get back to my original question, does limiting Average Pace calculations to people who have hit the magic average number for a particular course make sense in this context?

Cool that Doug does this sort of thing—I had no idea what he did beyond triathlons. :-)

SteveDesmond-ca commented 3 years ago

I think it's slightly muddied because we've told people there's no penalty for running a course slowly, that only the top X times will be considered for averaging, but now there is a penalty in that match quality will be affected by those slower runs -- e.g. I walked the Waterfront Trail course last month as a leisurely stroll with kids in tow, stopping to watch the firefighter training and ospreys fishing. So my average for Waterfront Trail right now is 31:05...I personally wouldn't want that throwing off my entire match dataset, as I was under the impression that the time would be ignored but I'd still get credit for the mileage, and such a time isn't really indicative of what a small group run with me would be like.

Plain averages are bad like that, but without doing some more statistics heavy lifting with standard deviations (no thank you :sweat_smile:) I think keeping "best average" as the secondary contributor to "similarity" produces the best data quality for the least effort, since that metric is already readily available. I definitely think that as participants run more and we get more data (both quantitative numbers and qualitative feedback) we can tweak things to try to provide everyone with better matches!
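To make the trade-off concrete, here is a small comparison of a "best average" over the top few runs against a raw average over every run. The number of runs required (`top_n=3`), the function names, and the sample times are assumptions for illustration only; the thread only refers to the real threshold as "X".

```python
from typing import Optional, Sequence


def best_average(times: Sequence[float], top_n: int = 3) -> Optional[float]:
    """Average of the `top_n` fastest runs, or None ("no data") if there are too few."""
    if len(times) < top_n:
        return None
    return sum(sorted(times)[:top_n]) / top_n


def raw_average(times: Sequence[float]) -> Optional[float]:
    """Average of every run, slow strolls included."""
    return sum(times) / len(times) if times else None


# Made-up times in seconds: three solid efforts and one leisurely walk.
times = [1200, 1220, 1250, 1865]

print(round(best_average(times), 1))  # ignores the walk
print(raw_average(times))             # pulled upward by the walk
print(best_average(times[:1]))        # too few runs to qualify
```

The raw average always produces a number, which would raise overlap, but a single slow outing shifts it noticeably; the best average ignores that outing at the cost of sometimes showing "no data".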

adamengst commented 3 years ago

OK, that's fair—I wasn't thinking about the range being that great. Now if only we could persuade Doug Turnbull to enter the Challenge and then get him interested in all the data we have here... :-)

dougturnbull commented 3 years ago

Hi Steve and Adam,

(I apologize in advance if I am replying-all on this thread.)

The dynamic leaderboard is very cool. Nice work. Normalizing and imputing data can be confusing and adds a source of potential bias to the results. I'm happy to talk shop if you need to bounce ideas.

Given Steve's concern about "slow" runs, I suspect that the data will be skewed-right (i.e., Poisson distributed https://en.wikipedia.org/wiki/Poisson_distribution) for an individual since it is hard to shave a few seconds off a PR but it is easy to go out for a slow training run. So a runner's median time on a course might be a better (and faster) summary statistic than some sort of mean time. But in terms of ease of interpretation, just retaining the fastest single time is probably the way to go.
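A quick illustration of the point about skewed data, using Python's standard `statistics` module on made-up times (in seconds) where one slow outing sits far from the rest:

```python
import statistics

# Made-up times for one runner on one course: four similar efforts and one slow run.
times = [1200, 1210, 1225, 1240, 1865]

fastest = min(times)
median = statistics.median(times)
mean = statistics.mean(times)

print(fastest, median, mean)
```

With these numbers the median stays within the cluster of typical efforts, while the mean is pulled well above every run except the slow one.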

For me, I tend to be pretty low-tech when it comes to running. The less I have to do in terms of keeping track of data, the better!

All the best, Doug

--

Douglas Turnbull Associate Professor Computer Science Ithaca College https://dougturnbull.org/ http://jimi.ithaca.edu/%7Edturnbull

adamengst commented 3 years ago

Thanks for the detail, Doug! What you say makes sense, and I'll let Steve ponder it for future tweaks. :-)

SteveDesmond-ca commented 3 years ago

This seems to be working well enough, any future enhancements can be created as new issues :rocket:

FingerLakesRunnersClub / Leaderboards

Matchmaking system #70