As expected, a basic ratio of total events per unique actor does not sufficiently indicate a project's true participation level. The calculation needs to be updated in two ways:
Factor in Event Types
Use modes instead of totals
Factor in Event Types
I recommend analyzing the event type distribution of the samples. What is the distribution of event types for projects that had false high or medium participation rates? How does that compare to the distribution of event types for projects that had true high or medium participation rates?
I have two ideas for this. One is to weight the count of events using fractions, since that's the easiest to conceptualize. Create events might be worth 0.25, Watch events 0.5, and Issue events a full 1.
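A minimal sketch of the weighting idea, assuming events arrive as (actor, event type) pairs and use the GitHub event type names `CreateEvent`, `WatchEvent`, and `IssuesEvent`. The weights are just the example figures above, and the function names and the fallback weight for unlisted types are my own assumptions, not a settled design.

```python
from collections import Counter

# Example weights from the proposal above. DEFAULT_WEIGHT (used for any
# event type not listed) is an assumption made for this sketch.
EVENT_WEIGHTS = {"CreateEvent": 0.25, "WatchEvent": 0.5, "IssuesEvent": 1.0}
DEFAULT_WEIGHT = 1.0


def weighted_event_count(event_types):
    """Sum the events, scaling each one by the weight of its type."""
    counts = Counter(event_types)
    return sum(EVENT_WEIGHTS.get(t, DEFAULT_WEIGHT) * n for t, n in counts.items())


def weighted_participation(events):
    """Weighted events per distinct actor.

    events: iterable of (actor, event_type) pairs for one project.
    """
    events = list(events)
    actors = {actor for actor, _ in events}
    if not actors:
        return 0.0
    return weighted_event_count(t for _, t in events) / len(actors)
```

With this shape, a project whose activity is mostly Watch and Create events scores lower than one with the same raw event count made up of Issue events.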
The other idea is to create ratios between the event types. For instance, if the samples of false high/med participation projects commonly show a high number of watch and fork events and a low number of issues, we could use the ratio of watch plus fork events to issue events.
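The ratio idea could be sketched like this, again assuming GitHub's `WatchEvent`, `ForkEvent`, and `IssuesEvent` type names. The function name and the handling of projects with zero issue events are assumptions for illustration.

```python
from collections import Counter


def passive_to_issue_ratio(event_types):
    """Ratio of (watch + fork) events to issue events for one project.

    A high value would suggest lots of passive interest but little
    hands-on participation.
    """
    counts = Counter(event_types)
    passive = counts["WatchEvent"] + counts["ForkEvent"]
    issues = counts["IssuesEvent"]
    if issues == 0:
        # Assumption: no issues at all is treated as "infinitely passive"
        # when there is any watch/fork activity, and as 0.0 otherwise.
        return float("inf") if passive else 0.0
    return passive / issues
```

Whatever threshold separates false from true high/med projects would have to come from the samples.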
I would try both of these on the samples to see whether they can effectively distinguish false positives from true positives.
Use modes instead of totals
Recalculate total distinct actors, total events, and total events per type per month. Use a combination of modes and totals to see which ratio gives us the best indication of a true high/medium participation project. If there is a lot of variation, we may need to round or use a ratio to be able to use the mode effectively. Total actors over the 6 month period might be fine as a total, but number of events might work better as a mode. For instance, the participation rate could then be calculated as (mode of total events per month/total distinct actors)/mode of total events per month. The point is to play around with the calculations and see what gives us the best results for determining true high/med participation projects.
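To make the mode idea concrete, here is a rough sketch that rounds monthly event totals into buckets before taking the mode, then divides by total distinct actors. This is one candidate variation to try, not the formula quoted above. The (month, actor) input shape, the bucket size of 10, and the function names are all assumptions.

```python
from collections import Counter
from statistics import multimode


def rounded_mode(values, bucket=10):
    """Round each value to the nearest multiple of `bucket`, then take the mode.

    Rounding first lets the mode work when monthly totals vary a little.
    Ties are broken by taking the smallest mode. Note that Python's
    round() rounds exact halves to the nearest even number.
    """
    rounded = [bucket * round(v / bucket) for v in values]
    if not rounded:
        return 0
    return min(multimode(rounded))


def mode_based_rate(events, bucket=10):
    """Mode of monthly event totals divided by total distinct actors.

    events: iterable of (month, actor) pairs, e.g. ("2015-03", "alice").
    Months with no events at all do not appear in the input, so they
    are not counted as zero-event months here.
    """
    events = list(events)
    actors = {actor for _, actor in events}
    if not actors:
        return 0.0
    monthly_totals = Counter(month for month, _ in events)
    return rounded_mode(monthly_totals.values(), bucket) / len(actors)
```

For monthly totals like 42, 38, 41, 120, 44, 39, the rounded mode is 40, so a single spike month has much less pull than it would on a plain total.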
TODO
(this may need updating)
[ ] Repo Samples report that pulls GH data for the repos in ^^ to test participation rate accuracy
[ ] Participation rate re-proposal that refactors this metric and pulls new samples to compare with ^^
Ended up using a rounded natural log to compute participation rate. I'm going to close this for now because at this point we want something more qualitative than "high/med/low".
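For reference, a rounded natural log can be computed as below. The comment above doesn't say what the log was applied to, so this sketch assumes total events per distinct actor. The function name and the zero-input handling are also assumptions.

```python
import math


def log_participation_score(total_events, distinct_actors):
    """Rounded natural log of events per distinct actor.

    Assumption: the log is taken over the events-per-actor ratio.
    Returns 0 when either input is not positive, since the log is
    undefined there.
    """
    if total_events <= 0 or distinct_actors <= 0:
        return 0
    return round(math.log(total_events / distinct_actors))
```

The log compresses large ratios, so a project with 20 events per actor scores 3 rather than landing in a coarse "high" bucket.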