adrian-pace / FROG-analytics

Metrics and visualizations on the behaviour of users in various online editors

Find a good metric #2

Closed: adrian-pace closed this issue 6 years ago

adrian-pace commented 6 years ago

It should encapsulate what we display on the graphs in a single value.

lbaligand commented 6 years ago

After going through the first 150 pads, here are some interesting ones (screenshots attached: good_start, good_start2). In this pad, we see good collaboration in the first few lines (the users introduce themselves), and then only one person keeps writing.

Two examples of bad collaboration (pads 640911640911640911640911640911640911 and 753268753268753268753268753268753268); screenshots and author-count plots attached.

One example of good collaboration (pad 133522133522133522133522133522133522); screenshot attached.

We should take these observations into consideration in the metrics.

lbaligand commented 6 years ago

Let $n$ be the number of authors different from 'Etherpad_admin', and $p_i$ the proportion written by author $i$. The proportion score should lie between 0 and 1 and be maximized when the $p_i$ are balanced. Two ways to compute it would be $\prod_{i=1}^{n} p_i$ or $\sum_{i=1}^{n} p_i \log_n(1/p_i)$ (the normalized entropy).

For the synchrony score we can simply take the proportion of synchronized writing in the pad.

For the breaks we should compute the overall length of the pad and derive a break proportion. One baseline would be to divide the number of short breaks, multiplied by 100 times the average length of a word (around 7 characters), by the overall length of the pad, and to do the same for long breaks but with a multiplier of 1000.

We still need to find a way to evaluate whether the paragraph authors are alternating.

We can finally take a weighted average of all these scores according to their importance.
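
A minimal sketch of the proportion scores and the weighted combination, assuming each pad is summarized by the number of characters each non-admin author wrote; the function names are illustrative, not the repository's actual API:

```python
import numpy as np

def prop_score_entropy(char_counts):
    """Proportion score via normalized entropy: 1 when the authors contribute
    equally, close to 0 when a single author dominates. `char_counts` holds the
    number of characters written by each non-admin author."""
    counts = np.asarray(char_counts, dtype=float)
    n = len(counts)
    if n < 2:
        return 0.0  # a single author: nothing to balance
    p = counts / counts.sum()
    p = np.clip(p, 1e-6, 1.0)  # floor to avoid log(0), cf. the 0.000001 trick later in the thread
    # sum_i p_i * log_n(1/p_i)  ==  -sum_i p_i * ln(p_i) / ln(n)
    return float(-(p * np.log(p)).sum() / np.log(n))

def prop_score_product(char_counts):
    """Alternative proportion score: the product of the proportions,
    maximal (n**-n) when the contributions are perfectly balanced."""
    counts = np.asarray(char_counts, dtype=float)
    p = counts / counts.sum()
    return float(np.prod(p))

def overall_score(scores, weights):
    """Weighted average of the individual scores (proportion, synchrony,
    breaks, alternation, ...) according to their importance."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((scores * weights).sum() / weights.sum())
```

For example, `prop_score_entropy([500, 500])` gives 1.0, while `prop_score_entropy([990, 10])` gives roughly 0.08.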

adrian-pace commented 6 years ago

Good idea for the entropy; also divide by log(n) to normalize it.
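
For reference, this is just a change of base: dividing the natural-log entropy by $\ln n$ is the same as using base-$n$ logarithms, so the balanced case gives exactly 1:

$$\frac{1}{\ln n}\sum_{i=1}^{n} p_i \ln\frac{1}{p_i} \;=\; \sum_{i=1}^{n} p_i \log_n\frac{1}{p_i}, \qquad \text{which equals } 1 \text{ when } p_i = \tfrac{1}{n}.$$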

adrian-pace commented 6 years ago

To check if they alternate between paragraphs: go over each paragraph in order and check whether the paragraph just after it has a different author. If yes, increment a counter; otherwise do nothing. Then divide by n_paragraphs + 1 to normalize (see the sketch below).
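
A sketch of that alternation check, assuming a pad is reduced to the ordered list of its paragraph authors; the normalization follows the description above:

```python
def alternation_score(paragraph_authors):
    """Fraction of consecutive paragraphs whose authors differ.
    `paragraph_authors` is the ordered list of each paragraph's author."""
    if len(paragraph_authors) < 2:
        return 0.0
    changes = sum(
        prev != nxt
        for prev, nxt in zip(paragraph_authors, paragraph_authors[1:])
    )
    # Normalization as described above; dividing by len(paragraph_authors) - 1
    # (the number of adjacent pairs) would instead bound the score exactly by 1.
    return changes / (len(paragraph_authors) + 1)
```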

adrian-pace commented 6 years ago

Do a similar proportion system per pad but with paragraphs: compute the entropy of each paragraph, sum them all, and divide by the number of paragraphs.
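
A sketch of this per-paragraph variant, reusing the `prop_score_entropy` helper from the earlier sketch and assuming each paragraph is summarized as a mapping from author to characters written:

```python
def paragraph_proportion_score(paragraphs):
    """Average of the per-paragraph normalized entropies.
    `paragraphs` is a list where each element maps a non-admin author to the
    number of characters that author wrote in that paragraph."""
    if not paragraphs:
        return 0.0
    per_paragraph = [prop_score_entropy(list(p.values())) for p in paragraphs]
    return sum(per_paragraph) / len(per_paragraph)
```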

adrian-pace commented 6 years ago

What's the situation, mister @pykcel?

Is it implemented? If it isn't, I can start writing it tomorrow.

adrian-pace commented 6 years ago

(screenshot attached: scores)

lbaligand commented 6 years ago

For the entropy computations, we often have proportions equal to zero in the paragraphs, so, as suggested in this book: https://books.google.ch/books?id=FdsUBQAAQBAJ&pg=PA55&lpg=PA55&dq=entropy+with+zero+proportions&source=bl&ots=xiWp7GCDBY&sig=c5bnNoGMmh-C72qWJfslPDZylus&hl=fr&sa=X&ved=0ahUKEwj9iY-zj-HXAhWiB8AKHfpYA1oQ6AEITzAF#v=onepage&q=entropy%20with%20zero%20proportions&f=false

we put 0.000001 instead of zero.
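
In NumPy terms this is just a floor applied to the proportions before taking the logarithm (a sketch; the 0.000001 value is the one quoted above):

```python
import numpy as np

p = np.array([0.7, 0.3, 0.0, 0.0])   # per-author proportions in one paragraph
p = np.where(p == 0, 1e-6, p)        # replace exact zeros, as suggested in the book
entropy = float(-(p * np.log(p)).sum() / np.log(len(p)))
```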

lbaligand commented 6 years ago

The break score is very low, so low that we might run into an underflow and/or overflow.
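
One possible hedge (a sketch, not necessarily what the code should do) is to clamp the raw value into the representable float range before it enters any further computation:

```python
import numpy as np

def clamp_break_score(raw_score, floor=np.finfo(float).tiny):
    """Keep a very small break score inside the representable float range so
    later multiplications/divisions neither underflow to 0 nor blow up."""
    return float(min(max(raw_score, floor), 1.0))
```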


adrian-pace commented 6 years ago

I think we have enough of them. What do you think?

lbaligand commented 6 years ago

Yes, we can start finding correlations/regression as you mentioned in the other issue ;)

adrian-pace commented 6 years ago

Yep. Just this line is not working in `type_overall_score`:

`norm_user = np.nan_to_num(norm_type/norm_type.sum(axis=0))`

I get `RuntimeWarning: invalid value encountered in true_divide`. If I print the values being divided, I have `[[ 0.  1.  0.  0.]]` and `[ 0.  1.  0.  0.]`.
If you want an element-wise division, the two operands should have matching shapes (here we have `(1, 4)` and `(4,)`), and you also can't divide by zero.
I got the error by writing "hi !" on a collab-react-component pad.
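
A possible way to guard that division (a sketch under the assumption that `norm_type` is a 2-D array like the one printed above; only `norm_type` and `type_overall_score` come from the project, the rest is illustrative):

```python
import numpy as np

norm_type = np.array([[0.0, 1.0, 0.0, 0.0]])       # the values printed above

# Divide only where the column sum is non-zero, writing 0 elsewhere, so the
# 0/0 entries no longer trigger "invalid value encountered in true_divide".
col_sums = norm_type.sum(axis=0, keepdims=True)    # shape (1, 4), matches norm_type
norm_user = np.divide(
    norm_type,
    col_sums,
    out=np.zeros_like(norm_type),
    where=col_sums != 0,
)
```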