After going through the first 150 pads, here are some interesting ones. In this pad, we see good collaboration in the first few lines (users introduce themselves), and then only one person keeps writing.
Two examples of bad collab (pad:640911 and pad:753268).
One example of good collab (pad:133522):
We should take the following into consideration in the metrics:
- Proportion score: let n be the number of authors different from 'Etherpad_admin' and p_i the proportion written by author i. The score should be between 0 and 1 and maximized when the p_i are balanced; two ways to compute it are $\prod_{i=1}^{n} p_i$ and $\sum_{i=1}^{n} p_i \log_n(1/p_i)$.
- Synchronization score: simply the proportion of synchronized writing in the pad.
- Break score: compute the overall length of the pad and derive a proportion of breaks. One baseline would be to divide the number of small breaks, multiplied by 100 times the average length of one word (around 7), by the overall length of the pad, and do the same for long breaks but multiplying them by 1000.
We still need to find a way to evaluate whether the paragraph authors are alternating.
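A minimal sketch of the proportion and break scores under these definitions (the function names, the product rescaling, and the exact long-break formula are illustrative assumptions, not taken from our code):

```python
import numpy as np

def proportion_score(p, use_entropy=True):
    # p: per-author writing proportions (excluding 'Etherpad_admin'),
    # assumed to sum to 1. Returns a score in [0, 1], maximal when balanced.
    p = np.asarray(p, dtype=float)
    n = len(p)
    if n <= 1:
        return 0.0  # a single author means no collaboration to score
    if use_entropy:
        # log base n normalizes the entropy to [0, 1]
        # (equivalent to computing it in nats and dividing by log(n))
        nonzero = p[p > 0]
        return float(np.sum(nonzero * np.log(1.0 / nonzero)) / np.log(n))
    # Product variant; the n**n factor rescales so that a perfectly
    # balanced pad (all p_i = 1/n) scores 1 -- an assumption of this sketch.
    return float(np.prod(p) * n ** n)

def break_score(n_short, n_long, pad_length, avg_word_len=7):
    # Baseline break proportions as described above; how the long-break
    # factor of 1000 combines with avg_word_len is a guess.
    short = n_short * 100 * avg_word_len / pad_length
    long_ = n_long * 1000 * avg_word_len / pad_length
    return short, long_
```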
Finally, we can take a weighted average of all these scores according to their importance.
Good idea for the entropy; also divide by log(n) to normalize it.
To check if they alternate between paragraphs: go over each paragraph in order and check whether the paragraph just after it has a different author. If yes, counter += 1; otherwise nothing. Then divide by n_paragraph + 1 to normalize.
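As a sketch, with `authors_by_paragraph` a hypothetical ordered list of each paragraph's author, keeping the n_paragraph + 1 normalization proposed above:

```python
def alternation_score(authors_by_paragraph):
    # Count consecutive paragraph pairs written by different authors,
    # then normalize by n_paragraph + 1 as proposed above.
    n_paragraph = len(authors_by_paragraph)
    counter = sum(
        1
        for a, b in zip(authors_by_paragraph, authors_by_paragraph[1:])
        if a != b
    )
    return counter / (n_paragraph + 1)
```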
Do a similar proportion system per pad but with paragraphs: compute the entropy of each paragraph, sum them all, and divide by the number of paragraphs.
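Something like this rough sketch, where `paragraph_proportions` is a hypothetical list with one per-author proportion array per paragraph:

```python
import numpy as np

def paragraph_entropy_score(paragraph_proportions, n):
    # Normalized entropy of each paragraph, summed and divided by the
    # number of paragraphs; assumes n >= 2 authors so that log(n) > 0.
    scores = []
    for p in paragraph_proportions:
        p = np.asarray(p, dtype=float)
        nonzero = p[p > 0]
        scores.append(np.sum(nonzero * np.log(1.0 / nonzero)) / np.log(n))
    return float(np.mean(scores))
```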
What's the situation, @pykcel?
Is it implemented? If it isn't, I can start writing it tomorrow.
For the entropy computations we often have proportions equal to zero in some paragraphs, so, as suggested in this book: https://books.google.ch/books?id=FdsUBQAAQBAJ&pg=PA55&lpg=PA55&dq=entropy+with+zero+proportions&source=bl&ots=xiWp7GCDBY&sig=c5bnNoGMmh-C72qWJfslPDZylus&hl=fr&sa=X&ved=0ahUKEwj9iY-zj-HXAhWiB8AKHfpYA1oQ6AEITzAF#v=onepage&q=entropy%20with%20zero%20proportions&f=false
we put 0.000001 instead of zero.
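In code, that workaround could look like this (a sketch; the example proportions are made up):

```python
import numpy as np

EPS = 0.000001  # stand-in for zero proportions, per the book's suggestion

p = np.array([0.5, 0.5, 0.0, 0.0])   # example per-paragraph proportions
p = np.where(p == 0, EPS, p)         # avoids log(0) = -inf terms below
entropy = np.sum(p * np.log(1.0 / p))
```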
The break score is very low... so low that we might run into an underflow and/or overflow.
I think we have enough of them. What do you think?
Yes, we can start finding correlations/regressions as you mentioned in the other issue ;)
Yep. Just this line in type_overall_score is not working, though:
norm_user = np.nan_to_num(norm_type/norm_type.sum(axis=0))
I get:
RuntimeWarning: invalid value encountered in true_divide
  norm_user = np.nan_to_num(norm_type/norm_type.sum(axis=0))
If I print the values being divided, I have:
[[ 0. 1. 0. 0.]] [ 0. 1. 0. 0.]
If you want to divide element-wise, you should have the same dimensions (here we have (1,4) and (4,)). Also, you can't divide by zero.
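For reference, one way to make that line behave (a sketch using the array printed above): numpy will actually broadcast (1,4) against (4,), so it is the 0/0 columns that trigger the warning; np.errstate silences it while nan_to_num keeps mapping the NaNs to 0.

```python
import numpy as np

norm_type = np.array([[0., 1., 0., 0.]])  # shape (1, 4), as printed above
col_sums = norm_type.sum(axis=0)          # shape (4,); broadcasts over rows

# Columns that sum to zero produce 0/0 = NaN and trigger the
# RuntimeWarning; errstate suppresses it, nan_to_num maps NaN back to 0.
with np.errstate(invalid='ignore', divide='ignore'):
    norm_user = np.nan_to_num(norm_type / col_sums)

print(norm_user)  # [[0. 1. 0. 0.]]
```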
I got the error by writing "hi !" on a collab-react-component pad.
That could encapsulate what we display on the graphs into a single value.