Comment and discuss results.

JJ commented 5 years ago

Discuss them from the point of view of whether or not they show self-organization. Highlight the repos with the most remarkable changes.

thebooort commented 5 years ago

There are two possible ways to proceed at this point:

- We have information about which distribution our data follow (or at least which one is statistically favored over the others).
- We can also have a purely power-law discussion: whether the minimal value (where power-law behaviour begins) has changed (for example, tensorflow exhibits a dramatic change there), and whether the alpha parameter has changed.

JJ commented 5 years ago

Can we do both?

thebooort commented 5 years ago

of course :)

thebooort commented 5 years ago

Looking at the results, I would say that tensorflow, tpot and django are quite strange: their xmin value has changed a lot.

I have not included them in the correlation matrix plot. Even after applying some scalers from sklearn, their values are so large that they blur the rest.
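A side note on the outlier problem, as a hedged sketch rather than what was actually run here: standard scalers are affine transforms, and Pearson correlation is unchanged by them, so scaling alone cannot stop a few extreme repos from dominating a correlation matrix. One outlier-resistant alternative (not the method used in this thread) is a rank correlation such as Spearman's. The numbers and the names `xmin_dif` / `alpha_dif` below are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def standardize(xs):
    """Affine rescaling to zero mean and unit variance."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return [(x - m) / s for x in xs]

# Made-up numbers: five ordinary repos plus one extreme one (last entry).
xmin_dif = [1, 2, 3, 4, 5, 100000]
alpha_dif = [0.5, 0.4, 0.3, 0.2, 0.1, 3.0]

print("Pearson, raw:   ", pearson(xmin_dif, alpha_dif))
print("Pearson, scaled:", pearson(standardize(xmin_dif), standardize(alpha_dif)))
print("Spearman:       ", spearman(xmin_dif, alpha_dif))
```

With these toy values the single extreme repo makes Pearson look strongly positive, with or without scaling, while the rank correlation is far less dominated by it.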

JJ commented 5 years ago

Maybe they have big squash merges, which count as a single commit.


JJ commented 5 years ago

Looking at the charts and your description above, I feel I'm missing the big picture here. Give me, in three or four sentences, what they mean, what are the implications, why we want to do this and what can we conclude from the two summary charts.

JJ commented 5 years ago

For instance, what does this mean? Why do we want to know?

[Image: corr_matrix]

JJ commented 5 years ago

What about this one? What is ND? Is it comparing how many repos follow one distribution or the other, with ND simply meaning that the rest of the repos follow neither?

[Image: summary]

JJ commented 5 years ago

In the case above, it would be better if the columns for 2017 and 2019 were side by side. Off the top of my head, I would say that the probability of these distributions following a truncated power law increases (if we compare it with LogNormal), while the probability of following a pure power law decreases (if we compare it with an exponential). Can't we do a Friedman or something like that to say which distribution wins overall?
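To make the Friedman idea concrete, here is a minimal sketch under stated assumptions: each repo is a block, each candidate distribution gets a per-repo score (for example a log-likelihood, higher meaning a better fit), and the test asks whether the mean ranks of the candidates differ. The scores below are invented, and the function omits the tie correction; in practice `scipy.stats.friedmanchisquare` computes the statistic and its p-value.

```python
def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """Friedman chi-square statistic (no tie correction).

    blocks: one row per repo, one score per candidate distribution.
    Returns (statistic, mean rank of each candidate).
    """
    n = len(blocks)      # number of repos (blocks)
    k = len(blocks[0])   # number of candidate distributions
    rank_sums = [0.0] * k
    for row in blocks:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return stat, [r / n for r in rank_sums]

# Hypothetical scores for 4 repos and 3 candidates
# (columns: truncated power law, lognormal, exponential).
scores = [
    [3, 2, 1],
    [3, 1, 2],
    [3, 2, 1],
    [2, 3, 1],
]
stat, mean_ranks = friedman_statistic(scores)
print(stat, mean_ranks)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom. With only 16 repos the test has limited power, so a non-significant result would not mean the candidates are equivalent.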

thebooort commented 5 years ago

> Looking at the charts and your description above, I feel I'm missing the big picture here. Give me, in three or four sentences, what they mean, what are the implications, why we want to do this and what can we conclude from the two summary charts.

There are three types of plots:

- _name-of-repo comparative_: here you can see the original data and the fitted alternatives for it (the main representatives of heavy-tailed distributions), for both 2017 and 2019. They are mostly there to show the results graphically.
- _test results_: every bar represents a test between two candidate distributions and its outcome across the 16 repos analyzed. When the test cannot decide, I used the label ND (for non-decidable).
- _correlation matrix_: since I have seen some strange values in some repos, I analyzed whether there is any correlation between the changes in the xmin value, the alpha value and the number of commits (i.e. alpha_dif is [fitted alpha in 2017 - fitted alpha in 2019]). As you can see in the tables, tensorflow has a big variation in its alpha value. This is caused by the change in the distribution, since the data do not follow a power law until the 10^5 order of magnitude. I just wanted to know whether we can extrapolate this to the other repos. But clearly, tensorflow is one of a kind.
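The ND label can be illustrated with the usual log-likelihood-ratio comparison (the Vuong-style test described by Clauset, Shalizi and Newman): the sign of the ratio R says which model is favored, and the p-value says whether that sign can be trusted. This is only a sketch; the function name, the 0.1 threshold and the toy log-likelihoods are assumptions, not necessarily what produced the bar plot.

```python
import math

def compare_fits(loglik_a, loglik_b, name_a, name_b, p_threshold=0.1):
    """Compare two fitted models on the same data.

    loglik_a, loglik_b: pointwise log-likelihoods of each observation.
    Returns (R, p, label), where label is name_a, name_b or "ND".
    """
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    R = sum(diffs)                      # > 0 favors model a, < 0 favors model b
    mean = R / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    if sigma == 0.0:
        # Degenerate case: every observation gives the same difference.
        p = 1.0 if R == 0 else 0.0
    else:
        # Two-sided p-value for the sign of R under a normal approximation.
        p = math.erfc(abs(R) / (math.sqrt(2.0 * n) * sigma))
    if p > p_threshold:
        label = "ND"                    # the test cannot decide
    else:
        label = name_a if R > 0 else name_b
    return R, p, label

# Toy pointwise log-likelihoods (hypothetical values).
clear_a = [-1.0, -1.2, -0.9, -1.1]
clear_b = [-2.0, -2.1, -2.2, -1.9]
mixed_a = [-1.0, -2.0, -1.0, -1.4]
mixed_b = [-2.0, -1.0, -1.5, -1.0]

print(compare_fits(clear_a, clear_b, "power_law", "exponential")[2])
print(compare_fits(mixed_a, mixed_b, "power_law", "exponential")[2])
```

Counting the three labels over the 16 repos for each pair of distributions would give one bar of the summary chart.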

JJ commented 5 years ago

So the paper boils down to:

Do code repositories actually follow a power law?
- If they do, at what x_min does it start?
  - Table for x_min in 2017, 2019.
- What alpha would they have? What is the range? What is it related to?
  - Chart plotting PL fit, 2017, 2019.
  - Correlation matrix between xmin and alpha? ( ← does this make sense? )
- If they don't, do they phase-change and start following it all of a sudden?
  - Table with most likely fit, 2017, 2019.
  - Chart with comparison (chart above) 2017, 2019.

Does this make sense? Can we tell this in the intro?

thebooort commented 5 years ago

> Off the top of my head, I would say that the probability of these distributions following a truncated power law increases (if we compare it with LogNormal), while the probability of following a pure power law decreases (if we compare it with an exponential). Can't we do a Friedman or something like that to say which distribution wins overall?

I am now working on that type of comparison. Rather than saying which one wins (since that can be hard to define), I am thinking of something like the measure used in "Scale-free networks are rare" by Broido and Clauset. There they rank the evidence in categories: strong evidence of following a power law, medium evidence of following a PL, etc. I think it can easily be applied to our case and would summarize the bar plot in a more understandable way.

> In the case above, it would be better if the columns for 2017 and 2019 were side by side.

Agreed, good point. By the way, since every plot has its associated dataset stored in \data, once we decide which ones to include they can be changed or regenerated however we want with R.

thebooort commented 5 years ago

> So the paper boils down to:
>
> * Do code repositories actually follow a power law?
>   * If they do, at what x_min does it start?
>     * Table for x_min in 2017, 2019.
>   * What alpha would they have? What is the range? What is it related to?
>     * Chart plotting PL fit, 2017, 2019.
>     * Correlation matrix between xmin and alpha? ( ← does this make sense? )

Hmm. I believe that an extreme change in xmin could mean a change in the underlying distribution. Therefore, I will not use it here.

On the other hand, I think it would be nice to measure the correlation between alpha and the number of commits. A lack of correlation would, I think, suggest that our system is in some kind of equilibrium: even as it evolves over time (two years), the modifications are regulated and the system does not change dramatically. If there were some kind of correlation, I would argue that the system is still changing and no conclusion should be drawn, right? Let me know what you think about this.

>   * If they don't, do they phase-change and start following it all of a sudden?
>     * Table with most likely fit, 2017, 2019.
>     * Chart with comparison (chart above) 2017, 2019.
>
> Does this make sense? Can we tell this in the intro?

I think so, yes!

JJ commented 5 years ago

On Mon, 25 Feb 2019 at 20:16, Bartolomé Ortiz Viso (<notifications@github.com>) wrote:

> Hmm. I believe that an extreme change in xmin could mean a change in the underlying distribution. Therefore, I will not use it here.

OK. Commenting on why it does not make sense might help too.

> On the other hand, I think that it would be nice to measure the correlation between alpha and the number of commits.

We need to work with more than 16 repos, then. I don't know if that fits within this paper. We have 32 samples here; if we find that correlation, we could put that in future work.

> This, I think, proves that our system is in some kind of equilibrium, so even when it evolves in time (two years) all of the modifications are regulated and the system does not change dramatically. If there was some kind of correlation I would argue that the system is still changing and no conclusion should be drawn, right?

I don't think there will be any kind of equilibrium other than punctuated equilibrium. If they are indeed in a critical state, they are going to evolve all the time. So I would say that's the case.

> Does this make sense? Can we tell this in the intro?
>
> I think so, yes!

So please write it down (or some equivalent) in the introduction.

thebooort commented 5 years ago

* Do code repositories actually follow a power law? _Extract the p-value from a Kolmogorov-Smirnov test to see whether there is evidence of a power law._
  * If they do, at what x_min does it start? _xmin is obtained by default in the previous test._
    * Table for x_min in 2017, 2019.
  * What alpha would they have? What is the range? What is it related to? _Computed with Newman's formula once we know xmin._
    * Chart plotting PL fit, 2017, 2019.
    * Correlation matrix between xmin and alpha? ( ← does this make sense? )
  * If they don't, do they phase-change and start following it all of a sudden? _Log-likelihood ratio tests against alternative models. Here we can compare every model against the power law and run an extra truncated-PL vs lognormal test. From that, extract a score._
    * Table with most likely fit, 2017, 2019.
    * Chart with comparison (chart above) 2017, 2019.

I'm now working on the first KS test; the rest are mostly finished. (Just to keep a record of what we are doing.)
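For the record, the alpha and xmin steps in the list above can be sketched as follows. This is a minimal illustration, not the code used for the paper. It assumes the continuous maximum-likelihood estimator alpha = 1 + n / Σ ln(x_i / xmin) and picks xmin by minimizing the Kolmogorov-Smirnov distance between the tail and the fitted power law. Commit sizes are discrete, so the continuous formula is an approximation. The function names and the synthetic data are hypothetical.

```python
import math
import random

def fit_tail(tail, xmin):
    """Fit a continuous power law to sorted values >= xmin.

    Returns (alpha, ks_distance). Requires at least one value > xmin.
    """
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum           # maximum-likelihood estimate
    ks = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare against the empirical CDF just before and just after x.
        ks = max(ks, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return alpha, ks

def scan_xmin(data, min_tail=50):
    """Try each observed value as xmin; keep the one with the smallest KS distance.

    Returns (xmin, alpha, ks_distance), or None if there is too little data.
    """
    xs = sorted(data)
    best = None
    for i in range(len(xs) - min_tail + 1):
        xmin = xs[i]
        if xs[-1] == xmin:              # degenerate tail, nothing to fit
            break
        alpha, ks = fit_tail(xs[i:], xmin)
        if best is None or ks < best[2]:
            best = (xmin, alpha, ks)
    return best

# Synthetic data drawn from a power law with alpha = 2.5 and xmin = 1
# (inverse-transform sampling), only to exercise the functions.
random.seed(0)
true_alpha = 2.5
data = [(1.0 - random.random()) ** (-1.0 / (true_alpha - 1.0)) for _ in range(1000)]

alpha_hat, ks_hat = fit_tail(sorted(data), 1.0)
print("alpha at known xmin:", alpha_hat, "KS distance:", ks_hat)
print("scan result (xmin, alpha, KS):", scan_xmin(data))
```

The p-value for the first question would then come from comparing the observed KS distance with the distances obtained on synthetic datasets generated from the fitted model, which is not shown here.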

JJ commented 5 years ago

Please remember to write this up in the intro and abstract. We can always modify it if the hypotheses do not hold, but it needs to be written, and rewritten, and reviewed, and so on.

geneura-papers / 2019-SASO-Repos-Powerlaws

Comment and discuss results. #9