chendong0444 / gitinspector

Automatically exported from code.google.com/p/gitinspector
GNU General Public License v3.0

Show modified lines percentage (code stability) #10

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
The number of commits, the total number of lines added/removed and the number 
of lines remaining in the code are interesting statistics. Thanks for 
extracting that information.

Maybe it would be interesting to know, out of all the lines modified by a 
developer, what percentage of those lines have been changed later on. Such a 
statistic could be useful in the following context: say I write a new function 
with lots of bugs in it. Over time, a lot of my original lines will be 
modified (as bug fixes). Note that the number of commits in that example is 
inversely proportional to the "quality" of the work.
Another example where the statistic I am suggesting could be interesting is 
when designing features. How scalable is the code? Each time a new sub-feature 
is added, how many existing lines must be changed? Does the code need to be 
redesigned every time, or is it minimal work to plug in new features?

For the implementation, maybe using (sometimes available) commit message 
keywords such as "fix", "crash", "bug" ... could be useful.

Original issue reported on code.google.com by julien.f...@kitware.com on 19 Jul 2013 at 1:19

GoogleCodeExporter commented 9 years ago
Hi Julien.

That would be interesting; yes (and quite useful). However, I'm not quite sure 
how to implement it in a general (and working) way.

Maybe the easiest way would be to simply try to track how many lines a 
developer has attributed to him, during different times, in the commit history 
by checking a file multiple times with git blame or perhaps by using git blame 
--reverse.
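
As a rough illustration of the first idea above (counting the rows attributed to each developer at a given point in history), here is a minimal Python sketch. It assumes git is available on the PATH and relies on `git blame --line-porcelain`, which repeats the author header for every blamed line. The function names are made up for this example and are not taken from gitinspector.

```python
import subprocess
from collections import Counter


def count_rows_by_author(porcelain_text):
    """Count blamed rows per author in `git blame --line-porcelain` output."""
    rows = Counter()
    for line in porcelain_text.splitlines():
        # Header lines look like "author Jane Doe"; file content lines are
        # prefixed with a tab, so they can never match this check.
        if line.startswith("author "):
            rows[line[len("author "):]] += 1
    return rows


def blame_rows(path, revision="HEAD"):
    """Run git blame for one file at one revision (hypothetical helper)."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", revision, "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return count_rows_by_author(result.stdout)
```

Calling `blame_rows` for the same file at several revisions would give the per-author row counts "during different times" mentioned above, at the cost of one blame run per revision.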

I have a feeling that it will take quite a long time to analyze.

Using commit messages to track if a commit is a fix is not really an option 
considering all the different languages people use.

The most general solution would probably be to do a "Code stability" value for 
each developer that simply tells us how stable their code is (how much of it 
has survived). The average code stability of all developers would also be the 
general code stability of the project. In any case; a statistic such as this 
one also fits nicely into the original use case of gitinspector; the grading of 
student projects.

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 20 Jul 2013 at 8:31

GoogleCodeExporter commented 9 years ago
Considering this is a good idea, something like this will be implemented 
eventually (as discussed in my previous post). However, there are other things 
that are of higher priority right now.

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 24 Jul 2013 at 7:47

GoogleCodeExporter commented 9 years ago
I think a first implementation could look something like this:

for each commit in history
     for each added line
          push the added line into an <author, file line, count> map with count = 1
     for each modified or removed line
          search the entries of all authors for the modified line
               increment that line's count by 1
          if the modified line was not associated with the current commit author, then add an entry in the map

Of course there are some subtleties to take into account (such as adding a new 
line should bump the line index of all the following lines in the file)...
But I would be curious to know if something as simple as that brings some 
useful stats.
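
To make the idea concrete, here is a small Python sketch of this kind of line tracking. It is only an approximation of the pseudocode above: it works on successive snapshots of a single file (a made-up input format, not real git data) and uses `difflib.SequenceMatcher` to decide which lines were added or removed, which also takes care of shifting line indices. The name `churn_per_author` is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def churn_per_author(history):
    """history: list of (author, lines) snapshots of one file, oldest first.

    Returns, per author, the percentage of the lines they introduced that
    were later modified or removed.
    """
    added = Counter()    # lines each author introduced
    changed = Counter()  # of those, lines later modified or removed
    owners = []          # owners[i] is the author of current line i
    previous = []
    for author, lines in history:
        new_owners = []
        matcher = SequenceMatcher(a=previous, b=lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Untouched lines keep their original owner.
                new_owners.extend(owners[i1:i2])
                continue
            # Old lines i1..i2 were removed or replaced by this commit.
            for owner in owners[i1:i2]:
                changed[owner] += 1
            # New lines j1..j2 belong to the author of this commit.
            added[author] += j2 - j1
            new_owners.extend([author] * (j2 - j1))
        owners, previous = new_owners, lines
    return {a: 100.0 * changed[a] / added[a] for a in added}


history = [
    ("alice", ["a", "b", "c", "d"]),
    ("bob", ["a", "B", "c", "d", "e"]),
]
print(churn_per_author(history))  # {'alice': 25.0, 'bob': 0.0}
```

In the example, bob rewrites one of alice's four lines and appends one more, so a quarter of alice's lines count as changed.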

Original comment by julien.f...@kitware.com on 24 Jul 2013 at 12:30

GoogleCodeExporter commented 9 years ago
That's probably not quite how I would do it. You don't really need to track the 
line-count either.

In any case, your implementation would probably push out something; but it 
wouldn't be correct; because we would be doing it manually without the use of 
the algorithm used in git blame. Meaning the number of attributed lines would 
be out of sync with what we would get from the above. An attributed line is not 
(necessarily) the same thing as an inserted/added line.

I think a similar method but by using git blame on the changed lines (to check 
who it really belongs to) would give us something more meaningful.

It's doable; but will be a little hairy to implement. But something like this 
would probably work better:

for each file that ever existed in repo:
    get blame history for every commit of file
    for each commit of file:
        for each added line:
            check in blame history to which author the line should be attributed and do:
                attributed_lines += 1
        for each removed line:
            check in blame history to which author the removed line was attributed and do:
                removed_lines += 1

In the end you get a stability value on each author that tells us how stable 
their code is throughout the git history.
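
The final step could be sketched in Python as follows. The blame lookups are assumed to have already produced a flat list of ("added" or "removed", author) events, which is an input format invented for this example. The formula used, surviving lines divided by attributed lines, is also an assumption, since the pseudocode above only collects the two counters.

```python
from collections import defaultdict


def stability_by_author(events):
    """events: iterable of (kind, author), kind being "added" or "removed".

    Returns the percentage of each author's attributed lines that survived.
    """
    attributed = defaultdict(int)
    removed = defaultdict(int)
    for kind, author in events:
        if kind == "added":
            attributed[author] += 1
        elif kind == "removed":
            removed[author] += 1
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    # Assumed formula: stability = surviving lines / attributed lines.
    return {
        author: 100.0 * (total - removed[author]) / total
        for author, total in attributed.items()
    }


events = [("added", "adam")] * 4 + [("removed", "adam")] + [("added", "julien")] * 2
print(stability_by_author(events))  # {'adam': 75.0, 'julien': 100.0}
```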

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 24 Jul 2013 at 2:11

GoogleCodeExporter commented 9 years ago
I'm targeting this for the 0.4.0 release. It will get implemented then or in 
some 0.3.x release up to 0.4.0.

Original comment by gitinspe...@ejwa.se on 25 Jul 2013 at 11:07

GoogleCodeExporter commented 9 years ago
This turned out to be a little more tricky than expected. Work on this issue is 
slowly progressing.

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 7 Aug 2013 at 12:38

GoogleCodeExporter commented 9 years ago
I am also considering whether the code stability value should take into 
account not only the rows that have survived but also *how long* they survived 
before they were removed... How this should be calculated (and how to factor 
it in) is a very interesting problem.

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 7 Aug 2013 at 12:43

GoogleCodeExporter commented 9 years ago
Great idea!
Maybe that info can't be combined with the "how many" (number of rows 
modified) stats. It might be better to keep those stats separate and produce 
different graphs to display them instead. I would keep the *how long* in 
number of days.
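
A tiny Python sketch of what a separate "how long" statistic in days might look like. The input format (author, date added, date removed or None) and the function name are invented for the example, and lines that still exist are simply measured up to a given reference date, which is one possible choice among several.

```python
from datetime import date
from statistics import mean


def mean_lifetime_days(lines, today):
    """lines: iterable of (author, added_on, removed_on or None).

    Returns the mean lifetime in days of each author's lines.
    """
    lifetimes = {}
    for author, added_on, removed_on in lines:
        # Lines that were never removed are counted up to `today`.
        end = removed_on if removed_on is not None else today
        lifetimes.setdefault(author, []).append((end - added_on).days)
    return {author: mean(days) for author, days in lifetimes.items()}


lines = [
    ("adam", date(2013, 7, 1), date(2013, 7, 11)),  # lived 10 days
    ("adam", date(2013, 7, 1), None),               # still alive
]
print(mean_lifetime_days(lines, today=date(2013, 7, 31)))
```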

Original comment by julien.f...@kitware.com on 7 Aug 2013 at 12:28

GoogleCodeExporter commented 9 years ago
This issue was closed by revision 16154cd0ba94.

Original comment by gitinspe...@ejwa.se on 27 Jan 2014 at 2:12

GoogleCodeExporter commented 9 years ago
After playing around with this for a while (and trying different solutions) I 
managed to implement something that does not slow down analysis in any 
detectable way.

There are now two additional values in the blame output; stability and age. See 
the commit referenced above for more information. While the solution is a lot 
more naive than the ones previously discussed; it still gives good information 
on authors in relation to each other.

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 27 Jan 2014 at 2:19

GoogleCodeExporter commented 9 years ago
That's a very nice feature. Nice job !
I've tried it on CTK (https://github.com/Commontk/CTK) and I seem to get some 
odd results (most of the results look fine otherwise): Stability is > 100 and 
Age (in months?) > lifetime of the project.

Author                     Rows      Stability          Age       % in comments
Stability >100:
 Luis Ibanez                  83         8300.0        40.25                9.64
 ivmartel                    811          159.0         6.17               41.06
 ivowolf                    3344          314.9         1.83               19.47
Age > lifetime:
 Marco Nolden               9107           56.5       357.58               19.49

Original comment by julien.f...@kitware.com on 27 Jan 2014 at 1:27

GoogleCodeExporter commented 9 years ago
Yep. I noticed this when trying it on some other repos as well.

The age value is a pseudo value (for now) and only makes sense when compared 
to other authors. It could be a good idea to redo it to show months, or weeks 
(when the -w flag is given).

When it comes to the stability value, 8300% on 83 rows (Luis) means that Luis 
only has one inserted row but has 83 rows blamed to him (probably duplicates of 
some kind). I guess git is getting a little confused here. I will investigate 
and update this issue when I find a good solution.
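
Going by the numbers in this comment, the stability value behaves like the ratio of blamed rows to inserted rows. The Python sketch below shows that arithmetic. The formula is inferred from the explanation above rather than copied from the gitinspector source, and the function name is made up.

```python
def stability_percent(blamed_rows, insertions):
    """Stability as blamed rows relative to inserted rows, in percent."""
    if insertions == 0:
        raise ValueError("an author with no insertions has no stability value")
    return 100.0 * blamed_rows / insertions


# Luis: 83 rows blamed to him, but only one recorded insertion.
print(stability_percent(83, 1))  # 8300.0
```

Under this reading, any author whose blamed rows outnumber their recorded insertions ends up above 100%.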

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 27 Jan 2014 at 10:52

GoogleCodeExporter commented 9 years ago
The age value has been improved with revision a1e90d0a9d46 and now shows the 
age of the author's rows in months (or weeks).

I think that the strange stability value reported is due to git sometimes 
losing some information (and counting too few insertions) whenever new files 
are added (or old files are moved). At least that is what I suspect. I have 
some ideas on how to get around it and will work on it eventually.

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 14 Feb 2014 at 4:34

GoogleCodeExporter commented 9 years ago
Only the age value will be included in the 0.4.0 release. I'm bumping proper 
support for the stability value to 0.4.1. There are a few things that need to 
be changed/reworked before it can be included in a way that makes me completely 
satisfied with it.

/Adam Waldenberg

Original comment by gitinspe...@ejwa.se on 17 Mar 2014 at 7:20

GoogleCodeExporter commented 9 years ago
I'm considering this "completed".

After some investigation it is evident that a stability value over 100% means 
that someone has added code that git blame attributes to another author. 
Consequently, the author making the change gets the insertion but not the 
blame, resulting in a raised stability for the original author.

Seems correct to me.

-w / --grading will result in age values being displayed in weeks, instead of 
months.

Original comment by gitinspe...@ejwa.se on 3 Nov 2014 at 10:15