chaoss / grimoirelab-graal

A Generic Repository AnALyzer
GNU General Public License v3.0

[cocom] Add repository level analysis option for CoCom backend #38

Closed inishchith closed 5 years ago

inishchith commented 5 years ago

@valeriocos As discussed, this adds repository level analysis as an option to CoCom Backend.

For now, this is being added to make repository-level analysis and visualization in Kibana a bit easier. We can discuss the limitations this might cause in the future and any edge cases I might have missed in the implementation.

Results of comparison can be found in #36

Note: Some of the commits here are a continuation of the work in #37

Edit: This is just a rough idea (implementation)

Some things that need to be worked on:

inishchith commented 5 years ago

@valeriocos Sorry for the delayed response.

Instead of passing a flag to trigger the analysis at repo level, wouldn't it be better to introduce a new category that performs this kind of analysis? What do you think?

The variable self.history collects the cocom analysis for each file. The solution seems to work well when performing the initial fetch, but I'm not sure it would work for incremental fetches. Could you explain the logic in case I'm missing something?
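
To make the incremental-fetch concern concrete, here is a minimal sketch. It is not Graal's actual code: it assumes the history is a dict mapping file paths to their latest metrics, and `apply_commit`, `repo_totals`, the commit layout and the metric names are all hypothetical.

```python
from collections import defaultdict


def repo_totals(history):
    """Sum the per-file metrics held in the history into repository totals."""
    totals = defaultdict(int)
    for metrics in history.values():
        for name, value in metrics.items():
            totals[name] += value
    return dict(totals)


def apply_commit(history, commit):
    """Update the per-file history with the files touched by one commit."""
    for path, metrics in commit["files"].items():
        if metrics is None:  # hypothetical convention: None marks a deleted file
            history.pop(path, None)
        else:
            history[path] = metrics
    return repo_totals(history)


# Hypothetical commits: each one only lists the files it touched.
commits = [
    {"files": {"a.py": {"loc": 10, "ccn": 2}, "b.py": {"loc": 20, "ccn": 5}}},
    {"files": {"a.py": {"loc": 15, "ccn": 3}}},
]

# Initial fetch: the history is built up from the first commit onwards.
full_history = {}
for commit in commits:
    full_totals = apply_commit(full_history, commit)

# Incremental fetch: a new run starts with an empty history and only
# sees the latest commit, so b.py is missing from the totals.
incremental_totals = apply_commit({}, commits[-1])

print(full_totals)         # {'loc': 35, 'ccn': 8}
print(incremental_totals)  # {'loc': 15, 'ccn': 3}
```

In this sketch the two runs disagree because the in-memory history is lost between runs, which is the situation the question above is asking about.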

A different approach could be to execute the analyzer over the full repo for each commit and then sum up the results obtained. Maybe the param -t could speed up the analysis. What do you think?

I'll update you once I've worked on the evaluation. Let me know what you think!

valeriocos commented 5 years ago

No worries @inishchith , thank you for answering.

So far we have used category as a convention to perform analysis with different analyzers, and repository-level analysis makes sense for other backends as well, so I thought it would be better to provide an option (flag) instead of a category. Let me know what you think.

If the data has a different shape it's probably better to use a different category. However, we can proceed without adding a new category, and change the code afterwards if needed :)

I'm not sure about the definition of the variable self.repository_level in __init__. That information seems to be more related to the way the fetch is performed than how the class is initialized. Could you explain why repository_level has been defined as an instance attribute? Thanks
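
As a rough illustration of the two points above, the sketch below passes the flag to the fetch call instead of storing it on the instance, and shows how the item shape changes with it. `CoComSketch`, its `fetch` signature and the item layout are hypothetical; this is not Graal's actual CoCom class.

```python
class CoComSketch:
    """Hypothetical stand-in for a backend; not Graal's actual CoCom class."""

    def __init__(self, uri):
        # Only configuration that describes the repository itself lives here.
        self.uri = uri

    def fetch(self, commits, repository_level=False):
        """Yield one item per commit; the flag only affects this call."""
        for commit in commits:
            files = commit["files"]
            if repository_level:
                # Repository-level item: a single aggregated value per commit.
                yield {"commit": commit["sha"], "loc": sum(f["loc"] for f in files)}
            else:
                # File-level item: the per-file results are kept as they are.
                yield {"commit": commit["sha"], "files": files}


backend = CoComSketch("https://example.org/repo.git")
commits = [
    {"sha": "abc", "files": [{"path": "a.py", "loc": 10}, {"path": "b.py", "loc": 5}]},
]

file_items = list(backend.fetch(commits))
repo_items = list(backend.fetch(commits, repository_level=True))

print(file_items[0].keys())  # dict_keys(['commit', 'files'])
print(repo_items[0])         # {'commit': 'abc', 'loc': 15}
```

The same instance can serve both kinds of fetch here, and the two item shapes differ, which is the case where a separate category may turn out to be the cleaner option.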

inishchith commented 5 years ago

If the data has a different shape it's probably better to use a different category. However, we can proceed without adding a new category, and change the code afterwards if needed :)

I'm not sure about the definition of the variable self.repository_level in __init__. That information seems to be more related to the way the fetch is performed than how the class is initialized. Could you explain why repository_level has been defined as an instance attribute?

Edit:

  1. made the correction
  2. @valeriocos Could you review #37 before this one? It is of higher priority. This PR still requires some work :)

inishchith commented 5 years ago

@valeriocos I gave some thought to the incremental-fetch issue with the current implementation. I figured out that we would have to execute an initial run over the entire repository every time, which would again increase the execution time for large repositories, as addressed in #36.

As you pointed out above (about lizard's worker threads for repository-level analysis):

A different approach could be to execute the analyzer over the full repo for each commit and then sum up the results obtained ? Maybe the param -t may speed up the analysis ...

Here the incremental fetches wouldn't be affected and would work just as before. I have implemented a version locally for evaluation purposes (the results are below).
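
A minimal sketch of that per-commit idea, under stated assumptions: `analyze_file` is a toy stand-in that counts non-blank lines rather than a real analyzer such as lizard, and the snapshot layout (commit hash to a dict of path and source text) is hypothetical.

```python
def analyze_file(source):
    """Toy stand-in for a real analyzer: count non-blank lines as LOC."""
    return {"loc": sum(1 for line in source.splitlines() if line.strip())}


def analyze_snapshot(snapshot):
    """Analyze every file present at one commit and sum up the results."""
    totals = {"loc": 0, "num_files": 0}
    for source in snapshot.values():
        totals["loc"] += analyze_file(source)["loc"]
        totals["num_files"] += 1
    return totals


# Hypothetical snapshots: the full set of files as of each commit.
snapshots = {
    "c1": {"a.py": "x = 1\n\ny = 2\n"},
    "c2": {"a.py": "x = 1\ny = 2\nz = 3\n", "b.py": "print('hi')\n"},
}

results = {sha: analyze_snapshot(files) for sha, files in snapshots.items()}

print(results["c1"])  # {'loc': 2, 'num_files': 1}
print(results["c2"])  # {'loc': 4, 'num_files': 2}
```

Each commit is analyzed from its own snapshot, so a run that starts at `c2` produces the same totals for `c2` as a run that started at `c1`. No state has to be carried between fetches.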

| Repository | Number of Commits | *File Level | Repository Level |
|---|---|---|---|
| chaoss/grimoirelab-perceval | 1387 | 23.65 min | 27.97 min |
| chaoss/grimoirelab-sirmordred | 869 | 9.69 min | 4.27 min |
| chaoss/grimoirelab-graal | 169 | 1.73 min | 0.90 min |
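
For a quick read of these timings, the ratio of file-level to repository-level time can be worked out as below. The numbers are copied from the results above; the variable names are only illustrative.

```python
# (file-level minutes, repository-level minutes), taken from the results above
timings_min = {
    "chaoss/grimoirelab-perceval": (23.65, 27.97),
    "chaoss/grimoirelab-sirmordred": (9.69, 4.27),
    "chaoss/grimoirelab-graal": (1.73, 0.90),
}

# A ratio above 1 means the repository-level run was faster.
ratios = {
    repo: round(file_level / repo_level, 2)
    for repo, (file_level, repo_level) in timings_min.items()
}

for repo, ratio in ratios.items():
    print(f"{repo}: {ratio}")
# chaoss/grimoirelab-perceval: 0.85
# chaoss/grimoirelab-sirmordred: 2.27
# chaoss/grimoirelab-graal: 1.92
```

So the repository-level run was roughly twice as fast for sirmordred and graal, and somewhat slower for perceval.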

(There's a divergence because Perceval has many more files than the other repositories under consideration.) With the help of lizard's repository-level analysis, I was able to create two of the metric visualizations (overall LOC and CCN, and other attributes). REF. https://github.com/chaoss/metrics/issues/139. I'm now moving on to visualizing the most complex files in a repository.

Let me know what you think. Thanks :)

inishchith commented 5 years ago

Closing in reference to #39