[cocom] Add repository level analysis option for CoCom backend

inishchith commented 5 years ago

@valeriocos As discussed, this adds repository level analysis as an option to CoCom Backend.

For now, this is being added to make repository-level analysis and visualization in Kibana a bit easier. We can discuss the limitations this might cause in the future, and any edge cases I might have missed in the implementation.

Results of comparison can be found in #36

Note: Some of the work/commits here build on and continue #37

Edit: This is just a rough idea (implementation)

Some things that need to be worked on:

[ ] Tests
[ ] Docstrings

inishchith commented 5 years ago

@valeriocos Sorry for the delayed response.

> Instead of passing a flag to trigger the analysis at repo level, wouldn't it be better to introduce a new category that performs this kind of analysis? What do you think?

We have used categories as the convention for running different analyzers so far, and repository-level analysis makes sense for other backends as well, so I thought it would be better to provide an option (flag) instead of a category. Let me know what you think.

> The variable self.history collects the cocom analysis for each file. The solution seems to work well when performing the initial fetch, but I'm not sure it would work for incremental fetches. Could you explain the logic in case I'm missing something?

Yes, you're correct. I thought about this midway through, but then forgot to address it. This is a problem we should look to solve. I'll think about how the current implementation can be refactored to support incremental fetches. Thanks for pointing it out 👍

> A different approach could be to execute the analyzer over the full repo for each commit and then sum up the results obtained? Maybe the param -t may speed up the analysis. What do you think?

Yes, this was one of the initial approaches we looked at in #36. I thought the current idea could outperform it in terms of time and redundant calculation (it's like cutting out the history part and then looking for a workable solution). But yes, agreed: we face a trade-off between time (full-repo analysis on every commit) and memory (maintaining self.history). As you pointed out, if we can get incremental fetches working with the current implementation, that would be great; otherwise we'll have to evaluate lizard's full-repo analysis with the WORKING_THREADS option.

I'll update you once I've worked on the evaluation. Let me know what you think!
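To make the trade-off above concrete, here is a minimal sketch of the history-based idea. It is not the actual Graal code: `RepoHistory`, `apply_commit`, and the metric keys are hypothetical names chosen for illustration. Each commit only supplies metrics for the files it touched, the history keeps the latest metrics per file, and the repository-level numbers are the sum over that history.

```python
from typing import Dict, Optional

Metrics = Dict[str, int]
KEYS = ("loc", "ccn", "num_funs")  # illustrative metric names


class RepoHistory:
    """Hypothetical helper: remembers the latest metrics of every file seen so far."""

    def __init__(self) -> None:
        self.files: Dict[str, Metrics] = {}

    def apply_commit(self, changed: Dict[str, Optional[Metrics]]) -> Metrics:
        """Update the history with one commit and return repository-level totals.

        `changed` maps a file path to its new metrics, or to None if the
        commit deleted the file.
        """
        for path, metrics in changed.items():
            if metrics is None:
                self.files.pop(path, None)
            else:
                self.files[path] = metrics
        return self.totals()

    def totals(self) -> Metrics:
        """Sum each metric over all files currently tracked."""
        return {key: sum(m.get(key, 0) for m in self.files.values()) for key in KEYS}


# Two made-up commits: the first adds two files, the second modifies one of them.
commit1 = {
    "a.py": {"loc": 10, "ccn": 2, "num_funs": 1},
    "b.py": {"loc": 5, "ccn": 1, "num_funs": 1},
}
commit2 = {"a.py": {"loc": 12, "ccn": 3, "num_funs": 2}}

# Initial fetch: the history sees every commit, so the totals are complete.
full = RepoHistory()
full.apply_commit(commit1)
complete_totals = full.apply_commit(commit2)

# Incremental fetch: a fresh history starts at commit2 and never learns about b.py.
resumed = RepoHistory()
partial_totals = resumed.apply_commit(commit2)
```

In this sketch `complete_totals` includes `b.py` while `partial_totals` does not, which is the incremental-fetch problem raised above: unless the history is persisted or rebuilt, a fetch that starts midway undercounts the repository.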

valeriocos commented 5 years ago

No worries @inishchith, thank you for answering.

> We have used categories as the convention for running different analyzers so far, and repository-level analysis makes sense for other backends as well, so I thought it would be better to provide an option (flag) instead of a category. Let me know what you think.

If the data has a different shape it's probably better to use a different category. However, we can proceed without adding a new category, and change the code afterwards if needed :)

I'm not sure about the definition of the variable self.repository_level in __init__. That information seems to be more related to the way the fetch is performed than to how the class is initialized. Could you explain why repository_level has been defined as an instance attribute? Thanks

inishchith commented 5 years ago

> If the data has a different shape it's probably better to use a different category. However, we can proceed without adding a new category, and change the code afterwards if needed :)

Yes, agreed. Sure!

> I'm not sure about the definition of the variable self.repository_level in __init__. That information seems to be more related to the way the fetch is performed than how the class is initialized. Could you explain why repository_level has been defined as an instance attribute?

It shouldn't be an instance attribute. I was working on a different approach (for adding an option) and am not sure when I added this line. Sorry about that, and thanks for pointing it out :)

Edit:

Made the correction.
@valeriocos Could you review #37 before this one, as it's of higher priority? This one still requires some work :)

inishchith commented 5 years ago

@valeriocos I gave some thought to the incremental-fetch issue with the current implementation. I figured out that we would have to execute an initial run over the entire repository every time, which would again increase the execution time for large repositories, as discussed in #36.

As you pointed out above (about lizard's worker threads for repository-level analysis):

> A different approach could be to execute the analyzer over the full repo for each commit and then sum up the results obtained? Maybe the param -t may speed up the analysis ...

With this approach, incremental fetches wouldn't be affected and would work just as before. I have implemented a version locally for evaluation purposes (results below).

| Repository | Number of Commits | *File Level | Repository Level |
|---|---|---|---|
| chaoss/grimoirelab-perceval | 1387 | 23.65 min | 27.97 min |
| chaoss/grimoirelab-sirmordred | 869 | 9.69 min | 4.27 min |
| chaoss/grimoirelab-graal | 169 | 1.73 min | 0.90 min |

(There's a divergence because Perceval has a lot more files than the other repositories under consideration.) With the help of lizard's repository-level analysis, I was able to create two of the metric visualizations (overall LOC, and CCN and other attributes). Ref: https://github.com/chaoss/metrics/issues/139. I'm now moving on to visualizing the most complex files in a repository.

Let me know what you think. Thanks :)
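As a rough illustration of the full-repository approach described above (analyze every file at a given commit, then sum the results), here is a minimal sketch. It does not call lizard: `count_loc` is a stand-in analyzer that only counts non-blank lines, and `analyze_repository`, `workers`, and `extensions` are hypothetical names. The thread pool plays the role that lizard's worker-thread option would play in the real evaluation.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

Metrics = Dict[str, int]


def count_loc(path: str) -> Metrics:
    """Stand-in analyzer (NOT lizard): counts non-blank lines of one file."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        loc = sum(1 for line in handle if line.strip())
    return {"loc": loc, "num_files": 1}


def analyze_repository(
    root: str,
    analyze_file: Callable[[str], Metrics] = count_loc,
    workers: int = 4,
    extensions: Tuple[str, ...] = (".py",),
) -> Metrics:
    """Analyze every matching file under `root` and sum the per-file metrics."""
    paths = []
    for directory, subdirs, names in os.walk(root):
        # Skip git metadata so only the checked-out source tree is analyzed.
        subdirs[:] = [d for d in subdirs if d != ".git"]
        paths.extend(
            os.path.join(directory, name) for name in names if name.endswith(extensions)
        )

    totals: Metrics = {}
    # Files are analyzed in parallel; the summation happens in the main thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for metrics in pool.map(analyze_file, paths):
            for key, value in metrics.items():
                totals[key] = totals.get(key, 0) + value
    return totals
```

Because the totals depend only on the working tree at the commit being analyzed, nothing has to be remembered between commits, which is why incremental fetches are unaffected. The cost is that every commit pays for a full pass over the repository.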

inishchith commented 5 years ago

Closing in reference to #39

chaoss / grimoirelab-graal

[cocom] Add repository level analysis option for CoCom backend #38