[discussion] Improvements in existing analyzers and Additions

inishchith commented 5 years ago

This thread is for discussion related to:

* Improvements in existing analyzers
* Addition of new analyzers under corresponding backends

valeriocos commented 5 years ago

Thank you @inishchith for opening this issue. We could start by commenting on possible improvements and additions to trigger discussions and evaluations. Maybe later we could create other issues to focus on specific tasks. Find some ideas below; anyone is free to share their own (cc @jgbarah).

Too much Python: The backends CoQua, CoDep and CoVuln rely on analyzers for Python. It could be useful to include analyzers that target other (popular) languages.

How to deal with multi-language repos: Some repos rely on more than one programming language (e.g. frontend and backend languages), so Graal could execute different analyzers when processing a repo. A similar approach has already been implemented for CoCom, which relies on cloc and lizard (triggered based on the extensions of the files processed by Graal). Another option could be to execute Graal several times, once per programming language; this would allow the user to focus on a specific language.

Analysis of configuration files: Configuration files are pretty popular nowadays, but Graal currently doesn't take them into account. For instance, some tools already exist to validate Dockerfiles, while other tools inspect the content of Docker containers to extract useful information about dependencies and vulnerabilities (cc @neglectos).
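
A minimal sketch of the extension-based triggering mentioned above for multi-language repos. The mapping and the `select_analyzers` helper are hypothetical and only illustrate the idea; they are not Graal's actual code or CoCom's real extension list.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to analyzer names.
# The names are labels for illustration, not Graal identifiers.
ANALYZERS_BY_EXTENSION = {
    ".py": ["lizard", "cloc"],
    ".js": ["lizard", "cloc"],
    ".md": ["cloc"],
}


def select_analyzers(file_path, default=("cloc",)):
    """Return the analyzer names to run on a file, based on its extension."""
    extension = PurePosixPath(file_path).suffix.lower()
    return list(ANALYZERS_BY_EXTENSION.get(extension, default))
```

Files with an unknown or missing extension fall back to the default set, so every file still gets at least a line count.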

jgbarah commented 5 years ago

Thanks for the opportunity. I share @valeriocos' concerns and ideas. One additional improvement could be to run some tool or heuristic on every file to infer its programming language. This would allow skipping binary files during analysis, and also targeting specific tools at specific files based on language, for example.

One option for this would be to use linguist, which is the tool GitHub uses for this matter.
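
As a rough illustration of the "skip binary files" part, here is a minimal sketch of a common heuristic: treat content as binary if a NUL byte shows up in its first few kilobytes. This is an assumption for illustration only; it is not how linguist classifies files, and the function names and the 8000-byte sample size are arbitrary choices.

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Heuristic: content with a NUL byte in its first bytes is treated as binary."""
    return b"\x00" in data[:sample_size]


def is_binary_file(path: str, sample_size: int = 8000) -> bool:
    """Apply the heuristic to the beginning of a file on disk."""
    with open(path, "rb") as handle:
        return looks_binary(handle.read(sample_size), sample_size)
```

The heuristic is cheap but imperfect: UTF-16 text contains NUL bytes and would be skipped, so a real language detector is still the better option.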

valeriocos commented 5 years ago

@jgbarah what do you think about integrating linguist as a backend in Graal?

jgbarah commented 5 years ago

Not sure if as a backend, but it could be an idea.

The fact is that right now I'm not aware of a way to tell Graal something like "if it is C, run this set of tools, if it is Python, run this other set, and if it is source code in any other language, run this one". linguist could help with this. Maybe by running linguist first (as a backend) and, based on its output, deciding which other tools to run...
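
A minimal sketch of that two-step idea, assuming a first pass has already produced a mapping from detected language to file paths. The tool names, `TOOLS_BY_LANGUAGE` and `plan_analysis` are all hypothetical placeholders, not existing Graal features.

```python
# Hypothetical tool sets per language; the tool names are placeholders.
TOOLS_BY_LANGUAGE = {
    "C": ["c_tool_a", "c_tool_b"],
    "Python": ["python_tool_a"],
}
# Tools used for source code in any other language.
DEFAULT_TOOLS = ["generic_tool"]


def plan_analysis(files_by_language):
    """Decide which tools to run on which files, given detected languages.

    files_by_language maps a language name to the file paths detected
    for it by a first pass (e.g. a language-detection backend).
    """
    plan = {}
    for language, files in files_by_language.items():
        tools = TOOLS_BY_LANGUAGE.get(language, DEFAULT_TOOLS)
        plan[language] = {"tools": list(tools), "files": sorted(files)}
    return plan
```

Keeping the plan separate from the execution means the same detection output could also drive a user-facing option that restricts the analysis to selected languages.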

inishchith commented 5 years ago

> One option for this would be to use linguist, which is the tool GitHub uses for this matter.

@valeriocos Adding linguist as a backend would be leveraged a lot as the project progresses. I would be interested in working on the backend support once I have some clarity about the idea and how it would work.

> The fact is that right now I'm not aware of a way to tell Graal something like "if it is C, run this set of tools, if it is Python, run this other set, and if it is source code in any other language, run this one". linguist could help with this. Maybe by running linguist first (as a backend) and, based on its output, deciding which other tools to run...

@jgbarah That sounds interesting. I'm thinking of something like adding flags to restrict the analysis to specific languages. I'll add a clearer idea here after thinking this through.

inishchith commented 5 years ago

@valeriocos I'd like some clarity on adding linguist as a backend. Let me know what you think :) Thanks

Edit: Sorry, I closed the issue by mistake. I have reopened it.

valeriocos commented 5 years ago

Sure @inishchith, I'll have a look at it and get back to you in the next few days. Thanks

inishchith commented 5 years ago

@valeriocos After understanding @jgbarah's suggestion, I'm thinking we could integrate linguist as a backend (an analyzer for CoCom) in order to infer the programming languages used in a repository. It could also be useful in the future for implementing metrics based on the percentage of each programming language in a repository with multiple languages.

Let me know what you think :) I'd be interested in working through a solution. Thanks

valeriocos commented 5 years ago

Thank you @inishchith , I like the idea! Let's see how to proceed :)

Why would you like to add linguist as an analyzer for CoCom instead of creating a new backend (e.g., CoLang)?

AFAIU linguist returns the percentage of programming languages used in a repo, taking as input the path of the repo (or the snapshot at a given commit), which seems to be incompatible with the logic used in CoCom, which analyzes every file in a commit.

The new backend could rely on two analyzers, linguist and cloc, but in this case the latter would be executed on the full repo (instead of on a single file as done for the CoCom backend).

What do you think ?
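
To make the repo-level versus file-level distinction concrete, here is a minimal sketch of how a per-snapshot language share could be aggregated from per-file records. It assumes the share is computed from file sizes in bytes; the `language_percentages` helper and its input shape are hypothetical, not linguist's or Graal's API.

```python
from collections import defaultdict


def language_percentages(file_records):
    """Aggregate per-file (language, size_in_bytes) pairs into repo-level shares.

    The result describes the whole snapshot, which is why it does not
    map onto a backend that emits one result per file.
    """
    bytes_by_language = defaultdict(int)
    for language, size in file_records:
        bytes_by_language[language] += size

    total = sum(bytes_by_language.values())
    if total == 0:
        return {}
    return {
        language: round(100.0 * size / total, 2)
        for language, size in bytes_by_language.items()
    }
```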

inishchith commented 5 years ago

@valeriocos Thanks for the response.

> Why would you like to add linguist as an analyzer for CoCom instead of creating a new backend (e.g., CoLang)?

Sorry, I missed something there. I was thinking along the lines that inferring the programming language would only be useful for code complexity analysis, whereas it can be extended and be useful to other backends as well. Also, adding it as an analyzer for the CoCom backend doesn't fit well, as you said later.

> AFAIU linguist returns the percentage of programming languages used in a repo, taking as input the path of the repo (or the snapshot at a given commit), which seems to be incompatible with the logic used in CoCom, which analyzes every file in a commit.

Exactly! In that case, we should have a new backend (CoLang, as you suggested).

> The new backend could rely on two analyzers, linguist and cloc, but in this case the latter would be executed on the full repo (instead of on a single file as done for the CoCom backend).

Yes. This adds more clarity to the idea!

@valeriocos @jgbarah Please let me know if I can start working on this. We can discuss the structure of the output and the tests to be added in the corresponding PR, once the new idea has been incorporated.

valeriocos commented 5 years ago

Thank you for your prompt reply @inishchith ! +1 from my side, let's wait for @jgbarah 's feedback

Just an idea that popped up right now. Maybe the work to be done for this new backend could be shared with other people interested in this proposal. Since you already have some experience in writing an analyzer, you could focus on writing the backend, while the analyzers could be done by others. It is possible that the development will be slower, but it can be a good experience for those involved.

apoorvaanand1998 commented 5 years ago

Hi everyone, sorry for not being active in the discussions. College commitments are taking all of my time right now. As mentioned in the proposal, I'll be free from the 16th and will work hard on thinking about CoLang as a backend.

valeriocos commented 5 years ago

Thank you @apoorvaanand1998 for your interest. If you want, you can also explore how to integrate:

* SonarQube data
* Other dependency tools (e.g., SonarGraph)
* Support for COBOL analysis tool <--- which would be really good to have :)

What do you think ?

List of tools:

https://github.com/mre/awesome-static-analysis

inishchith commented 5 years ago

@valeriocos Sorry for the delayed response.

> Just an idea that popped up right now. Maybe the work to be done for this new backend could be shared with other people interested in this proposal. Since you already have some experience in writing an analyzer, you could focus on writing the backend, while the analyzers could be done by others. It is possible that the development will be slower, but it can be a good experience for those involved.

I was thinking of adding the CoLang backend along with the linguist analyzer initially, and then opening a corresponding issue with a proper description of the remaining tasks. Some of the splits would be:

* Integrating the cloc analyzer with the CoLang backend
* Adding appropriate unit tests
* Adding documentation

Let me know what you think. I'd be comfortable with making changes and going ahead with your suggestions. Thanks :)

valeriocos commented 5 years ago

It sounds perfect @inishchith , feel free to start when you want, thanks.

inishchith commented 5 years ago

@valeriocos Thanks for the speedy response.

@apoorvaanand1998 Thanks for your interest in the discussion. Feel free to add your ideas here. We'll have some issues open in the next few days :)

inishchith commented 5 years ago

@valeriocos I need some suggestions regarding the result to be produced.

For a given repository (for instance, kibiter), the output produced by linguist would be:

91.14%  JavaScript
5.26%   HTML
3.40%   CSS
0.09%   Shell
0.06%   Dockerfile
0.04%   CartoCSS
0.02%   Batchfile

JavaScript:
Gruntfile.js
packages/eslint-config-kibana/.eslintrc.js
packages/eslint-config-kibana/jest.js
packages/eslint-plugin-kibana-custom/index.js
scripts/backport.js
........
.......

I'm thinking of the following result structure for every snapshot at a given commit (the breakdown would be included only when a details flag is set):

{
    "languages": {
        "JavaScript": 91.14,
        "HTML": 5.26,
        "CSS": 3.40,
        ...
    },
    "breakdown": {
        "JavaScript": ["Gruntfile.js", "packages/eslint-config-kibana/.eslintrc.js", "packages/eslint-config-kibana/jest.js", ...],
        "HTML": ...,
        ...
    }
}

What do you think?
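
A minimal sketch of how the text output shown above could be turned into the proposed structure. It assumes the output has exactly the two parts seen in the example (percentage lines, then `Language:` sections listing files); the `parse_linguist_output` helper and its `details` parameter are hypothetical names, not part of linguist or Graal.

```python
import re

# Matches lines such as "91.14%  JavaScript".
PERCENT_LINE = re.compile(r"^(\d+(?:\.\d+)?)%\s+(.+)$")


def parse_linguist_output(text, details=False):
    """Parse percentage lines and per-language file lists into a dict."""
    result = {"languages": {}}
    breakdown = {}
    current_language = None

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = PERCENT_LINE.match(line)
        if match:
            result["languages"][match.group(2)] = float(match.group(1))
        elif line.endswith(":"):
            # Start of a breakdown section, e.g. "JavaScript:".
            current_language = line[:-1]
            breakdown[current_language] = []
        elif current_language is not None:
            breakdown[current_language].append(line)

    # The file lists are only attached when explicitly requested.
    if details:
        result["breakdown"] = breakdown
    return result
```

Leaving the breakdown out by default keeps the item small for large repositories, and the flag makes the longer form opt-in.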

valeriocos commented 5 years ago

I'm not sure about the breakdown section; for large repositories this could be a really long list. We could start with the easiest solution, no breakdown section, and add it in the future (maybe as a breakdown at folder level). What do you think @inishchith?

inishchith commented 5 years ago

@valeriocos Yes, for large repositories it would be a long list and would clutter the result. The idea of a breakdown at folder level sounds good to me, though it would require an explicit entry point from the user. I'll mark the breakdown task as a TODO. Thanks for the suggestion. I'll open a PR soon :)
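
One possible shape for the folder-level breakdown, as a minimal sketch: collapse each language's file list into file counts per top-level folder. The `breakdown_by_folder` helper and its `depth` parameter are hypothetical and only illustrate the idea.

```python
from collections import Counter
from pathlib import PurePosixPath


def breakdown_by_folder(files_by_language, depth=1):
    """Count files per folder (truncated to `depth` levels) for each language."""
    summary = {}
    for language, files in files_by_language.items():
        counts = Counter()
        for file_path in files:
            parents = PurePosixPath(file_path).parts[:-1]
            # Files at the repository root are grouped under ".".
            folder = "/".join(parents[:depth]) or "."
            counts[folder] += 1
        summary[language] = dict(counts)
    return summary
```

The size of the result then depends on the number of folders rather than the number of files, which addresses the concern about long lists.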

valeriocos commented 5 years ago

great! thanks @inishchith

apoorvaanand1998 commented 5 years ago

> Thank you @apoorvaanand1998 for your interest. If you want, you can also explore how to integrate:
>
> * [Sonarqube](https://www.sonarqube.org/) data
> * Other dependencies tools (e.g., [SonarGraph](https://github.com/sonargraph))
> * Support for COBOL analysis tool <--- which would be really good to have :)
>
> What do you think ?
>
> List of tools:
>
> * https://github.com/mre/awesome-static-analysis

Hi @valeriocos, sorry for the late response. I've been looking into COBOL analyzers, and I cannot find anything that is open source. Everything is a "product". The only thing I could find was this, but I couldn't find any documentation on it. I feel like this is a dead end.

SonarQube has an analyzer for COBOL called SonarCOBOL, but it is only available in the enterprise edition. The open source SonarGraph components you linked also require SonarGraph, which is itself a commercial tool.

There is SonarQube community edition and SonarGraph explorer which are free and open source. Should I explore these? I don't know enough about them to know if they can even easily be integrated.

While doing my research, I found Yasca, which has a "COBOL analyzer". Yasca is a deprecated open source project. It had this analyzer, which, if I understand correctly, only does one thing: it counts the number of getmains and freemains and checks whether they're equal. I don't know enough about COBOL to understand what these are, but IMO this doesn't produce enough data.

How do I proceed from here?
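
For reference, the check described above (count getmains and freemains and compare them) can be sketched in a few lines. This is a plain keyword count written from that description, not a port of the Yasca analyzer; it does not skip comments or string literals, and the function name is hypothetical.

```python
import re

# Whole-word, case-insensitive matches of the two keywords.
GETMAIN = re.compile(r"\bGETMAIN\b", re.IGNORECASE)
FREEMAIN = re.compile(r"\bFREEMAIN\b", re.IGNORECASE)


def getmain_freemain_balance(source):
    """Count GETMAIN and FREEMAIN occurrences and report whether they match."""
    getmains = len(GETMAIN.findall(source))
    freemains = len(FREEMAIN.findall(source))
    return {
        "getmains": getmains,
        "freemains": freemains,
        "balanced": getmains == freemains,
    }
```

A single boolean per file is indeed very little data compared with what the other analyzers produce.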

apoorvaanand1998 commented 5 years ago

@valeriocos Ping. I'm really stuck. Can you point me in the right direction?

valeriocos commented 5 years ago

Sorry @apoorvaanand1998 for the late reply. What do you think about improving the support for Java projects in Graal ?

A dependency analyzer for Java projects using Maven or Gradle could be a nice addition. Another option is to look on the Internet for open source tools tailored to Java (e.g., https://devua.co/2017/07/19/java-code-quality-tools/?i=1) and select one to be included in Graal.
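
As a starting point for the Maven case, here is a minimal sketch that lists the dependencies declared directly in a `pom.xml`, using only the Python standard library. It assumes the POM uses the standard Maven 4.0.0 namespace and ignores parent POMs, `dependencyManagement` and property substitution; the `list_maven_dependencies` name and output keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def list_maven_dependencies(pom_xml):
    """Return the dependencies declared directly in a pom.xml string."""
    root = ET.fromstring(pom_xml)
    dependencies = []
    for dep in root.findall("./m:dependencies/m:dependency", POM_NS):
        dependencies.append({
            "group_id": dep.findtext("m:groupId", default="", namespaces=POM_NS),
            "artifact_id": dep.findtext("m:artifactId", default="", namespaces=POM_NS),
            # The version may be omitted when it is managed elsewhere.
            "version": dep.findtext("m:version", default=None, namespaces=POM_NS),
        })
    return dependencies
```

Resolving the full dependency tree would need Maven itself, so a real analyzer would more likely wrap an existing tool than parse the XML by hand.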

apoorvaanand1998 commented 5 years ago

I shall check these out @valeriocos, thank you. I was also thinking of translating the Yasca analyzer for COBOL to Python and sending a PR. At least this way we can get started with a COBOL analyzer. Does this sound like a good idea?

valeriocos commented 5 years ago

You're welcome @apoorvaanand1998 .

> I was also thinking of translating the Yasca analyzer for COBOL to Python and sending a PR

It sounds like too much work. Probably the idea of providing support for COBOL wasn't a good one; as you said, there is almost nothing out there that can be plugged into Graal. Maybe it is better to focus on other languages that are more popular and have more available analyzers. What do you think?

apoorvaanand1998 commented 5 years ago

I agree @valeriocos, I shall get started on my research and when I have a clear idea, I'll open an issue for more specific discussion. Is that okay?

valeriocos commented 5 years ago

that's perfect @apoorvaanand1998 , thank you :)

chaoss / grimoirelab-graal

[discussion] Improvements in existing analyzers and Additions #18