Open inishchith opened 5 years ago
Thank you @inishchith for opening this issue. We could start commenting on possible improvements and additions to trigger discussions and evaluations. Maybe later we could create other issues to focus on specific tasks. Find below some ideas; anyone is free to share their own ideas (cc @jgbarah ).
**too much python** The backends CoQua, CoDep and CoVuln rely on analyzers for Python. It could be useful to include analyzers that target other (popular) languages.
**how to deal with multi-language repos** Some repos rely on more than one programming language (e.g. frontend and backend languages), so Graal could execute different analyzers when processing a repo. A similar approach has already been implemented for CoCom, which relies on cloc and lizard (triggered based on the extensions of the files processed by Graal). Another option could be to execute Graal several times, once per programming language; this would allow the user to focus on a specific language.
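The extension-based dispatch CoCom uses could be sketched roughly like this. This is a minimal illustration only: the extension map and analyzer names below are hypothetical, not Graal's actual tables.

```python
import os

# Hypothetical mapping from file extensions to analyzer names;
# CoCom's real tables differ, this just illustrates the dispatch.
ANALYZERS = {
    '.py': 'lizard',
    '.c': 'lizard',
    '.js': 'lizard',
    '.md': 'cloc',
}

def pick_analyzer(file_path, default='cloc'):
    """Return the analyzer registered for the file's extension."""
    _, ext = os.path.splitext(file_path)
    return ANALYZERS.get(ext.lower(), default)

print(pick_analyzer('graal/graal.py'))  # lizard
print(pick_analyzer('README.md'))       # cloc
```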
**analysis of configuration files** Nowadays configuration files are pretty popular, but currently Graal doesn't take them into account. For instance, some tools already exist to validate Dockerfiles, while other tools inspect the content of Docker containers to extract useful information about dependencies and vulnerabilities (cc @neglectos).
Thanks for the opportunity. I share @valeriocos's concerns and ideas. One additional improvement could be to run some tool / heuristics on every file to infer its programming language. This would allow us to skip binary files from analysis, but also to target specific tools to specific files based on language, for example.
One option for this would be to use linguist, which is the tool GitHub uses for this matter.
@jgbarah what do you think about integrating linguist as a backend in graal?
Not sure if as a backend, but it could be an idea.
The fact is that right now I'm not aware of a way to tell graal something like "if it is C, run this set of tools; if it is Python, run this other set; and if it is source code in any other language, run this one". linguist could help with this. Maybe by running linguist first (as a backend) and, based on its output, deciding on the other tools to run...
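That "run linguist first, then decide" idea could be sketched as below, assuming the `github-linguist` CLI is installed and prints lines like `91.14% JavaScript`. The per-language tool sets are invented for illustration, not choices Graal currently makes.

```python
import subprocess

# Illustrative tool sets only; not Graal's actual configuration.
TOOLS_BY_LANGUAGE = {
    'C': ['cloc', 'lizard'],
    'Python': ['cloc', 'lizard', 'bandit'],
}

def parse_languages(linguist_output):
    """Extract language names from lines such as '91.14% JavaScript'."""
    languages = []
    for line in linguist_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].endswith('%'):
            languages.append(' '.join(parts[1:]))
    return languages

def plan_tools(repo_path, default=('cloc',)):
    """Run linguist on a repo and map each detected language to tools."""
    out = subprocess.check_output(['github-linguist', repo_path], text=True)
    return {lang: TOOLS_BY_LANGUAGE.get(lang, list(default))
            for lang in parse_languages(out)}
```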
> One option for this would be to use linguist, which is the tool GitHub uses for this matter.
@valeriocos Adding linguist as a backend could really be leveraged as the project progresses. I would be interested in working on adding the backend support once I have some clarity on the idea and how it works.
> The fact is that right now I'm not aware of a way to tell graal something like "if it is C, run this set of tools; if it is Python, run this other set; and if it is source code in any other language, run this one". linguist could help with this. Maybe by running linguist first (as a backend) and, based on its output, deciding on the other tools to run...
@jgbarah That sounds interesting. I'm thinking of something related to adding flags to restrict the analysis to specific languages. I'll add a clearer idea here after thinking this through.
@valeriocos I'd like some clarity on adding linguist as a backend. Let me know what you think :)
Thanks
Edit: sorry, I closed the issue by mistake. I have reopened it.
Sure @inishchith , I'll have a look at it and get back to you in the next days. Thanks
@valeriocos After understanding @jgbarah 's suggestion, I'm thinking we could integrate linguist as a backend (an analyzer for CoCom) in order to infer the programming languages used in a repository. It could also be useful in the future for implementing metrics based on the percentage of each programming language in a repository with multiple languages.
Let me know what you think :) I'd be interested in working through a solution. Thanks
Thank you @inishchith , I like the idea! Let's see how to proceed :)
Why would you like to add linguist as an analyzer for CoCom instead of creating a new backend (e.g., CoLang)?
AFAIU linguist returns the percentage of programming languages used in a repo, taking as input the path of the repo (or the snapshot at a given commit), which seems to be incompatible with the logic used in CoCom, which analyzes every file in a commit.
The new backend could rely on two analyzers, linguist and cloc, but in this case the latter would be executed on the full repo (instead of on a single file as done for the CoCom backend).
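The repo-level cloc part could look roughly like this, assuming a cloc version with `--json` support (available since cloc 1.66); the wrapper function names are made up for illustration.

```python
import json
import subprocess

def parse_cloc_json(raw):
    """Drop cloc's 'header' and 'SUM' entries, keep per-language counts."""
    data = json.loads(raw)
    return {lang: counts for lang, counts in data.items()
            if lang not in ('header', 'SUM')}

def analyze_repo(repo_path):
    """Run cloc once on the whole repo instead of per file."""
    out = subprocess.check_output(['cloc', '--json', repo_path], text=True)
    return parse_cloc_json(out)
```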
What do you think ?
@valeriocos Thanks for the response.
> Why would you like to add linguist as an analyzer for CoCom instead of creating a new backend (e.g., CoLang)?
Sorry, I missed something there. I was thinking along the lines that inferring the programming language might only be useful for code complexity analysis, whereas it can be extended and useful to other backends as well; also, the idea of adding an analyzer to the CoCom backend doesn't fit well, as you later said.
> AFAIU linguist returns the percentage of programming languages used in a repo, taking as input the path of the repo (or the snapshot at a given commit), which seems to be incompatible with the logic used in CoCom, which analyzes every file in a commit.
Exactly! In that case, we should have a new backend (CoLang, as you suggested).
> The new backend could rely on two analyzers, linguist and cloc, but in this case the latter would be executed on the full repo (instead of on a single file as done for the CoCom backend).
Yes, this adds more clarity to the idea!
@valeriocos @jgbarah Please let me know if I can start working on this. We can have a discussion on the structure of the output produced and the tests to be added while incorporating the new idea in the corresponding PR.
Thank you for your prompt reply @inishchith ! +1 from my side, let's wait for @jgbarah 's feedback
Just an idea that popped up right now. Maybe the work to be done for this new backend could be shared with other people interested in this proposal. Since you have already some experience in writing an analyzer, you could focus on writing the backend, while the analyzers could be done by others. It is possible that the development will be slower, but it can be a good experience for those ones involved.
Hi everyone, sorry for not being active in the discussions. College commitments are taking all of my time right now. As mentioned in the proposal, I'll be free from the 16th and will then think hard about CoLang as a backend.
Sorry for the delayed response. @valeriocos
> Just an idea that popped up right now. Maybe the work to be done for this new backend could be shared with other people interested in this proposal. Since you have already some experience in writing an analyzer, you could focus on writing the backend, while the analyzers could be done by others. It is possible that the development will be slower, but it can be a good experience for those ones involved.
I was thinking to add the CoLang backend along with the linguist analyzer initially, and then open up a corresponding issue with a proper description of the tasks remaining, one of the splits being the cloc analyzer for the CoLang backend. Let me know what you think. I'd be comfortable making changes and going ahead with your suggestions. Thanks :)
It sounds perfect @inishchith , feel free to start when you want, thanks.
@valeriocos Thanks for the speedy response.
@apoorvaanand1998 Thanks for your interest in the discussion. Feel free to add your ideas here. We'll have some issues open in the next few days :)
@valeriocos I needed some suggestions here regarding the result to be produced.
The output produced by linguist for a given repository (for instance, the kibiter repository) would be:
```
91.14% JavaScript
5.26% HTML
3.40% CSS
0.09% Shell
0.06% Dockerfile
0.04% CartoCSS
0.02% Batchfile

JavaScript:
Gruntfile.js
packages/eslint-config-kibana/.eslintrc.js
packages/eslint-config-kibana/jest.js
packages/eslint-plugin-kibana-custom/index.js
scripts/backport.js
........
.......
```
I'm thinking of the following structure of the result for every snapshot at a given commit (with the breakdown section included in case the details flag is set):
```
{
  "languages": {
    "JavaScript": 91.14,
    "HTML": 5.26,
    "CSS": 3.40,
    ...
  },
  "breakdown": {
    "JavaScript": ["Gruntfile.js", "packages/eslint-config-kibana/.eslintrc.js", "packages/eslint-config-kibana/jest.js" ... ],
    "HTML": ..
    ........
    ........
  }
}
```
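A structure like this could be filled by parsing linguist's `--breakdown` output. A rough parser, based only on the sample output shown earlier in this thread (not on linguist's documented format), might look like:

```python
def parse_breakdown(output):
    """Parse `github-linguist --breakdown` text into the proposed dict."""
    result = {'languages': {}, 'breakdown': {}}
    current = None
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(':'):                 # e.g. "JavaScript:"
            current = line[:-1]
            result['breakdown'][current] = []
        elif line.split()[0].endswith('%'):    # e.g. "91.14% JavaScript"
            percent, language = line.split(None, 1)
            result['languages'][language] = float(percent.rstrip('%'))
            current = None
        elif current:                          # a file under a language
            result['breakdown'][current].append(line)
    return result
```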
What do you think?
I'm not sure about the breakdown section, for large repositories this could be a really long list. We could start with the easiest solution, no breakdown section, and add it in the future (maybe, a breakdown at folder level). What do you think @inishchith ?
@valeriocos Yes, for large repositories it'd be a long list and would clutter the result produced.
The idea of a breakdown at folder level sounds good to me; it would require an explicit entry point from the user. I'll mark the breakdown task as a TODO.
Thanks for the suggestion. I'll open a PR soon :)
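For reference, the folder-level breakdown could later be derived from the per-file lists; a small sketch of that idea (my own illustration, assuming linguist's `/`-separated paths):

```python
from collections import Counter

def folder_breakdown(breakdown, depth=1):
    """Collapse per-file lists to their top-level folders per language.

    Top-level files (no folder component) are kept under their own name.
    """
    result = {}
    for language, files in breakdown.items():
        counts = Counter('/'.join(f.split('/')[:depth]) for f in files)
        result[language] = dict(counts)
    return result
```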
great! thanks @inishchith
Thank you @apoorvaanand1998 for your interest. If you want, you can also explore how to integrate:
* [Sonarqube](https://www.sonarqube.org/) data
* Other dependencies tools (e.g., [SonarGraph](https://github.com/sonargraph))
* Support for a COBOL analysis tool <--- which would be really good to have :)
What do you think ?
List of tools:
* https://github.com/mre/awesome-static-analysis
Hi @valeriocos, sorry for the late response. I've been looking into COBOL analyzers, and I cannot find anything that is open source; everything is a "product". The only thing I could find was this, but I couldn't find any documentation on it. I feel like this is a dead end.
SonarQube has an analyzer for COBOL called SonarCOBOL, but it is only available in the enterprise edition. The link you provided for the open source SonarGraph components also requires SonarGraph, which is itself a commercial tool.
There are the SonarQube community edition and SonarGraph Explorer, which are free and open source. Should I explore these? I don't know enough about them to know if they can even easily be integrated.
While doing my research, I found Yasca, which has a "COBOL analyzer". Yasca is a deprecated open source project. Its analyzer, if I understand correctly, only does one thing: it counts the number of getmains and freemains and checks whether they're equal. I don't know enough about COBOL to understand what these are, but IMO they don't produce enough data.
How do I proceed from here?
@valeriocos Ping. I'm really stuck. Can you point me in the right direction?
Sorry @apoorvaanand1998 for the late reply. What do you think about improving the support for Java projects in Graal ?
A dependency analyzer for Java projects using maven or gradle could be a nice addition. Another option is to look on the Internet for open source tools tailored to Java (e.g., https://devua.co/2017/07/19/java-code-quality-tools/?i=1) and select one to be included in graal.
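As a starting point for the Maven side, the `<dependencies>` section of a pom.xml can be read with the standard library alone. This is a sketch of the idea, not an existing Graal analyzer:

```python
import xml.etree.ElementTree as ET

# Maven POMs declare this default XML namespace on <project>.
POM_NS = {'m': 'http://maven.apache.org/POM/4.0.0'}

def maven_dependencies(pom_path):
    """Return (groupId, artifactId, version) tuples from a pom.xml."""
    root = ET.parse(pom_path).getroot()
    deps = []
    for dep in root.findall('.//m:dependencies/m:dependency', POM_NS):
        deps.append(tuple(
            dep.findtext('m:' + tag, namespaces=POM_NS)
            for tag in ('groupId', 'artifactId', 'version')))
    return deps
```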
I shall check these out @valeriocos, thank you. I was also thinking of translating the Yasca analyzer for COBOL to Python and sending a PR. At least this way we can get started with a COBOL analyzer. Does this sound like a good idea?
You're welcome @apoorvaanand1998 .
> I was also thinking of translating the Yasca analyzer for COBOL to Python and sending a PR
It sounds like too much work. Probably the idea of providing support for COBOL wasn't good; as you said, there is almost nothing out there to be plugged into Graal. Maybe it is better to focus on other languages that are more popular and have more available analyzers. What do you think?
I agree @valeriocos, I shall get started on my research and when I have a clear idea, I'll open an issue for more specific discussion. Is that okay?
that's perfect @apoorvaanand1998 , thank you :)
This thread is for discussion related to