apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0

[Question][Data Models] Code Quality / Testing #7765

Open martynvdijke opened 1 month ago

martynvdijke commented 1 month ago

Question

Hi all,

I have a question I hope you can help me with. I see from the roadmap at https://devlake.apache.org/docs/Overview/Roadmap/ that Code Quality / Testing is on the roadmap. Could you explain these data models a bit more, and how you plan on getting / ingesting this data? What is the vision and idea behind them, i.e. what do you want to measure and how?

Background: I want to see if it is possible to measure code coverage and quality metrics alongside build time and correlate them. We have some long CI build pipelines and are curious whether we can fine-tune the number of tests versus build time, or at the very least gain insight into them, and perhaps split the pipelines up. We have the feeling this balance could be improved.

I initially thought of using the webhook for this, but I have since found out that the webhooks are really hooked into specific data models (which makes sense). I am not sure if you are thinking about adding a generic webhook data model, so that special use cases or extra data can be covered?

So, long story short: can you explain the vision and idea behind the Code Quality / Testing data model?

Also, awesome work on this project so far!

dosubot[bot] commented 1 month ago

Hey @martynvdijke!

I'm a bot here to help you with bugs, questions, and contributions. Let's tackle that issue together!

The Code Quality / Testing data model in Apache DevLake includes a set of metrics and data points designed to provide detailed insight into code quality and testing. These metrics are defined in the CqFileMetrics struct in the cq_file_metrics.go file [1].
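For orientation, here is an abridged sketch of that struct. The fields below are an illustrative subset recalled from the SonarQube-derived model, so treat the linked file as the authoritative definition:

```go
package codequality

import "github.com/apache/incubator-devlake/core/models/domainlayer"

// Abridged, illustrative excerpt of the file-level code quality model;
// see backend/core/models/domainlayer/codequality/cq_file_metrics.go
// for the full, authoritative field list.
type CqFileMetrics struct {
	domainlayer.DomainEntity
	ProjectKey             string `gorm:"index"` // project the file belongs to
	FileName               string
	FilePath               string
	CodeSmells             int     // maintainability issues
	Bugs                   int     // reliability issues
	Vulnerabilities        int     // security issues
	Coverage               float64 // test coverage percentage
	DuplicatedLinesDensity float64 // percentage of duplicated lines
	NumOfLines             int
}
```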

Ingesting Data

Apache DevLake ingests code quality and testing data from various DevOps tools through its data-source plugins, which include GitHub, GitLab, Jira, Jenkins, and others. Data collection, extraction, and conversion to the domain layer are handled by subtasks running within these plugins. Users can trigger these tasks via the configuration UI or the API [[2]](https://github.com/apache/incubator-devlake-website/blob/main/blog/2022-06-17-How DevLake is up and running/index.md).

Vision and Idea

The vision behind the Code Quality / Testing data model is to provide a unified platform that can ingest, analyze, and visualize fragmented data from various DevOps tools. This allows teams to extract actionable insights for engineering excellence, developer experience, and community growth. By correlating code quality metrics with other data points like build times, teams can optimize their development processes, fine-tune CI pipelines, and balance the number of tests versus build times.

Webhooks for Custom Data

Regarding the use of webhooks for custom or special use case data, Apache DevLake currently supports pushing data such as "incidents" and "deployments" required by DORA metrics. While the webhook system is quite flexible, it is currently limited to specific data types. However, the system can be configured to relate incidents with deployments at different levels (e.g., service level, product level) [3].
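As a concrete illustration, pushing a deployment for DORA boils down to a single POST against a webhook connection. The snippet below follows the documented webhook payload shape, but the host, connection id (1), and field list are assumptions to verify against your DevLake version:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// One deployment record for webhook connection 1; field names follow
	// the documented DORA deployment payload, but verify them against
	// your DevLake version.
	payload := []byte(`{
		"commit_sha": "015e3d3b480e417aede5a1293bd61de9b0fd051d",
		"repo_url": "https://github.com/apache/incubator-devlake/",
		"environment": "PRODUCTION",
		"start_time": "2024-01-01T12:00:00+00:00",
		"end_time": "2024-01-01T12:05:00+00:00"
	}`)
	resp, err := http.Post(
		"http://localhost:8080/plugins/webhook/1/deployments",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("webhook response:", resp.Status)
}
```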

Correlating Metrics

Apache DevLake correlates different types of data by using a standard data model to bring together data from across the Software Development Life Cycle (SDLC). This unified data integration allows for the creation of intuitive, use-case driven dashboards that provide actionable insights. For example, you can measure code coverage and quality metrics alongside build times to identify areas for optimization in your CI pipelines [4].

If you have any more specific questions or need further details, feel free to ask!

To continue talking to Dosu, mention @dosu.

d4x1 commented 1 month ago

@martynvdijke Thanks for your issue.

How to process Code Quality is a difficult question.

I cannot answer your question all at once, but here is some information and some ideas I can offer.

At present we have only one data source, SonarQube, that belongs to DOMAIN_TYPE_CODE_QUALITY, and the domain layer tables, which can be found at https://github.com/apache/incubator-devlake/tree/main/backend/core/models/domainlayer/codequality, are mapped directly from SonarQube's data models.

As for your questions,

  1. "I am not sure if you are thinking about adding a generic webook data model, so that special use cases can or extra data can be added ?" You can describe it in detail, adding a webhook is easy, but what do you mean by adding a GENRIC webhook data model?
  2. "Fine tune the number of tests vs times" You can:
    • try to build a relation: incident -> line of code -> test, tests with a high incident rate should be run as much as possible.
    • calculate the overlap of tests, and mark some tests are more efficient.

We haven't received enough feedback about code quality yet, so any advice you can give would help. We can explore it together.

martynvdijke commented 1 month ago

Thanks for the response, it clears parts of this up, and thanks for taking the time 👍

  1. My thought is that a generic data model / data source could be a nice addition, to allow for some customization or related use cases that are currently not fully covered. A use case for us right now would be using a generic webhook data model to upload our Python code coverage reports, so that we can use them in a custom dashboard.
  2. That sounds like a nice plan; I will do some tinkering to see what's possible.

We are also still exploring here, but at a high level I think we would want to at least start with:

  1. Upload code coverage metrics of our unit and feature tests (in Python)
  2. Upload the code complexity score

So that we can track these over time and see our improvements in this area.
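To make this concrete, a purely hypothetical payload for such an upload could look like the following; every field name here is invented for illustration and is not an existing DevLake model:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CoverageReport is a hypothetical payload shape for a generic webhook
// upload, invented here to make the idea concrete.
type CoverageReport struct {
	RepoURL      string  `json:"repo_url"`
	CommitSha    string  `json:"commit_sha"`
	Suite        string  `json:"suite"`         // e.g. "unit" or "feature"
	LineCoverage float64 `json:"line_coverage"` // 0-100
	Complexity   float64 `json:"complexity"`    // e.g. average cyclomatic complexity
	ReportedAt   string  `json:"reported_at"`   // ISO 8601 timestamp
}

func main() {
	r := CoverageReport{
		RepoURL:      "https://example.com/our-repo",
		CommitSha:    "abc123",
		Suite:        "unit",
		LineCoverage: 82.5,
		Complexity:   3.4,
		ReportedAt:   "2024-07-01T00:00:00Z",
	}
	b, _ := json.Marshal(r)
	fmt.Println(string(b)) // what we would POST to the generic webhook
}
```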

Hope this makes it a bit clearer?

d4x1 commented 1 month ago

@martynvdijke Webhooks are suitable for posting additional data. Let's talk about this generic data model first; it can be added with little effort. If you post JSON data, what is the expected effect?

  1. A new table is created where all JSON fields are flattened, each with a corresponding column, like:

     | id | field_1             | field_2             |
     |----|---------------------|---------------------|
     | x  | {json_data.field_1} | {json_data.field_2} |

  2. Or just a new table like:

     | id | data        |
     |----|-------------|
     | x  | {json_data} |

Should this table be created dynamically according to the posted data?
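For what it's worth, here is a minimal sketch of the flattening step option 1 implies, assuming nested objects get underscore-joined column names; the actual table creation/alteration would still go through GORM's migrator, similar to what the customize plugin does:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flatten turns a posted JSON object into column/value pairs that a handler
// could use to add columns and insert a row. Nested objects are flattened
// with an underscore-joined prefix; a real implementation would also need
// type mapping and column-name sanitization.
func flatten(prefix string, in map[string]any, out map[string]any) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + "_" + k
		}
		if nested, ok := v.(map[string]any); ok {
			flatten(key, nested, out)
		} else {
			out[key] = v
		}
	}
}

func main() {
	body := []byte(`{"coverage": {"line": 82.5, "branch": 61.0}, "commit_sha": "abc123"}`)
	var parsed map[string]any
	if err := json.Unmarshal(body, &parsed); err != nil {
		panic(err)
	}
	columns := map[string]any{}
	flatten("", parsed, columns)
	fmt.Println(columns) // map[commit_sha:abc123 coverage_branch:61 coverage_line:82.5]
}
```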

martynvdijke commented 1 month ago

Hey,

I think option 1 would be the nicest; it would let you query the data in Grafana in the best way, at least from a user perspective.

d4x1 commented 1 month ago

@martynvdijke Are you interested in submitting a PR? It shouldn't need too much work.

martynvdijke commented 1 month ago

Hey,

Yeah, I should have some time next week to dive into this and submit a PR.

If you have time, some pointers on how to derive the database schema from the JSON would be nice; I have never done anything like that before.

d4x1 commented 1 month ago

@martynvdijke

  1. Start a new api handler in plugin webhook
  2. In this handler
    1. get request body
    2. parse request body
    3. add new tables or new fields in a certain table according to the request data (I think this will help: https://github.com/apache/incubator-devlake/blob/main/backend/plugins/customize/service/service.go#L113)