ersoykadir / Requirement-Traceability-Analysis


Initial Automation of Trace Links #12

Closed ersoykadir closed 1 year ago

ersoykadir commented 1 year ago

Issue Description

We have manually built, graphed, and analyzed traces. Now we need to automate link and graph creation. First, we will create a data structure to store the software artifacts. Then we will build traces between the artifacts automatically, via keyword-based and semantic matching.

Step Details

Steps that will be performed:

Final Actions

Document the results on wiki.

Deadline of the Issue

03.04.2023 @23.59

codingAku commented 1 year ago

The wiki page for the automation of the graphs is here. Please document your findings.

ersoykadir commented 1 year ago

The initial parsing script has been added to the codespace. After a meeting with @codingAku to review and refine it, we can proceed with building trace links using the findings from #11. The script acquires data through the GitHub GraphQL API and parses it into a simple node class structure. We need to discuss the details of this node class.
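A minimal sketch of what such a parsing step could look like, assuming a `requests`-based call to the GitHub GraphQL API and a bare-bones node class. All names (`ArtifactNode`, `fetch_issue_nodes`, the query shape) are illustrative, not the ones used in the actual script:

```python
# Hypothetical sketch: fetch issues via the GitHub GraphQL API and wrap them
# in a minimal node class with id and text fields.
import os
import requests

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

ISSUES_QUERY = """
query($owner: String!, $name: String!, $first: Int!) {
  repository(owner: $owner, name: $name) {
    issues(first: $first) {
      nodes { number title body }
    }
  }
}
"""

class ArtifactNode:
    """A software artifact (issue, requirement, ...) reduced to an id and a text field."""
    def __init__(self, node_id, text):
        self.id = node_id
        self.text = text

def fetch_issue_nodes(owner, name, token, first=100):
    # Authenticated POST against the GraphQL endpoint.
    response = requests.post(
        GITHUB_GRAPHQL_URL,
        json={"query": ISSUES_QUERY,
              "variables": {"owner": owner, "name": name, "first": first}},
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    issues = response.json()["data"]["repository"]["issues"]["nodes"]
    # Parse each issue into the simple node structure (id = issue number, text = title + body).
    return [ArtifactNode(i["number"], f'{i["title"]} {i["body"] or ""}') for i in issues]

if __name__ == "__main__":
    nodes = fetch_issue_nodes("ersoykadir", "Requirement-Traceability-Analysis",
                              os.environ["GITHUB_TOKEN"])
    print(len(nodes), "issue nodes parsed")
```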

ersoykadir commented 1 year ago

Assumptions (in progress)

How to search?

How do we search with multiple keywords?

Candidate multiple-keyword system:

ersoykadir commented 1 year ago

We have combined the keyword extractor with the parsing results.

After parsing, we created node objects with id and text fields. We then ran the keyword extractor on the text field of the requirement nodes. After acquiring a list of keywords for a requirement, we searched for each keyword across the existing issues, saving the issue nodes with matching keywords to a set.

At first, we simply merged the per-keyword sets of found issue nodes, yielding the related issues for a requirement (see the sketch below).
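A minimal sketch of this keyword search, assuming the `ArtifactNode` objects from the parsing sketch above and some `extract_keywords()` helper standing in for the extractor from #11; all names are illustrative:

```python
# Keyword-based matching: collect, per keyword, the set of issue nodes whose
# text contains that keyword, then merge the sets into the related issues.

def match_issues_by_keywords(requirement, issue_nodes, extract_keywords):
    """Return {keyword: set of issue nodes whose text contains that keyword}."""
    keywords = extract_keywords(requirement.text)
    matches = {}
    for keyword in keywords:
        matches[keyword] = {issue for issue in issue_nodes
                            if keyword.lower() in issue.text.lower()}
    return matches

def related_issues(matches):
    """First approach: simply merge the per-keyword sets into one set of related issues."""
    related = set()
    for issue_set in matches.values():
        related |= issue_set
    return related
```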

We aimed to decrease the noise with the candidate multiple-keyword check system described above. We chose a threshold of 10 for the length of a keyword's matched-issues list and decided to prune keywords that match more than 10 issues. The pruning traverses the matching issues and removes those that match only that frequent keyword (see the pruning sketch below).
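A sketch of the pruning step as described, under the assumption that an issue contributed by a frequent keyword is dropped only when no other extracted keyword matches it; function and variable names are illustrative:

```python
# Pruning: keywords matching more than the threshold number of issues are
# treated as too generic; issues matched only by such a keyword are dropped.
NOISE_THRESHOLD = 10

def prune_frequent_keywords(matches, threshold=NOISE_THRESHOLD):
    frequent = {kw for kw, issues in matches.items() if len(issues) > threshold}
    related = set()
    for keyword, issues in matches.items():
        if keyword in frequent:
            # Keep an issue only if at least one other keyword also matches it.
            issues = {i for i in issues
                      if any(i in matches[other]
                             for other in matches if other != keyword)}
        related |= issues
    return related
```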

Some requirements whose extracted keywords are odd still get problematic matchings. Successful keyword extraction appears to be critical.

Currently the search is organized as: for each keyword, look for matching issues.

We can invert this to: for each issue, look for matching keywords. By saving the number of matching keywords for each issue, we can extract the issues that have many matches.

Per issue, such counts can be a helpful metric.

Essentially, this amounts to using tf-idf techniques.
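A possible sketch of the inverted search with a tf-idf-style weighting: count the requirement keywords matched by each issue and down-weight keywords that occur in many issues. This only illustrates the idea, not the implemented script, and all names are hypothetical:

```python
# Inverted direction: score each issue by its matching keywords, weighting
# rarer keywords (low document frequency) more heavily.
import math

def score_issues(keywords, issue_nodes):
    n = len(issue_nodes)
    # Document frequency: in how many issues does each keyword occur?
    df = {kw: sum(1 for issue in issue_nodes if kw.lower() in issue.text.lower())
          for kw in keywords}
    scores = {}
    for issue in issue_nodes:
        text = issue.text.lower()
        matched = [kw for kw in keywords if kw.lower() in text]
        # More matching keywords and rarer keywords give a higher score.
        scores[issue.id] = sum(math.log(n / df[kw]) for kw in matched if df[kw] > 0)
    # Highest-scoring issues first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```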

I will provide a summary of the parsing and node structures, as well as the keyword search system, on the wiki, after trying a couple more things to reduce the noise tomorrow.

ersoykadir commented 1 year ago

For now, we produce textual results for the trace links. The first line consists of the requirement number and the requirement itself. The next line contains the dictionary of extracted keywords and the number of issues matched to each. The remaining lines are the issues matched to this requirement, with issue number and issue title.
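A rough sketch of how such a report could be written, assuming requirement and issue nodes expose `id` and `text` fields; the exact format in the repo may differ:

```python
# Write the textual trace-link result: requirement line, keyword-count
# dictionary line, then one line per matched issue.
def write_trace_report(path, requirement, keyword_counts, matched_issues):
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"Requirement {requirement.id}: {requirement.text}\n")
        f.write(f"{keyword_counts}\n")
        for issue in matched_issues:
            f.write(f"Issue #{issue.id}: {issue.text}\n")
```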

The results of the first trials can be seen in the repo under trace results.

codingAku commented 1 year ago

Hello. I believe that with the initial results on the repo, we can close the issue and continue working on improvements. I didn't see any syntax errors in the wiki either. Thank you.