ersoykadir / Requirement-Traceability-Analysis


Initial Automation of Trace Links #12

Closed ersoykadir closed 1 year ago

ersoykadir commented 1 year ago

Issue Description

We have manually built, graphed, and analyzed traces. Now we need to automate link and graph creation. First, we will create a data structure to store the software artifacts. Then we will build traces between the artifacts automatically, via keyword-based and semantic matching.

Step Details

Steps that will be performed:

Final Actions

Document the results on wiki.

Deadline of the Issue

03.04.2023 @23.59

codingAku commented 1 year ago

The wiki page for the automation of the graphs is here. Please document your findings.

ersoykadir commented 1 year ago

The initial parsing script has been added to the codespace. After a meeting with @codingAku to review and refine it, we can proceed with building trace links using the findings from #11. The script acquires data through the GitHub GraphQL API and parses it into a simple node class structure. We need to discuss the details of this node class.
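A minimal sketch of what such a parsing step could look like, assuming a `requests`-based call to the GitHub GraphQL API and a bare-bones node class. All names (`ArtifactNode`, `fetch_issue_nodes`, the query shape) are illustrative, not the ones used in the actual script:

```python
# Hypothetical sketch: fetch issues via the GitHub GraphQL API and wrap them
# in a minimal node class with id and text fields.
import os
import requests

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

ISSUES_QUERY = """
query($owner: String!, $name: String!, $first: Int!) {
  repository(owner: $owner, name: $name) {
    issues(first: $first) {
      nodes { number title body }
    }
  }
}
"""

class ArtifactNode:
    """A software artifact (issue, requirement, ...) reduced to an id and a text field."""
    def __init__(self, node_id, text):
        self.id = node_id
        self.text = text

def fetch_issue_nodes(owner, name, token, first=100):
    # Authenticated POST against the GraphQL endpoint.
    response = requests.post(
        GITHUB_GRAPHQL_URL,
        json={"query": ISSUES_QUERY,
              "variables": {"owner": owner, "name": name, "first": first}},
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    issues = response.json()["data"]["repository"]["issues"]["nodes"]
    # Parse each issue into the simple node structure (id = issue number, text = title + body).
    return [ArtifactNode(i["number"], f'{i["title"]} {i["body"] or ""}') for i in issues]

if __name__ == "__main__":
    nodes = fetch_issue_nodes("ersoykadir", "Requirement-Traceability-Analysis",
                              os.environ["GITHUB_TOKEN"])
    print(len(nodes), "issue nodes parsed")
```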

ersoykadir commented 1 year ago

Assumptions (in progress)

How to search?

How do we search with multiple keywords?

Candidate multiple-keyword system:

ersoykadir commented 1 year ago

We have combined the keyword extractor with the parsing results.

After parsing, we created node objects with id and text fields. We then ran the keyword extractor on the text field of the requirement nodes. After acquiring a list of keywords for a requirement, we searched for each keyword across the existing issues, saving the issue nodes with matching keywords to a set.

At first, we simply merged the per-keyword sets of found issue nodes, yielding the related issues for a requirement (see the sketch below).
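A minimal sketch of this keyword search, assuming the `ArtifactNode` objects from the parsing sketch above and some `extract_keywords()` helper standing in for the extractor from #11; all names are illustrative:

```python
# Keyword-based matching: collect, per keyword, the set of issue nodes whose
# text contains that keyword, then merge the sets into the related issues.

def match_issues_by_keywords(requirement, issue_nodes, extract_keywords):
    """Return {keyword: set of issue nodes whose text contains that keyword}."""
    keywords = extract_keywords(requirement.text)
    matches = {}
    for keyword in keywords:
        matches[keyword] = {issue for issue in issue_nodes
                            if keyword.lower() in issue.text.lower()}
    return matches

def related_issues(matches):
    """First approach: simply merge the per-keyword sets into one set of related issues."""
    related = set()
    for issue_set in matches.values():
        related |= issue_set
    return related
```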

We aimed to decrease the noise with the candidate multiple-keyword check system described above. We chose a threshold of 10 for the length of a keyword's matched-issues list and decided to prune keywords that match more than 10 issues. The pruning traverses the matching issues and removes those that match only that frequent keyword (see the pruning sketch below).
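A sketch of the pruning step as described, under the assumption that an issue contributed by a frequent keyword is dropped only when no other extracted keyword matches it; function and variable names are illustrative:

```python
# Pruning: keywords matching more than the threshold number of issues are
# treated as too generic; issues matched only by such a keyword are dropped.
NOISE_THRESHOLD = 10

def prune_frequent_keywords(matches, threshold=NOISE_THRESHOLD):
    frequent = {kw for kw, issues in matches.items() if len(issues) > threshold}
    related = set()
    for keyword, issues in matches.items():
        if keyword in frequent:
            # Keep an issue only if at least one other keyword also matches it.
            issues = {i for i in issues
                      if any(i in matches[other]
                             for other in matches if other != keyword)}
        related |= issues
    return related
```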

Some requirements whose extracted keywords are odd still get problematic matchings. Successful keyword extraction appears to be critical.

Currently the search is organized as: for each keyword, look for matching issues.

We can invert this to: for each issue, look for matching keywords. By saving the number of matching keywords for each issue, we can extract the issues that have many matches.

Per issue, such counts can be a helpful metric.

Essentially, this amounts to using tf-idf techniques.
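A possible sketch of the inverted search with a tf-idf-style weighting: count the requirement keywords matched by each issue and down-weight keywords that occur in many issues. This only illustrates the idea, not the implemented script, and all names are hypothetical:

```python
# Inverted direction: score each issue by its matching keywords, weighting
# rarer keywords (low document frequency) more heavily.
import math

def score_issues(keywords, issue_nodes):
    n = len(issue_nodes)
    # Document frequency: in how many issues does each keyword occur?
    df = {kw: sum(1 for issue in issue_nodes if kw.lower() in issue.text.lower())
          for kw in keywords}
    scores = {}
    for issue in issue_nodes:
        text = issue.text.lower()
        matched = [kw for kw in keywords if kw.lower() in text]
        # More matching keywords and rarer keywords give a higher score.
        scores[issue.id] = sum(math.log(n / df[kw]) for kw in matched if df[kw] > 0)
    # Highest-scoring issues first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```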

I will provide a summary of the parsing and node structures, as well as the keyword search system, on the wiki, after trying a couple more things to reduce the noise tomorrow.

ersoykadir commented 1 year ago

For now, we produce textual results for the trace links. The first line consists of the requirement number and the requirement itself. The next line contains the dictionary of extracted keywords and the number of issues matched to each. The remaining lines are the issues matched to this requirement, with issue number and issue title.
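A rough sketch of how such a report could be written, assuming requirement and issue nodes expose `id` and `text` fields; the exact format in the repo may differ:

```python
# Write the textual trace-link result: requirement line, keyword-count
# dictionary line, then one line per matched issue.
def write_trace_report(path, requirement, keyword_counts, matched_issues):
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"Requirement {requirement.id}: {requirement.text}\n")
        f.write(f"{keyword_counts}\n")
        for issue in matched_issues:
            f.write(f"Issue #{issue.id}: {issue.text}\n")
```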

The results of the first trials can be seen in the repo under trace results.

codingAku commented 1 year ago

Hello. I believe that with the initial results on the repo, we can close the issue and continue working on improvements. I didn't see any syntax errors in the wiki either. Thank you.