adoptium / aqa-test-tools

Home of Test Results Summary Service (TRSS) and PerfNext. These tools are designed to improve our ability to monitor and triage tests at the Adoptium project. The code is generic enough that it is extensible for use by any project that needs to monitor multiple CI servers and aggregate their results.

Data Collection for deep AQAtik #412

Status: Open. LongyuZhang opened this issue 3 years ago

LongyuZhang commented 3 years ago

To automate the data collection process for deep AQAtik, we need to investigate and work on the following functions:

Related issue: https://github.com/adoptium/aqa-test-tools/issues/355

LongyuZhang commented 3 years ago

FYI @smlambert @llxia

LongyuZhang commented 3 years ago

FYI @avishreekh

avishreekh commented 3 years ago

Thank you @LongyuZhang!

Collect all open issue contents in related repos, e.g. openjdk-tests/issues

We can use the Issues API provided by GitHub for listing the issues of a repository.
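
As a rough sketch (the owner/repo values, token handling, and function name below are illustrative assumptions, not existing TRSS code), the initial collection could look like this:

```python
# Hypothetical sketch: collect all open issues of one repository via the
# GitHub REST API (GET /repos/{owner}/{repo}/issues), following pagination.
import requests

GITHUB_API = "https://api.github.com"

def list_open_issues(owner, repo, token=None):
    """Return all open issues of owner/repo as parsed JSON objects."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    issues = []
    params = {"state": "open", "per_page": 100, "page": 1}
    while True:
        resp = requests.get(f"{GITHUB_API}/repos/{owner}/{repo}/issues",
                            headers=headers, params=params)
        resp.raise_for_status()
        page = resp.json()
        # The Issues API also returns pull requests; keep issues only.
        issues.extend(item for item in page if "pull_request" not in item)
        if len(page) < params["per_page"]:
            break
        params["page"] += 1
    return issues

# usage (illustrative): issues = list_open_issues("adoptium", "aqa-tests")
```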

After storing all existing issue contents, continuously monitor and collect new issues in these repos.

For collecting new issues, we could save the last-updated timestamp each time we query. We could then pass this timestamp to the Issues API on the next query (it allows fetching issues created or updated after a given time via the since parameter). So we maintain a variable that stores the latest timestamp and use it for new queries.
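
A rough sketch of that incremental step, assuming the last-checked timestamp is kept in a simple variable here (in practice it would be persisted, e.g. in the DB):

```python
# Hypothetical sketch: fetch only issues created/updated since a timestamp,
# using the `since` parameter of the Issues API.
import requests
from datetime import datetime, timezone

def fetch_updated_issues(owner, repo, since_iso, token=None):
    """Return issues of owner/repo created/updated after the ISO 8601 timestamp."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers=headers,
        params={"state": "all", "since": since_iso, "per_page": 100},
    )
    resp.raise_for_status()
    return [item for item in resp.json() if "pull_request" not in item]

# Maintain the latest timestamp and reuse it for the next query.
last_checked = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
# ... on the next polling cycle:
# new_or_updated = fetch_updated_issues("adoptium", "aqa-tests", last_checked)
# last_checked = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```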

Please let me know your thoughts on this @LongyuZhang @llxia @smlambert. Thank you!

llxia commented 3 years ago

Talked with @LongyuZhang , below are some of the details:

We should query git repos at an appropriate frequency (every 30 mins?).
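
As a toy illustration of such a schedule (it reuses the hypothetical fetch_updated_issues helper sketched above; the repo list and the print placeholder are assumptions only):

```python
# Hypothetical 30-minute polling loop; fetch_updated_issues is the helper
# sketched in an earlier comment, and REPOS is an illustrative watch list.
import time

POLL_INTERVAL_SECONDS = 30 * 60
REPOS = [("adoptium", "aqa-tests")]

last_checked = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
while True:
    time.sleep(POLL_INTERVAL_SECONDS)
    for owner, repo in REPOS:
        for issue in fetch_updated_issues(owner, repo, last_checked):
            # Placeholder: hand the issue to the filtering/storage steps below.
            print(f"new/updated: {repo}#{issue['number']}")
    last_checked = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
```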

In summary:

Step 1: figure out the GitHub query using since.
Step 2: write a query to poll GitHub periodically.
Step 3: filter the returned data into issue content and test output, and store the files in the file system.
Step 4: store the relationship and data in the DB. If an issue is updated, the data in the DB should be updated accordingly.
Step 5: trigger the ML model training program to read /path to the content file/testOutput/<repo name>_<issue#>.txt
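
A rough, hypothetical sketch of Steps 3 to 5 (the output directory, the heuristic that treats fenced code blocks as test output, and the commented-out DB and training hooks are all illustrative assumptions, not existing TRSS code):

```python
# Hypothetical sketch: split an issue body into "content" and "test output",
# write the test output to <repo name>_<issue#>.txt, and leave hooks for the
# DB update and the ML training trigger.
import os
import re

OUTPUT_DIR = "testOutput"  # stands in for "/path to the content file/testOutput"
FENCE = "`" * 3            # markdown code fence
CODE_BLOCK = re.compile(FENCE + r"(.*?)" + FENCE, re.DOTALL)

def split_issue(body):
    """Assumption: fenced code blocks are test output, the rest is content."""
    body = body or ""
    test_output = "\n".join(m.strip() for m in CODE_BLOCK.findall(body))
    content = CODE_BLOCK.sub("", body).strip()
    return content, test_output

def store_issue(repo, issue):
    content, test_output = split_issue(issue.get("body"))
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    path = os.path.join(OUTPUT_DIR, f"{repo}_{issue['number']}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(test_output)
    # Step 4 (placeholder): upsert (repo, issue number, content, path) into the DB.
    # Step 5 (placeholder): notify the ML training program that `path` is ready.
    return path
```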

avishreekh commented 3 years ago

Thank you for the elaborate discussion @llxia. Please let me know if I can work on this.

llxia commented 3 years ago

Please go ahead. Thanks a lot for working on this!

avishreekh commented 3 years ago

I was wondering if we could use GitHub webhooks instead of polling the API. That way, we would be notified reliably whenever a new issue is added and would not have to keep polling for it.
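
For comparison, a webhook receiver could be as small as the sketch below (Flask is used purely for illustration; the route, port, and handler are assumptions, not existing TRSS code). GitHub would be configured to POST issues events to this endpoint:

```python
# Hypothetical webhook receiver for GitHub "issues" events.
from flask import Flask, request

app = Flask(__name__)

@app.route("/github-webhook", methods=["POST"])
def github_webhook():
    event = request.headers.get("X-GitHub-Event", "")
    payload = request.get_json(silent=True) or {}
    if event == "issues" and payload.get("action") in ("opened", "edited"):
        issue = payload["issue"]
        repo = payload["repository"]["name"]
        # Placeholder: hand the issue to the same storage pipeline as polling.
        print(f"webhook: {payload['action']} {repo}#{issue['number']}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```

One trade-off to keep in mind: webhooks need a publicly reachable endpoint and per-repo configuration, whereas polling only needs API access.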

Please let me know your thoughts on this.

Thank you

LongyuZhang commented 3 years ago

It is a good idea to use GitHub webhooks to monitor new issues, but for the initial collection of existing issues, the Issues API may work better. We can try to use them separately for these two purposes if possible. Thanks.

llxia commented 3 years ago

I agree. Since we need to query multiple repos, I think the GitHub API is more flexible and easier to use. It is also a good idea to keep an eye on alternatives (e.g., webhooks, GitHub workflows, etc.), so we know the advantages and disadvantages of each.

avishreekh commented 3 years ago

Thank you @LongyuZhang @llxia! I will first try to implement the initial collection of issues using the Issues API and poll for new issues using the since parameter. The Webhook integration can be done later if it is found to be a better alternative. I will also look for other alternatives in the meantime.

Please let me know if this sounds like a good strategy to begin with or if any modifications are needed.

Thank you.

LongyuZhang commented 3 years ago

Sounds good! Thanks @avishreekh

llxia commented 3 years ago

For now, we are querying GitHub for issues. But please keep in mind that we may not be limited to GitHub issues; it could be other bug-tracking systems.