adoptium / aqa-test-tools

Home of Test Results Summary Service (TRSS) and PerfNext. These tools are designed to improve our ability to monitor and triage tests at the Adoptium project. The code is generic enough that it is extensible for use by any project that needs to monitor multiple CI servers and aggregate their results.

Data Collection for deep AQAtik #412

Status: Open. LongyuZhang opened this issue 3 years ago

LongyuZhang commented 3 years ago

To automate the data collection process for deep AQAtik, we need to investigate and work on the following functions:

Related issue: https://github.com/adoptium/aqa-test-tools/issues/355

LongyuZhang commented 3 years ago

FYI @smlambert @llxia

LongyuZhang commented 3 years ago

FYI @avishreekh

avishreekh commented 3 years ago

Thank you @LongyuZhang!

Collect all open issue contents in related repos, e.g. openjdk-tests/issues

We can use the Issues API provided by GitHub for listing the issues of a repository.
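
As a rough sketch (the owner/repo values, token handling, and function name below are illustrative assumptions, not existing TRSS code), the initial collection could look like this:

```python
# Hypothetical sketch: collect all open issues of one repository via the
# GitHub REST API (GET /repos/{owner}/{repo}/issues), following pagination.
import requests

GITHUB_API = "https://api.github.com"

def list_open_issues(owner, repo, token=None):
    """Return all open issues of owner/repo as parsed JSON objects."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    issues = []
    params = {"state": "open", "per_page": 100, "page": 1}
    while True:
        resp = requests.get(f"{GITHUB_API}/repos/{owner}/{repo}/issues",
                            headers=headers, params=params)
        resp.raise_for_status()
        page = resp.json()
        # The Issues API also returns pull requests; keep issues only.
        issues.extend(item for item in page if "pull_request" not in item)
        if len(page) < params["per_page"]:
            break
        params["page"] += 1
    return issues

# usage (illustrative): issues = list_open_issues("adoptium", "aqa-tests")
```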

After storing all existing issue contents, continuously monitor and collect new issues in these repos.

For collecting new issues, we could save the last-updated timestamp each time we query. We could then pass this timestamp to the Issues API on the next query (it allows fetching issues created or updated after a given time via the since parameter). So we maintain a variable that stores the latest timestamp and use it for new queries.
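
A rough sketch of that incremental step, assuming the last-checked timestamp is kept in a simple variable here (in practice it would be persisted, e.g. in the DB):

```python
# Hypothetical sketch: fetch only issues created/updated since a timestamp,
# using the `since` parameter of the Issues API.
import requests
from datetime import datetime, timezone

def fetch_updated_issues(owner, repo, since_iso, token=None):
    """Return issues of owner/repo created/updated after the ISO 8601 timestamp."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        headers=headers,
        params={"state": "all", "since": since_iso, "per_page": 100},
    )
    resp.raise_for_status()
    return [item for item in resp.json() if "pull_request" not in item]

# Maintain the latest timestamp and reuse it for the next query.
last_checked = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
# ... on the next polling cycle:
# new_or_updated = fetch_updated_issues("adoptium", "aqa-tests", last_checked)
# last_checked = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```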

Please let me know your thoughts on this @LongyuZhang @llxia @smlambert. Thank you!

llxia commented 3 years ago

Talked with @LongyuZhang , below are some of the details:

We should query git repos at an appropriate frequency (every 30 mins?).
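
As a toy illustration of such a schedule (it reuses the hypothetical fetch_updated_issues helper sketched above; the repo list and the print placeholder are assumptions only):

```python
# Hypothetical 30-minute polling loop; fetch_updated_issues is the helper
# sketched in an earlier comment, and REPOS is an illustrative watch list.
import time

POLL_INTERVAL_SECONDS = 30 * 60
REPOS = [("adoptium", "aqa-tests")]

last_checked = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
while True:
    time.sleep(POLL_INTERVAL_SECONDS)
    for owner, repo in REPOS:
        for issue in fetch_updated_issues(owner, repo, last_checked):
            # Placeholder: hand the issue to the filtering/storage steps below.
            print(f"new/updated: {repo}#{issue['number']}")
    last_checked = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
```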

In summary:

Step 1: figure out the GitHub query using since.
Step 2: write a query to poll GitHub periodically.
Step 3: filter the returned data into issue content and test output, and store the files in the file system.
Step 4: store the relationship and data in the DB. If an issue is updated, the data in the DB should be updated accordingly.
Step 5: trigger the ML model training program to read /path to the content file/testOutput/<repo name>_<issue#>.txt
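
A rough, hypothetical sketch of Steps 3 to 5 (the output directory, the heuristic that treats fenced code blocks as test output, and the commented-out DB and training hooks are all illustrative assumptions, not existing TRSS code):

```python
# Hypothetical sketch: split an issue body into "content" and "test output",
# write the test output to <repo name>_<issue#>.txt, and leave hooks for the
# DB update and the ML training trigger.
import os
import re

OUTPUT_DIR = "testOutput"  # stands in for "/path to the content file/testOutput"
FENCE = "`" * 3            # markdown code fence
CODE_BLOCK = re.compile(FENCE + r"(.*?)" + FENCE, re.DOTALL)

def split_issue(body):
    """Assumption: fenced code blocks are test output, the rest is content."""
    body = body or ""
    test_output = "\n".join(m.strip() for m in CODE_BLOCK.findall(body))
    content = CODE_BLOCK.sub("", body).strip()
    return content, test_output

def store_issue(repo, issue):
    content, test_output = split_issue(issue.get("body"))
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    path = os.path.join(OUTPUT_DIR, f"{repo}_{issue['number']}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(test_output)
    # Step 4 (placeholder): upsert (repo, issue number, content, path) into the DB.
    # Step 5 (placeholder): notify the ML training program that `path` is ready.
    return path
```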

avishreekh commented 3 years ago

Thank you for the elaborate discussion @llxia. Please let me know if I can work on this.

llxia commented 3 years ago

Please go ahead. Thanks a lot for working on this!

avishreekh commented 3 years ago

I was wondering if we could use GitHub webhooks instead of polling the API. That way, we would be notified reliably whenever a new issue is added and would not have to keep polling for it.
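
For comparison, a webhook receiver could be as small as the sketch below (Flask is used purely for illustration; the route, port, and handler are assumptions, not existing TRSS code). GitHub would be configured to POST issues events to this endpoint:

```python
# Hypothetical webhook receiver for GitHub "issues" events.
from flask import Flask, request

app = Flask(__name__)

@app.route("/github-webhook", methods=["POST"])
def github_webhook():
    event = request.headers.get("X-GitHub-Event", "")
    payload = request.get_json(silent=True) or {}
    if event == "issues" and payload.get("action") in ("opened", "edited"):
        issue = payload["issue"]
        repo = payload["repository"]["name"]
        # Placeholder: hand the issue to the same storage pipeline as polling.
        print(f"webhook: {payload['action']} {repo}#{issue['number']}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```

One trade-off to keep in mind: webhooks need a publicly reachable endpoint and per-repo configuration, whereas polling only needs API access.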

Please let me know your thoughts on this.

Thank you

LongyuZhang commented 3 years ago

It is a good idea to use GitHub webhooks to monitor new issues, but for the initial collection of existing issues, the Issues API may work better. We can try to use them separately for these two purposes if possible. Thanks.

llxia commented 3 years ago

I agree. Since we need to query multiple repos, I think the GitHub API is more flexible and easier to use. It is also a good idea to keep an eye on alternatives (e.g., webhooks, GitHub workflows, etc.), so we know the advantages and disadvantages of each.

avishreekh commented 3 years ago

Thank you @LongyuZhang @llxia! I will first try to implement the initial collection of issues using the Issues API and poll for new issues using the since parameter. The Webhook integration can be done later if it is found to be a better alternative. I will also look for other alternatives in the meantime.

Please let me know if this sounds like a good strategy to begin with or if any modifications are needed.

Thank you.

LongyuZhang commented 3 years ago

Sounds good! Thanks @avishreekh

llxia commented 3 years ago

For now, we are querying GitHub for issues. But please keep in mind that we may not be limited to GitHub issues; it could be other bug-tracking systems.