apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0
2.11k stars 1.11k forks source link

List Events returns intermittent SQL exception.Fixed listEvents intermittent exception. #9661

Closed mprokopchuk closed 2 months ago

mprokopchuk commented 2 months ago

Need to check array empty, because com.cloud.utils.db.GenericDaoBase#searchAndCount(SearchCriteria, Filter, boolean) makes two calls: first to get objects and second to get count. List events has start date filter, there is highly possible cause where no objects loaded and next millisecond new event added and finally we ended up with count = 1 and no ids.

Finally user gets back response exception:

com.mysql.cj.jdbc.ClientPreparedStatement: SELECT event_view.id, event_view.uuid, event_view.type, event_view.state, 
event_view.description, event_view.created, event_view.user_id, event_view.user_name, event_view.level, 
event_view.start_id, event_view.start_uuid, event_view.parameters, event_view.account_id, event_view.account_uuid, 
event_view.account_name, event_view.account_type, event_view.domain_id, event_view.domain_uuid, 
event_view.domain_name, event_view.domain_path, event_view.resource_id, event_view.resource_type, 
event_view.project_id, event_view.project_uuid, event_view.project_name, event_view.archived, event_view.display 
FROM event_view WHERE event_view.id IN )

Example call: /client/api?command=listEvents&startdate=2024-07-16+19%3A19%3A16&listall=true&level=INFO&pagesize=1&page=1&expires=XXX&apikey=YYY&response=json&signatureVersion=3&signature=ZZZ

Description

Types of changes

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

Tested manually.

mprokopchuk commented 2 months ago

@blueorangutan test

sureshanaparti commented 2 months ago

@blueorangutan package

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 15.77%. Comparing base (501d8c1) to head (175a020). Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
...ain/java/com/cloud/api/query/QueryManagerImpl.java 0.00% 1 Missing and 1 partial :warning:
...ain/java/com/cloud/event/dao/EventJoinDaoImpl.java 0.00% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #9661 +/- ## ============================================ - Coverage 15.77% 15.77% -0.01% Complexity 12538 12538 ============================================ Files 5621 5621 Lines 491556 491558 +2 Branches 60227 60446 +219 ============================================ - Hits 77562 77561 -1 - Misses 405537 405539 +2 - Partials 8457 8458 +1 ``` | [Flag](https://app.codecov.io/gh/apache/cloudstack/pull/9661/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | | |---|---|---| | [uitests](https://app.codecov.io/gh/apache/cloudstack/pull/9661/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `4.05% <ø> (ø)` | | | [unittests](https://app.codecov.io/gh/apache/cloudstack/pull/9661/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `16.59% <0.00%> (-0.01%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

sureshanaparti commented 2 months ago

@blueorangutan package

blueorangutan commented 2 months ago

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

rohityadavcloud commented 2 months ago

@mprokopchuk to gate DoS on the CI system, only (some) approved users have access to the CI/BO based system. We first need to kick packaging and then kick testing. These two operations are separately to allow more flexibility on what/how we're testing.

@blueorangutan help

blueorangutan commented 2 months ago

@rohityadavcloud [SL] I understand these words: "help", "hello", "thanks", "package", "test" Test command usage: test [mgmt os] [hypervisor] [keepEnv] [qemuEv] [basicZone|securityGroups] Mgmt OS options: ['ol8', 'ol9', 'alma8', 'alma9', 'suse15', 'centos7', 'centos6', 'rocky8', 'ubuntu18', 'ubuntu22', 'ubuntu20', 'ubuntu24'] Hypervisor options: ['kvm-centos6', 'kvm-centos7', 'kvm-rocky8', 'kvm-ol8', 'kvm-ol9', 'kvm-alma8', 'kvm-alma9', 'kvm-ubuntu18', 'kvm-ubuntu20', 'kvm-ubuntu22', 'kvm-ubuntu24', 'kvm-suse15', 'vmware-55u3', 'vmware-60u2', 'vmware-65u2', 'vmware-67u3', 'vmware-70u1', 'vmware-70u2', 'vmware-70u3', 'vmware-80', 'vmware-80u1', 'vmware-80u2', 'vmware-80u3', 'xenserver-65sp1', 'xenserver-71', 'xenserver-74', 'xcpng74', 'xcpng76', 'xcpng80', 'xcpng81', 'xcpng82'] Note: when keepEnv is passed, you need to specify mgmt server os and hypervisor or use the matrix command. when qemuEv is passed, it will deploy KVM hyperviosr hosts with qemu-kvm-ev, else it will default to stock qemu. When basicZone and/or securityGroups are passed it will create a zone of the last type specified (default is Advanced) Package command usage: package [all(default value),kvm,xen,vmware,hyperv,ovm] - a comma separated list can be passed with package command to bundle the required hypervisor's systemVM templates. Not passing any argument will bundle all - kvm,xen and vmware templates.

Blessed contributors for kicking Trillian test jobs: ['rohityadavcloud', 'shwstppr', 'vishesh92', 'Pearl1594', 'harikrishna-patnala', 'nvazquez', 'DaanHoogland', 'weizhouapache', 'borisstoyanov', 'vladimirpetrov', 'kiranchavala', 'andrijapanicsb', 'NuxRo', 'rajujith', 'alexandremattioli', 'sureshanaparti', 'abh1sar']

mprokopchuk commented 2 months ago

@mprokopchuk to gate DoS on the CI system, only (some) approved users have access to the CI/BO based system. We first need to kick packaging and then kick testing. These two operations are separately to allow more flexibility on what/how we're testing.

Sorry, @rohityadavcloud I didn't know that. Thank you for explanation!

blueorangutan commented 2 months ago

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11068

vishesh92 commented 2 months ago

@blueorangutan test

blueorangutan commented 2 months ago

@vishesh92 a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan commented 2 months ago

[SF] Trillian test result (tid-11438) Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8 Total time taken: 54696 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9661-t11438-kvm-ol8.zip Smoke tests completed. 140 look OK, 1 have errors, 0 did not run Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Failure 398.38 test_vpc_redundant.py
rohityadavcloud commented 2 months ago

@JoaoJandre since this is a bugfix, causing sql exceptions intermittent can we consider it?

JoaoJandre commented 2 months ago

@JoaoJandre since this is a bugfix, causing sql exceptions intermittent can we consider it?

Yes, bugfixes are fine, but the author must describe the tests done; and ideally some 3rd party should test as well.

rohityadavcloud commented 2 months ago

@JoaoJandre thanks, we've asked @vishesh92 to help with QA.

vishesh92 commented 2 months ago

Since it's hard to reproduce the error, I used a debugger and modified the value of count for testing. This is the scenarios I tested:

  1. To reproduce the error, make a request to list events such that there are no results. Add a debugger where we are checking if the count == 0, and modify the value of count to 1. This results in the exception:
    com.mysql.cj.jdbc.ClientPreparedStatement: SELECT event_view.id, event_view.uuid, event_view.type, event_view.state, 
    event_view.description, event_view.created, event_view.user_id, event_view.user_name, event_view.level, 
    event_view.start_id, event_view.start_uuid, event_view.parameters, event_view.account_id, event_view.account_uuid, 
    event_view.account_name, event_view.account_type, event_view.domain_id, event_view.domain_uuid, 
    event_view.domain_name, event_view.domain_path, event_view.resource_id, event_view.resource_type, 
    event_view.project_id, event_view.project_uuid, event_view.project_name, event_view.archived, event_view.display 
    FROM event_view WHERE event_view.id IN )
  2. With the patch, I tried the same and I didn't get the error.
vishesh92 commented 2 months ago

@mprokopchuk Since this issue is present in 4.19 branch as well, can you rebase your PR against 4.19 branch?

rohityadavcloud commented 2 months ago

Merging based on Joao's comment: https://github.com/apache/cloudstack/pull/9661#issuecomment-2343561539 and 2 lgtm, manual QA and smoketests. As this causes sql error, it's considered critical for 4.20.