crate / crate-qa

CrateDB Quality Assurance

Increase timeout per upgrade path to 15mins #308

Closed: matriv closed this issue 4 months ago

matriv commented 6 months ago

From recently added logs, it seems that the longest upgrade path (starting with 4.0.x) frequently takes a bit more than 10 minutes.

example failure: https://jenkins.crate.io/blue/organizations/jenkins/CrateDB%2Fqa%2Fcrate_qa/detail/crate_qa/793/pipeline

romseygeek commented 6 months ago
    for version_def in versions[1:]:
        timestamp = datetime.utcnow().isoformat(timespec='seconds')
        print(f"{timestamp} Upgrade to:: {version_def.version}")
        self.assert_data_persistence(version_def, nodes, digest, paths)
    # restart with latest version
    version_def = versions[-1]
    self.assert_data_persistence(version_def, nodes, digest, paths)

I've just realised this is growing as O(n^2), isn't it? Every time we add a new version, it tests an upgrade path to it from every previous supported version. Is it possible to move the timeout onto assert_data_persistence instead? Otherwise we're going to keep hitting this...
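
(To make that growth concrete, a back-of-the-envelope sketch: it only counts upgrade steps, assuming one path per supported starting version that steps through every later version, as the loop above suggests. The version counts are made up.)

    # Rough count of upgrade steps across all paths: with n supported versions
    # and one path per starting version, path i performs (n - 1 - i) upgrades,
    # so the total is n * (n - 1) / 2, i.e. it grows quadratically.
    def total_upgrade_steps(n_versions: int) -> int:
        return sum(n_versions - 1 - i for i in range(n_versions))

    for n in (5, 10, 20):
        print(n, total_upgrade_steps(n))  # -> 10, 45, 190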

matriv commented 6 months ago

The timeout is on each upgrade path, e.g. from 4.0.x to latest, not on the outer loop over all upgrade paths. If we move it to assert_data_persistence, we exclude all the time spent restarting the cluster, so we won't easily catch timeouts during those operations, which could hide issues. What do you think?
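
(For illustration only, a minimal sketch of the two placements being discussed; time_limit, upgrade_and_restart, and the version list are hypothetical stand-ins, not the actual crate_qa helpers.)

    import signal
    from contextlib import contextmanager

    @contextmanager
    def time_limit(seconds):
        # Hypothetical SIGALRM-based limit, only to illustrate the two scopes.
        def _raise(signum, frame):
            raise TimeoutError(f'exceeded {seconds}s')
        old = signal.signal(signal.SIGALRM, _raise)
        signal.alarm(seconds)
        try:
            yield
        finally:
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old)

    def upgrade_and_restart(version): pass       # stand-in for the cluster restart
    def assert_data_persistence(version): pass   # stand-in for the data checks

    versions = ['4.0.12', '4.1.8', '4.2.7']      # made-up example versions

    # Current placement: the limit spans the whole upgrade path, restarts included,
    # so a hang while restarting the cluster still surfaces as a timeout.
    with time_limit(15 * 60):
        for v in versions[1:]:
            upgrade_and_restart(v)
            assert_data_persistence(v)

    # Suggested alternative: only the persistence checks are bounded; time spent
    # restarting nodes is excluded, so a restart hang would not be caught.
    for v in versions[1:]:
        upgrade_and_restart(v)
        with time_limit(15 * 60):
            assert_data_persistence(v)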

matriv commented 4 months ago

retest this please