coreos / coreos-assembler

Tooling container to assemble CoreOS-like systems
https://coreos.github.io/coreos-assembler/
Apache License 2.0

src/cosalib/aliyun.py: save replication progress in meta.json #3839

Open mike-nguyen opened 1 month ago

mike-nguyen commented 1 month ago

Previously, meta.json was written only if replication succeeded in all regions; if any error occurred during replication, it was not written at all. Subsequent replication runs would then start over from scratch. That would not be a problem, except that Aliyun errors out if an image with the same name already exists, so re-running the operation on the same build was not possible without deleting the images beforehand.

Let's write meta.json after every region replication so we can keep track of the replicated regions and allow re-runs.
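
As a rough sketch of the idea (not the actual aliyun.py change), progress could be persisted after each region like this. The `replicate` and `write_meta` helpers, the `copy_to_region` callable, and the `aliyun` list of `name`/`id` entries are assumptions made for illustration:

```python
import json
import os
import tempfile


def write_meta(path, meta):
    """Write meta via a temp file plus rename so a crash never leaves a partial file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(meta, f, indent=4)
    os.replace(tmp, path)


def replicate(meta_path, regions, copy_to_region):
    """Replicate to each region, saving progress after every success.

    copy_to_region is a hypothetical callable mapping a region name to the
    ID of the image created there; it may raise on failure.
    """
    with open(meta_path) as f:
        meta = json.load(f)
    uploaded = meta.setdefault("aliyun", [])  # assumed layout: [{"name", "id"}]
    done = {entry["name"] for entry in uploaded}
    for region in regions:
        if region in done:
            continue  # already replicated by an earlier run
        image_id = copy_to_region(region)  # if this raises, earlier regions stay saved
        uploaded.append({"name": region, "id": image_id})
        write_meta(meta_path, meta)
    return meta
```

With this shape, a failure in one region leaves every earlier region recorded, so a re-run only attempts the regions that are still missing.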

mike-nguyen commented 1 month ago

/test rhcos

mike-nguyen commented 1 month ago

/test rhcos

openshift-ci[bot] commented 1 month ago

@mike-nguyen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/rhcos | ceb5404eb24f8c98cc39bf2f572c5f1bc9319738 | link | true | /test rhcos |

mike-nguyen commented 1 month ago

The CI failure looks like https://github.com/openshift/os/issues/1551.

jlebon commented 1 month ago

Hmm, it looks like this and https://github.com/coreos/fedora-coreos-pipeline/pull/1017 are trying to address https://github.com/coreos/coreos-assembler/issues/3634. But I think the proper fix is to make `ore aliyun copy-image` itself idempotent. E.g., if it finds an existing image of the same name, then it should just no-op and return the ID of the existing image. This is similar to what's done for AWS.
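
To illustrate the pattern only (this is not the ore code, and it does not use the real Aliyun SDK), an idempotent copy could look like the sketch below. `FakeClient`, `find_image_by_name`, `copy_image`, and `copy_image_idempotent` are all hypothetical names:

```python
class FakeClient:
    """Hypothetical in-memory stand-in for a per-region image API."""

    def __init__(self):
        self.images = {}  # (region, name) -> image ID
        self.copies = 0   # number of real copy operations performed

    def find_image_by_name(self, region, name):
        # Returns the image ID, or None if no image has that name in the region.
        return self.images.get((region, name))

    def copy_image(self, source_id, region, name):
        self.copies += 1
        image_id = f"m-{region}-{self.copies}"
        self.images[(region, name)] = image_id
        return image_id


def copy_image_idempotent(client, source_id, region, name):
    """Return the existing image's ID if the name is taken; otherwise copy."""
    existing = client.find_image_by_name(region, name)
    if existing is not None:
        return existing  # no-op: a previous run already created this image
    return client.copy_image(source_id, region, name)
```

Calling `copy_image_idempotent` twice with the same region and name performs a single copy and returns the same ID both times, so a re-run after a partial failure does not trip over the duplicate-name error.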

Right now, for example, if the Publish step fails in the pipeline, we're still left with missing data in meta.json.

The bit in aliyun.py and aws.py where we filter out regions that were already uploaded should be more of an optimization rather than load-bearing.

mike-nguyen commented 1 month ago

> The bit in aliyun.py and aws.py where we filter out regions that were already uploaded should be more of an optimization rather than load-bearing.

Thanks! I missed that there was more going on behind the scenes on the AWS side. I'll get the Aliyun side to function similarly.