eclipse-equinox / p2

Eclipse Public License 2.0

Try to restart failed download on installation #292

Open iloveeclipse opened 1 year ago

iloveeclipse commented 1 year ago

See https://github.com/eclipse-platform/eclipse.platform.releng.aggregator/issues/1075, especially this comment, which indicates that our automated installation often fails because of instability of the download.eclipse.org server.

If we fail to download an artifact during installation, it would be nice to retry the operation a few times. This would help us get stable SDK test publishing, without me checking and re-triggering the collectResults job every morning.
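To make the request concrete, here is a minimal sketch of retrying with an increasing delay. It uses only the JDK; `RetrySketch` and `retry` are hypothetical names for illustration and are not part of p2, and the sketch assumes that a failed download surfaces as an `IOException`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of p2: retries an operation that may fail transiently. */
final class RetrySketch {

    /**
     * Runs the operation up to maxAttempts times. After each IOException it sleeps,
     * doubling the delay every time, and rethrows the last IOException if all attempts fail.
     * Any other exception is treated as non-transient and propagates immediately.
     */
    static <T> T retry(Callable<T> operation, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated download that fails twice before succeeding.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated download failure");
            }
            return "artifact bytes";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Whether such a loop belongs in the transport, in the installation step, or in the calling job is an open question; the sketch only shows the shape of the retry.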

@merks : is this the area of p2 you may be familiar with?

merks commented 1 year ago

To me the problem

The artifact file for osgi.bundle,org.eclipse.e4.core.di,1.9.0.v20230429-1914 was not found.

isn't caused by a failure to download the artifact but a failure to find the artifact metadata for that artifact key.

One fundamental concern here is that these update sites are not necessarily "stable"

Especially the first one can be changing its contents on the fly at any point in time and I'm not sure how "atomic" those updates to the server actually are. Also, the server will often serve up cached content for a while; at least that was my past experience. So one might see a newer content.jar but an older artifacts.jar (or vice versa), causing exactly this type of problem.

I'm not sure what p2 can really retry here... It's not as if anything failed in the transport layers. It's just an inconsistency, either at the time of the requests or in what the server serves up for a short period of time.

iloveeclipse commented 1 year ago

Especially the first one can be changing its contents on the fly at any point in time

The collectResults job runs hours after we've created a new SDK build. I assume publishing to https://download.eclipse.org/eclipse/updates/4.29-I-builds/ does not take hours, so by the time the collectResults job is executed everything should be "stable".

Also, the server will often serve up cached content for a while; at least that was my past experience.

How can a "stale cache" be a problem here? What could change for artifacts that are published once? Or do you mean that some (older) artifacts are deleted while we install?

@sravanlakkimsetti, @akurtakov : how do we "maintain" https://download.eclipse.org/eclipse/updates/4.29-I-builds/ - do we have some script / job that deletes old artifacts after uploading new ones?

laeubi commented 1 year ago

I also see similar issues when an update site is being updated while a build is running, as explained by @merks . That is because it is nearly impossible to update a p2 site atomically "on the fly". What one can do is:

  1. the site itself must be a composite
  2. upload the new content
  3. add it to the composite
  4. after a while (e.g. one day) delete the old content from the composite

There is still a small chance that compositeContent.xml is updated before compositeArtifacts.xml, so that a build has already fetched one file and sees stale content in the other, but the time window is very small.
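The four steps above can be sketched as a small in-memory model. This is only an illustration of the ordering, not how any real site is maintained: `CompositeSketch`, its methods, and the `build-N` names are hypothetical, and the model assumes a child is fully uploaded before it is added to the composite.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of a composite repository's list of children. */
final class CompositeSketch {
    private final List<String> children = new ArrayList<>();

    /** Steps 2 and 3: reference a child only after its content has been uploaded. */
    void addChild(String location) {
        if (!children.contains(location)) {
            children.add(location);
        }
    }

    /** Step 4: some time later, drop the oldest children and keep the newest 'keep' entries. */
    void prune(int keep) {
        while (children.size() > keep) {
            children.remove(0);
        }
    }

    /** Returns a copy so callers cannot modify the composite directly. */
    List<String> children() {
        return new ArrayList<>(children);
    }

    public static void main(String[] args) {
        CompositeSketch composite = new CompositeSketch();
        composite.addChild("build-1");
        composite.addChild("build-2");
        // A new build is published: old children stay reachable for running builds.
        composite.addChild("build-3");
        System.out.println(composite.children());
        // Delayed cleanup removes the oldest child only.
        composite.prune(2);
        System.out.println(composite.children());
    }
}
```

The point of the ordering is that a client that resolved against an older child can still download its artifacts until the delayed cleanup runs.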

Regarding caching: lately the Eclipse servers do not behave very well with respect to caching, but that is "intentional". There is also caching in p2 itself, and if Tycho is used there is yet another layer of caching...

akurtakov commented 1 year ago

Cleaning is done by https://ci.eclipse.org/releng/job/Cleanup/job/dailyCleanOldBuilds/ . I have never dug into the topic any further, so that's all the help I can provide here.

sravanlakkimsetti commented 1 year ago

@sravanlakkimsetti, @akurtakov : how do we "maintain" https://download.eclipse.org/eclipse/updates/4.29-I-builds/ - do we have some script / job that deletes old artifacts after uploading new ones?

The contents are overwritten when you run the collectResults job.

Regarding the old builds, we have a cleanup script that deletes old builds, leaving Monday's build in https://download.eclipse.org/eclipse/downloads/

In the case of https://download.eclipse.org/eclipse/updates/4.29-I-builds/ we keep the last two successful builds as part of the composite. The build adds the new build, and cleanup is done by https://github.com/eclipse-platform/eclipse.platform.releng.aggregator/blob/master/cje-production/cleaners/cleanupNightlyRepo.sh

iloveeclipse commented 1 year ago

The contents are overwritten when you run the collectResults job.

You mean the local contents, not the I-build repo?

Cleaning is done by https://ci.eclipse.org/releng/job/Cleanup/job/dailyCleanOldBuilds/

This runs at 4 am / pm and shouldn't run at the same time the collectResults job runs / failed.

So if I see it right, the I-build repo is not "touched" during collectResults job execution, and the two other repos are "too old" to be updated by anyone in parallel. So the instability must be coming from the download.eclipse.org server.

With that, we are back to the question of whether we can do something in p2 land to handle instability of the metadata/artifacts download server during installation.
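One possible direction, sketched here purely as an assumption and not as existing p2 behaviour, is to treat a mismatch between metadata and artifact index as transient: reload both and check again before giving up. `ConsistentLoadSketch` and its methods are hypothetical names, and the two suppliers stand in for whatever loads the artifact keys required by the metadata and the keys actually listed in the artifact repository.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch, not p2 API: reload both indexes until they agree. */
final class ConsistentLoadSketch {

    /** Returns the keys that the metadata requires but the artifact index does not list. */
    static Set<String> missingKeys(Set<String> required, Set<String> available) {
        Set<String> missing = new HashSet<>(required);
        missing.removeAll(available);
        return missing;
    }

    /**
     * Reloads both sides on every attempt and returns true as soon as no key is missing.
     * Returns false if the indexes still disagree after maxAttempts.
     */
    static boolean loadConsistently(Supplier<Set<String>> metadata, Supplier<Set<String>> artifacts,
            int maxAttempts, long delayMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (missingKeys(metadata.get(), artifacts.get()).isEmpty()) {
                return true;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        Set<String> required = Collections.singleton(
                "osgi.bundle,org.eclipse.e4.core.di,1.9.0.v20230429-1914");
        int[] loads = {0};
        // Simulated server: the first response is a stale, empty artifact index.
        Supplier<Set<String>> artifacts =
                () -> loads[0]++ == 0 ? Collections.<String>emptySet() : required;
        boolean consistent = loadConsistently(() -> required, artifacts, 3, 10);
        System.out.println("consistent=" + consistent + " after " + loads[0] + " loads");
    }
}
```

This only helps if the inconsistency really is short-lived; if the server keeps serving a stale file for longer than the retry window, the installation would still fail.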