Open fao89 opened 2 years ago
From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-24T20:10:25Z
Pulp currently considers everything above a 400 status code a fatal exception. Which codes should be considered non-fatal? Is it a range of 400-499? a subset of those?
From: jsherril@redhat.com (jsherril@redhat.com) Date: 2019-09-24T20:20:32Z
I was less concerned about the error code and more concerned that a single failing file in a file repo causes the entire repository to fail to sync, which doesn't seem right. I would think it would still sync everything else?
Although I guess you could have a situation where some dependent part of another piece failed to download (such as a docker blob associated with a manifest associated with a tag), which could leave you with a broken repo. However, maybe that is more expected than this behavior.
The idea of a single piece of broken content preventing EVERYTHING else from syncing still seems wrong to me though. How did Pulp 2 handle it, I wonder?
From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-24T20:30:02Z
Pulp 2 appended an error and kept going. A similar pattern is possible in Pulp 3. The ArtifactDownloader stage needs to make use of Task.append_non_fatal_error() when an exception is raised by a download.
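For illustration, a minimal sketch of the append-and-continue pattern. This is not Pulp's code: `FakeTask`, `download`, and `download_all` are hypothetical stand-ins, and only the method name `append_non_fatal_error` is borrowed from the comment above. The real stage and Task model may look quite different.

```python
import asyncio


class FakeTask:
    """Hypothetical stand-in for a task record; not Pulp's Task model."""

    def __init__(self):
        self.non_fatal_errors = []

    def append_non_fatal_error(self, error):
        # Store a readable description instead of aborting the run.
        self.non_fatal_errors.append(f"{type(error).__name__}: {error}")


async def download(url, missing):
    """Pretend download: raises for URLs listed in `missing` (assumption)."""
    await asyncio.sleep(0)
    if url in missing:
        raise FileNotFoundError(f"404 for {url}")
    return f"content of {url}"


async def download_all(urls, task, missing=()):
    # return_exceptions=True hands failures back as values, so one bad
    # download does not cancel the others.
    results = await asyncio.gather(
        *(download(url, missing) for url in urls), return_exceptions=True
    )
    downloaded = {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            task.append_non_fatal_error(result)
        else:
            downloaded[url] = result
    return downloaded


task = FakeTask()
urls = ["a.iso", "b.iso", "c.iso"]
downloaded = asyncio.run(download_all(urls, task, missing={"b.iso"}))
```

In this sketch the two good files end up in `downloaded`, while the failure for `b.iso` is recorded on the task rather than raised.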
From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-24T21:20:02Z
From: bmbouter (bmbouter) Date: 2019-09-26T18:35:56Z
Currently the downloaders raise an exception on 404 errors, and I think those are fatal errors. I believe it's important for the sync machinery to have an exact copy of the data claimed in the RepositoryVersion. There is some retry logic already in the downloaders, but if the server responds via http response saying it doesn't have a file, we don't retry.
I expect 404s to be rare since content in the index should be available. How are you experiencing it? What is the impact if we leave this as NOTABUG or WONTFIX?
Related, we need to make the 404 error friendlier, but that would be a separate piece of work for the fatal exception handler to know about some types.
As an aside, I can't think of a use for non-fatal errors. We should remove the non-fatal exception interface because it's unused and we could add it later.
From: jsherril@redhat.com (jsherril@redhat.com) Date: 2019-09-26T20:02:52Z
bmbouter, I will think about this a bit.
Is it also a fatal error if one of the artifact downloads times out, or has a checksum mismatch?
From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-26T20:12:59Z
Checksum mismatch is a fatal error. I think the connection timeout is a fatal error also (but will have to double check).
We have some retry with backoff behavior for when we receive a 429 response code.
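A rough sketch of what retry with backoff on 429 can look like. The names `HttpError` and `fetch_with_backoff`, the retry count, and the delays are all assumptions made for illustration; the actual downloader logic in Pulp is not shown in this thread and may differ.

```python
import time


class HttpError(Exception):
    """Hypothetical error carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying only on 429 with exponential backoff."""
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except HttpError as exc:
            last_try = attempt == max_tries - 1
            if exc.status != 429 or last_try:
                raise  # other codes (e.g. 404) and exhausted retries stay fatal
            sleep(base_delay * 2 ** attempt)  # delay doubles on each retry


# Usage: a fake fetch that is rate-limited twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise HttpError(429)
    return b"payload"


result = fetch_with_backoff(flaky_fetch, "http://example.com/file", sleep=delays.append)
```

Passing `sleep` as a parameter is only there so the example can run without actually waiting.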
From: jsherril@redhat.com (jsherril@redhat.com) Date: 2019-09-26T20:17:32Z
Thinking about this a bit more. Let's say I wanted to mirror all of Ansible Galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), I can't sync anything. (This was a real thing with puppet modules, where Puppet Forge had broken modules in their repository.)
It also would mean (if I'm thinking clearly) that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.
I still think we need an option to treat these kinds of errors as non_fatal.
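To make the immediate vs. on_demand difference concrete, here is a toy model. It is only a sketch under simplifying assumptions: `sync`, `fetch`, and `MissingFile` are hypothetical names, and a dict stands in for the remote server. It does not reflect how Pulp implements either policy.

```python
class MissingFile(Exception):
    """Raised when the toy remote does not have a file listed in the index."""


def fetch(name, remote):
    """Toy fetch: `remote` maps file names to bytes (assumption)."""
    if name not in remote:
        raise MissingFile(name)
    return remote[name]


def sync(index, remote, policy):
    """Return {name: bytes or None}; None means 'fetch later'."""
    if policy == "immediate":
        # Every file is downloaded now, so one missing file aborts the sync.
        return {name: fetch(name, remote) for name in index}
    if policy == "on_demand":
        # Only the index is recorded; downloads happen when a client asks.
        return {name: None for name in index}
    raise ValueError(f"unknown policy: {policy}")


index = ["a.rpm", "b.rpm"]
remote = {"a.rpm": b"A"}  # b.rpm is listed in the index but absent

lazy_repo = sync(index, remote, "on_demand")  # succeeds despite the gap
try:
    sync(index, remote, "immediate")
    immediate_failed = False
except MissingFile:
    immediate_failed = True
```

With the same broken remote, the on_demand sync finishes and the immediate one raises, which is the asymmetry described above.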
From: bmbouter (bmbouter) Date: 2019-09-26T20:39:47Z
jsherril@redhat.com wrote:
> Thinking about this a bit more. Let's say I wanted to mirror all of Ansible Galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), I can't sync anything. (This was a real thing with puppet modules, where Puppet Forge had broken modules in their repository.)
> It also would mean (if I'm thinking clearly) that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.
Actually this got me thinking that in cases where a 404 is returned, it could fall back to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this the best default behavior?
> I still think we need an option to treat these kinds of errors as non_fatal.
From: jsherril@redhat.com (jsherril@redhat.com) Date: 2019-09-26T20:41:49Z
I think that's an interesting idea. As long as it's obvious that this problem occurred (via non-fatal errors?), and that re-syncing would cause it to re-attempt the download, I like that solution.
From: gmbnomis (gmbnomis) Date: 2019-09-26T21:59:44Z
bmbouter wrote:
> Actually this got me thinking that in cases where a 404 is returned, it could fall back to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this the best default behavior?
Just my two cents: I also thought about proposing this behavior and quickly rejected it.
If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.
From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-27T13:48:32Z
gmbnomis wrote:
> bmbouter wrote:
> > Actually this got me thinking that in cases where a 404 is returned, it could fall back to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this the best default behavior?
> Just my two cents: I also thought about proposing this behavior and quickly rejected it.
> If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.
I agree. The user can always choose to sync content on demand. Pulp should not change the policy for a subset of the content.
From: bmbouter (bmbouter) Date: 2019-09-27T14:04:56Z
gmbnomis wrote:
> bmbouter wrote:
> > Actually this got me thinking that in cases where a 404 is returned, it could fall back to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this the best default behavior?
> Just my two cents: I also thought about proposing this behavior and quickly rejected it.
> If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.
@gmbnomis, I also agree. What we have now adheres to this expectation.
Since no one uses the non_fatal exceptions part, it would also be removed, which is tracked in this issue: https://pulp.plan.io/issues/5442 Please comment on 5442 if you have thoughts on that.
From: bmbouter (bmbouter) Date: 2019-09-27T17:59:42Z
After talking with @jsherrill and @mccune we determined it's ok for 3.0 to fail when policy='immediate'. In the future (maybe 3.1) core will need to provide plugin writers the ability to have their syncs "continue". This needs to be done in coordination with core and plugins.
Since no change is needed for now I'm removing from the sprint. @jsherrill lmk if this isn't accurate.
From: bmbouter (bmbouter) Date: 2019-09-27T18:00:13Z
Moving to core to have core enable this.
From: bmbouter (bmbouter) Date: 2020-01-15T17:00:14Z
Identified as P2 at katello checkin meeting by @jsherrill
From: dalley (dalley) Date: 2021-06-25T00:14:00Z
Supporting this properly will likely require refactoring DeclarativeVersion. We need to do that eventually to better support our metadata mirroring feature, so this would be a natural thing to consider along with that work.
From: dalley (dalley) Date: 2021-06-25T00:20:12Z
I'm attaching a partial patch to streamline the process when we pick this back up
From: ipanova@redhat.com (ipanova@redhat.com) Date: 2021-06-25T09:29:53Z
adding also link to the PR so the discussions are also visible https://github.com/pulp/pulpcore/pull/1427
Author: jsherril@redhat.com (jsherril@redhat.com)
Redmine Issue: 5286, https://pulp.plan.io/issues/5286
Problem
If you sync a file repository where one of the files is missing, it seems that the repository syncs as much as it can (as expected), but it's reported as a fatal error, with a state of 'failed'.
This is especially problematic for remote repos whose maintainer you cannot contact and which have the content you want but don't have all content available (because the repo is incomplete or corrupted).
Steps to reproduce:
1) create a file repository where one of the files is missing
2) create a file remote and repository and sync them
Actual task status (apologies, it's been yaml-fied):
Solution
It would be useful to have an option that causes sync to not fail on download errors, but instead continue and record the errors as non-fatal exceptions somehow.