As a user, a Remote should provide an option that allows download errors, e.g. 404 errors to still allow creation of a RepositoryVersion - Githubissues

fao89 commented 2 years ago

Author: jsherril@redhat.com (jsherril@redhat.com)

Redmine Issue: 5286, https://pulp.plan.io/issues/5286

Problem

If you sync a file repository where one of the files is missing, it seems that the repository syncs as much as it can (as expected), but its reported as a fatal error, with a state of 'failed'.

This is especially problematic for remote repos that you cannot contact the maintainer, have the content you want, but don't have all content available (because it's an incomplete or corrupted repo).

Steps to reproduce:

1) create a file repository where one of the files is missing
2) create a file remote and repository and sync them

Actual task status (apologies its been yaml-fied):

- _href: "/pulp/api/v3/tasks/b6f9b619-c174-4e43-b546-0bbefdfb11e7/"
  _created: '2019-08-15T15:21:37.058+00:00'
  state: failed
  name: pulp_file.app.tasks.synchronizing.synchronize
  started_at: '2019-08-15T15:21:37.177+00:00'
  finished_at: '2019-08-15T15:21:37.382+00:00'
  non_fatal_errors: "[]"
  error:
    code: ''
    description: 404, message='Not Found'
    traceback: |2
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/worker.py", line 822, in perform_job
          rv = job.perform()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/job.py", line 605, in perform
          self._result = self._execute()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/rq/job.py", line 611, in _execute
          return self.func(*self.args, **self.kwargs)
        File "/usr/local/lib/pulp/src/pulp-file/pulp_file/app/tasks/synchronizing.py", line 45, in synchronize
          dv.create()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/declarative_version.py", line 169, in create
          loop.run_until_complete(pipeline)
        File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
          return future.result()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/api.py", line 209, in create_pipeline
          await asyncio.gather(*futures)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/api.py", line 43, in __call__
          await self.run()
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/artifact_stages.py", line 132, in run
          pb.done += task.result()  # download_count
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/artifact_stages.py", line 155, in _handle_content_unit
          await asyncio.gather(*downloaders_for_content)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/stages/models.py", line 78, in download
          download_result = await downloader.run(extra_data=self.extra_data)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/download/base.py", line 212, in run
          return await self._run(extra_data=extra_data)
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/backoff/_async.py", line 131, in retry
          ret = await target(*args, **kwargs)
        File "/usr/local/lib/pulp/src/pulpcore-plugin/pulpcore/plugin/download/http.py", line 183, in _run
          response.raise_for_status()
        File "/usr/local/lib/pulp/lib64/python3.6/site-packages/aiohttp/client_reqrep.py", line 942, in raise_for_status
          headers=self.headers)
  worker: "/pulp/api/v3/workers/df7e0085-b0dd-4073-b74d-9ab78ad27a03/"
  spawned_tasks: []
  progress_reports:
  - message: Downloading Metadata
    state: completed
    total: 1
    done: 1
  - message: Parsing Metadata Lines
    state: completed
    total: 2
    done: 2
  - message: Downloading Artifacts
    state: failed
    done: 0
  - message: Associating Content
    state: canceled
    done: 0
  created_resources: []
  reserved_resources_record: []
create_version: true
poll_attempts:
  total: 1
  failed: 1

Solution

It would be useful to have an option that causes sync to not fail on download errors, but instead continue and record the errors as non-fatal exceptions somehow.

fao89 commented 2 years ago

From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-24T20:10:25Z

Pulp currently considers everything above a 400 status code a fatal exception. Which codes should be considered non-fatal? Is it a range of 400-499? a subset of those?

fao89 commented 2 years ago

From: jsherril@redhat.com (jsherril@redhat.com) Date: 2019-09-24T20:20:32Z

I was less concerned about the error code and more concerned that a single file in a file repo failing causing the entire repository to fail to sync, which doesn't seem right. I would think it would still sync everything else?

Although i guess you could have a situation where some dependent part of another piece failed to download (such as a docker blob associated with a manifest associated with a tag), which could leave you with a broken repo. However, maybe that is more expected than this behavior.

The idea of a single piece of broken content preventing EVERYTHING else from syncing still seems wrong to me though. How did pulp2 handle it i wonder?

fao89 commented 2 years ago

From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-24T20:30:02Z

Pulp 2 appended and error and kept going. A similar pattern is possible in Pulp 3. The ArtifactDownloader stage needs to make use of the Task.append_non_fatal_error() when an exception is raised by a download.

fao89 commented 2 years ago

From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-24T21:20:02Z

https://github.com/pulp/pulpcore-plugin/pull/135/

fao89 commented 2 years ago

From: bmbouter (bmbouter) Date: 2019-09-26T18:35:56Z

Currently the downloaders raise an exception on 404 errors, and I think those are fatal errors. I believe it's important for the sync machinery to have an exact copy of the data claimed in the RepositoryVersion. There is some retry logic already in the downloaders, but if the server responds via http response saying it doesn't have a file, we don't retry.

I expect 404s to be rare since content in the index should be available. How are you experiencing it? What is the impact if we leave this as NOTABUG or WONTFIX?

Related, we need to make the 404 error friendlier, but that would be a separate piece of work for the fatal exception handler to know about some types.

As an aside, I can't think of a use for non-fatal errors. We should remove the non-fatal exception interface because it's unused and we could add it later.

fao89 commented 2 years ago

From: jsherril@redhat.com (jsherril@redhat.com) Date: 2019-09-26T20:02:52Z

bmbouters, i will think about this a bit,

is it also a fatal error if one of the artifact download times out, or has a checksum mismatch?

fao89 commented 2 years ago

From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-26T20:12:59Z

Checksum mismatch is a fatal error. I think the connection timeout is a fatal error also (but will have to double check).

We have some retry wiht backoff behavior for when we receive a 429 response code.

fao89 commented 2 years ago

From: jsherril@redhat.com (jsherril@redhat.com) Date: 2019-09-26T20:17:32Z

Thinking about this a bit more. Lets say i wanted to mirror all of ansible galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), i can't sync anything. (this was a real thing with puppet modules, where puppet forge had broken modules in their repository).

It also would mean (if i'm thinking clearly), that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.

I still think we need an option to treat these kind of errors as non_fatal.

fao89 commented 2 years ago

From: bmbouter (bmbouter) Date: 2019-09-26T20:39:47Z

jsherril@redhat.com wrote:

Thinking about this a bit more. Lets say i wanted to mirror all of ansible galaxy. If just one collection was missing on the filesystem (maybe because it had been pulled for a security reason), i can't sync anything. (this was a real thing with puppet modules, where puppet forge had broken modules in their repository).

It also would mean (if i'm thinking clearly), that an on_demand sync would differ from an 'immediate' sync. Meaning, on_demand would be more graceful at handling missing files.

Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?

I still think we need an option to treat these kind of errors as non_fatal.

fao89 commented 2 years ago

From: jsherril@redhat.com (jsherril@redhat.com) Date: 2019-09-26T20:41:49Z

I think thats an interesting idea. As long as its obvious that this problem occurred (via non-fatal errors?), and that re-syncing it would cause it to re-attempt the download of it, I like that solution.

fao89 commented 2 years ago

From: gmbnomis (gmbnomis) Date: 2019-09-26T21:59:44Z

bmbouter wrote:

Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

fao89 commented 2 years ago

From: dkliban@redhat.com (dkliban@redhat.com) Date: 2019-09-27T13:48:32Z

gmbnomis wrote:

bmbouter wrote:

Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

I agree. The user can always choose to sync content on demand. Pulp should not change the policy for a subset of the content.

fao89 commented 2 years ago

From: bmbouter (bmbouter) Date: 2019-09-27T14:04:56Z

gmbnomis wrote:

bmbouter wrote:

Actually this got me thinking that in cases where 404 is returned, it could fallback to a lazy config for that content unit and continue. This would give more reliability when content remotes are unreliable. Is this best default behavior?

Just my two cents: I also thought about proposing this behavior and quickly rejected it.

If I specify "immediate" policy for a sync, I expect to just get that. If something fails (be it fatal or non-fatal), my expectation is that this can only be rectified by another sync. But I don't want Pulp to be "smarter" than the policy I specified by trying to get some artifacts later when I don't expect it and can't control it.

@gmbnomis, I also agree. What we have now adheres to this expectation.

The non_fatal exceptions part would also be removed since no one uses them as part of this issue: https://pulp.plan.io/issues/5442 Please comment on 5442 if you have thoughts on that.

fao89 commented 2 years ago

From: bmbouter (bmbouter) Date: 2019-09-27T17:59:42Z

After talking with @jsherrill and @mccune we determined it's ok for 3.0 to fail when policy='immediate'. In the future (maybe 3.1) core will need to provide plugin writers the ability to have their sync's "continue". This needs to be done in coordination with core and plugins.

Since no chnage is needed for now I'm removing from the sprint. @jsherrill lmk if you this isn't accurate.

fao89 commented 2 years ago

From: bmbouter (bmbouter) Date: 2019-09-27T18:00:13Z

Moving to core to have core enable this.

fao89 commented 2 years ago

From: bmbouter (bmbouter) Date: 2020-01-15T17:00:14Z

Identified as P2 at katello checkin meeting by @jsherrill

fao89 commented 2 years ago

From: dalley (dalley) Date: 2021-06-25T00:14:00Z

Supporting this properly will likely require refactoring DeclarativeVersion. We need to do that eventually to better support our metadata mirroring feature, so this would be a natural thing to consider along with that work.

fao89 commented 2 years ago

From: dalley (dalley) Date: 2021-06-25T00:20:12Z

I'm attaching a partial patch to streamline the process when we pick this back up

fao89 commented 2 years ago

From: ipanova@redhat.com (ipanova@redhat.com) Date: 2021-06-25T09:29:53Z

adding also link to the PR so the discussions are also visible https://github.com/pulp/pulpcore/pull/1427

fao89 / pulpcore

As a user, a Remote should provide an option that allows download errors, e.g. 404 errors to still allow creation of a RepositoryVersion #75

Problem

Solution