awsdocs/aws-data-pipeline-developer-guide

The open source version of the AWS Data Pipeline documentation. To provide feedback or request changes, submit an issue in this repository, or make the proposed changes and submit a pull request.

Pipeline Components, Instances and Attempts #5

Closed: jufemaiz closed this issue 1 year ago

jufemaiz commented 6 years ago

https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-tasks-scheduled.html

The documentation currently refers to "instances", but the accompanying image uses "tasks".

Pipeline Components, Instances, and Attempts

There are three types of items associated with a scheduled pipeline:

Pipeline Components — Pipeline components represent the business logic of a pipeline and correspond to the different sections of a pipeline definition. Pipeline components specify the data sources, activities, schedule, and preconditions of the workflow. They can inherit properties from parent components, and relationships among components are defined by reference. Pipeline components define the rules of data management.

Instances — When AWS Data Pipeline runs a pipeline, it compiles the pipeline components to create a set of actionable instances. Each instance contains all the information for performing a specific task. The complete set of instances is the to-do list of the pipeline. AWS Data Pipeline hands the instances out to task runners to process.

Attempts — To provide robust data management, AWS Data Pipeline retries a failed operation, and continues to do so until the task reaches the maximum number of allowed retry attempts. Attempt objects track the attempts, their results, and the failure reasons, if applicable. In essence, an attempt is an instance with a retry counter. AWS Data Pipeline performs retries using the same resources from the previous attempts, such as Amazon EMR clusters and EC2 instances. The sketch after this list shows how to inspect all three object types.
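All three object types are visible through the low-level API, which groups them into the query spheres COMPONENT, INSTANCE, and ATTEMPT. Here is a minimal sketch using boto3; the pipeline ID is hypothetical, and @status is just one example of a runtime field that an instance or attempt may carry.

```python
import boto3

client = boto3.client("datapipeline")
pipeline_id = "df-0123456789ABCDEFGHIJ"  # hypothetical pipeline ID

# Components, instances, and attempts map to the three query "spheres".
for sphere in ("COMPONENT", "INSTANCE", "ATTEMPT"):
    ids = client.query_objects(pipelineId=pipeline_id, sphere=sphere)["ids"]
    if not ids:
        continue
    # DescribeObjects accepts at most 25 object IDs per call.
    described = client.describe_objects(
        pipelineId=pipeline_id, objectIds=ids[:25]
    )["pipelineObjects"]
    for obj in described:
        # Each object's fields arrive as key/value pairs.
        fields = {f["key"]: f.get("stringValue") for f in obj["fields"]}
        print(sphere, obj["id"], fields.get("@status"))
```

Running this against an executed pipeline makes the compilation step visible: each component typically fans out into one or more instances, and each instance into one or more attempts.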

Note

Retrying failed tasks is an important part of a fault tolerance strategy, and AWS Data Pipeline definitions provide conditions and thresholds to control retries. However, too many retries can delay detection of an unrecoverable failure, because AWS Data Pipeline does not report failure until it has exhausted all the retries that you specify. The extra retries may also accrue additional charges if they run on AWS resources. As a result, carefully consider whether it is appropriate to exceed the AWS Data Pipeline defaults that control retries and related settings.
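Those thresholds are set per component in the pipeline definition. As a minimal sketch (the pipeline ID, activity name, command, and worker group below are hypothetical placeholders), a definition uploaded via boto3 might bound retries with the maximumRetries and retryDelay fields:

```python
import boto3

client = boto3.client("datapipeline")
pipeline_id = "df-0123456789ABCDEFGHIJ"  # hypothetical pipeline ID

definition = [
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
        ],
    },
    {
        "id": "MyActivity",  # hypothetical activity
        "name": "MyActivity",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo hello"},
            {"key": "workerGroup", "stringValue": "myWorkerGroup"},
            # Bound retries explicitly rather than relying on the default,
            # so an unrecoverable failure surfaces sooner.
            {"key": "maximumRetries", "stringValue": "1"},
            {"key": "retryDelay", "stringValue": "10 Minutes"},
        ],
    },
]

client.put_pipeline_definition(
    pipelineId=pipeline_id, pipelineObjects=definition
)
```

Lowering maximumRetries surfaces unrecoverable failures sooner, at the cost of less resilience to transient errors.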

Data Pipeline Object Types

Ref: https://github.com/awsdocs/aws-data-pipeline-developer-guide/blob/master/doc_source/dp-how-tasks-scheduled.md

joshbean commented 1 year ago

Closing this issue or pull request in advance of archiving this repo. For more information about the decision to archive this repo (and others in the 'awsdocs' org), see the announcement on the AWS News Blog.